Mnesia with Memento lost all records for all tables

Yes, I know what they use, and I just pay for the ones that have them; but I am paying for a service, so I am not worried about how long they will last, provided that they are replaced before it causes data corruption in my application :wink:

That's unfortunate :frowning:

The first thing that comes to mind is that the node name changed, so the schema path changed. Like running
iex -S mix
then running
iex --sname test -S mix
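A quick way to check is to look at what directory Mnesia resolves for the current node name (just a sketch; the node names and paths below are only examples):

# Mnesia's on-disk directory defaults to "Mnesia.<node name>" under the
# current working directory, so a different node name points at a
# different (empty) schema. Names and paths here are only examples.
node()
#=> :nonode@nohost   (plain `iex -S mix`)
#=> :test@mylaptop   (`iex --sname test -S mix`)

:mnesia.system_info(:directory)
#=> '/path/to/app/Mnesia.nonode@nohost'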

The idea behind Mnesia is that you MUST run it in at least a 2-node configuration, as all writes go to both nodes; if 1 node has a power problem / CTRL+C+C, the other node will still receive the writes. (So 3 nodes total: your app + 2 Mnesia nodes.) If you trip over the power cord to the rack hosting both nodes you are screwed and will lose (a lot of) data.
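Roughly, setting that up looks like this (a sketch only; the node names are made up):

# Sketch of a minimal 2-node disc_copies setup. Node names are examples;
# :mnesia.create_schema/1 must run before Mnesia is started on those nodes.
nodes = [:"a@host1", :"b@host2"]

:ok = :mnesia.create_schema(nodes)
{_, []} = :rpc.multicall(nodes, :mnesia, :start, [])

{:atomic, :ok} =
  :mnesia.create_table(:my_table,
    attributes: [:id, :value],
    disc_copies: nodes
  )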

Mnesia has no write-ahead log (WAL) like (all?) modern databases; it uses a log, but it's just an in-RAM log that gets periodically flushed to disk. (This is why the rocksdb backend for Mnesia is not a real solution to me, it doesn't add a WAL.)

There are ways around it: you can set dump_log_time_threshold to 1 millisecond, so Mnesia will dump the log every 1 ms, or set dump_log_write_threshold to 1 write (the default is 1000, I think). But dumping this log is very CPU intensive and your tables will lock up while the dump is done; large tables will be rendered unwritable.

If you want to run Mnesia in a single local node configuration (just to store some persistent state for your app in a simple way), I'd run it with dump_log_write_threshold set to 1 write.
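For example (a sketch, assuming Mnesia picks the settings up from the application environment set in config/config.exs before :mnesia starts):

# In config/config.exs (after `import Config`). A sketch: tighten Mnesia's
# log dump thresholds; they must be set before :mnesia is started.
config :mnesia,
  dir: 'priv/mnesia',                # example directory, must be a charlist
  dump_log_write_threshold: 1,       # dump the log after every write (default 1000)
  dump_log_time_threshold: 1_000     # or at least every second (default 3 minutes)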

2 Likes

I am a bit late to the party here, but I think this is a wise idea. We tried to use :mnesia early on at CargoSense and while it isn’t intrinsically bad, it is so very easy to shoot yourself in the foot with it, and that isn’t even touching node split situations. :mnesia just doesn’t really reflect the set of trade offs for a datastore that I think people generally want in production, certainly for any kind of canonical data store.

1 Like

That was definitely not the case. Nothing at all has changed, not even the code.

If Mnesia cannot be used reliably with one single node, it shouldn't be allowed to even start, or at least the docs should be very clear about that.

Yes, I saw in my research that doing so will render it unusable :frowning:

I am still playing with some scripts on my laptop, but for now it seems that we have an underlying issue somewhere in the BEAM. Each time I use the scripts I have included in previous posts to write directly to the filesystem I lose data, because the BEAM seems to have a delay of around 2 seconds on my laptop before actually writing the data to the disk, even when I tell Ubuntu not to cache writes.

Another possibility is that my laptop's SSD firmware is cheating when Ubuntu has the write cache disabled and is ignoring that setting.

The BEAM indeed has a default of 2 seconds to write data to the disk when the file is opened with the defaults for delayed_write, as per the docs:

delayed_write

The same as {delayed_write, Size, Delay} with reasonable default values for Size and Delay (roughly some 64 KB, 2 seconds).

Now I just need to do some tests with a simple Elixir script and see if this "insane" default can be tuned with {delayed_write, Size, Delay} from 2 seconds down to something like 2 or 20 ms, because 2 seconds is an eternity and a lot of data can be lost.
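Something like this is what I plan to test (just a sketch; the 64-byte / 20 ms values are only a first guess):

# Sketch: shrink the delayed_write window from the default ~64 KB / 2 s
# to 64 bytes / 20 ms (values are just a first guess to experiment with).
{:ok, fd} = File.open("test.txt", [:append, :binary, {:delayed_write, 64, 20}])
:ok = IO.binwrite(fd, "some data\n")
:ok = File.close(fd)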

No idea then.

Yeah, Mnesia needs a makeover for 2020 (kind of like how :pg and :ets got a makeover); I think it's a very powerful part of the Erlang ecosystem.

I was about to answer that it's near impossible to lose data (if no power cycle occurred, and the filesystem + version is stable and tested) once the write syscall got called and returned successfully. But if you're using delayed_write you will lose data if the app dies, yeah. I found 8 ms is pretty sane if spamming writes, and 2 ms even works well. But again it's not a solution; ideally you don't want to lose ANY data once the write call returns, because other parts of your code start executing.

1 Like

Oh, you can still lose data, because the operating system also uses a write cache, unless you disable it and the SSD firmware does not ignore that setting and do some caching of its own before the actual write to disk.


I am not using it directly, but I guess that Mnesia is using it under the hood; I have not looked into the Mnesia core code though.

At some point you need to accept that you may lose data in a catastrophic failure, but I don't want that happening due to my code or due to the way the BEAM works, unless I am explicitly accepting the risk in exchange for write speed. As it stands now I can lose data because of the default settings of the BEAM, plus the defaults of the operating system.

What I need to find is the balance between all the bits involved :wink:
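On the BEAM side, one thing I am looking at is forcing a sync after each critical write (a sketch; whether the drive firmware honours the flush is a separate question):

# Sketch: push the data out of the BEAM and ask the OS to flush it to the
# device after each critical write. Whether the firmware honours the flush
# is a separate question.
{:ok, fd} = File.open("test.txt", [:append, :binary])
:ok = IO.binwrite(fd, "critical data\n")
:ok = :file.sync(fd)   # or :file.datasync(fd) to skip flushing metadata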

The answer is no, in practice. Let's clear up the assumptions before going forward.

#1 The drive in question is 100% to spec (meaning it doesn't claim a feature it really does not have; this is common on consumer drives).
#2 If HDDs are used (spinning rust), they are behind a hardware RAID controller with a battery-backed write cache. I don't think any dedicated server providers exist these days that don't provide a hardware RAID controller.
#3 If SSDs or NVMe drives are used, they have supercapacitors and are to spec (see #1).
#4 The filesystem is XFS (or EXT4 in data=writeback mode).

The XFS filesystem does not have an OS write cache; it writes directly to the disk cache. In the case of a power outage, AND a failed battery (that the IPMI/SMART tools did not pick up) or supercapacitor, you can lose data. But the chances of everything failing up to that point are much, much slimmer.

Interesting. It seems DETS uses a delayed_write cache of

-define(DEFAULT_CACHE, {3000, 14000}). % cache_parms()

by default. So disc_only_copies, I think, uses DETS under the hood too.

Yeah, it's annoying to deal with, and the smart-ass answer is "run a 99990-node cluster, that way you won't lose data", but that's not a real answer.

I guess it comes down to: do you need a KV store, or a full (relational) database? And if your answer is the latter, why do you need a relational database? Most people jump to relational databases without a good reason, simply because "the internet told me to use POSTGRES".

1 Like

No, it uses disk_log.

I am with you. I really think that saying in the docs that Mnesia is ACID, when it can only theoretically achieve that in distributed mode and when you don't have a network split, is stretching the ACID definition a little beyond what it really is.

People normally go with the flow, be it in software development or in other disciplines.

In my case I am trying to write an app that only uses what is included in the BEAM, so that I have no external dependencies. I want to put it in production and see how it goes. This is more of a challenge for me than a real need to use Mnesia.

1 Like

disc_copies uses disk_log. disc_only_copies uses dets if I am not mistaken.

1 Like

Yes, I think you are correct. I keep getting confused with the naming of disk_copies and disk_only_copies :frowning:

So I have been playing with this, using a script that writes directly to the disk, without going through Mnesia:

defmodule FileIO do
  
  @moduledoc """
  iex> fd = FileIO.open! "test.txt"                             
  #PID<0.292.0>
  iex> FileIO.append!(fd, "test it") && FileIO.read!("test.txt")
  "test it\n"
  iex> FileIO.close fd                                          
  :ok
  """

  @write_mode [:append, :binary]
  # @write_mode [:append, :binary, :delayed_write]
  # @write_mode [:append, :binary, {:delayed_write, 1, 1}]

  def open!(path, write_mode \\ @write_mode) do
    File.open!(path, write_mode)
  end

  def close(file_descriptor) do
    File.close(file_descriptor)
  end

  def append!(file_descriptor, data) do
    {:ok, start_position} = :file.position(file_descriptor, :cur)

    :ok = IO.binwrite(file_descriptor, "#{data}\n")

    {:ok, end_position} = :file.position(file_descriptor, :cur)

    %{
      file_descriptor: file_descriptor,
      start_position: start_position,
      end_position: end_position,
      size: byte_size(data)
    }
  end

  def read!(path) do
    {:ok, data} = :file.read_file(path)
    data
  end
end

I can confirm that the delay of 2 seconds is indeed coming from the BEAM when the file is opened for writing with delayed_write, which defaults to a 64 KB max size or 2 seconds, as per the Erlang docs:

delayed_write

The same as {delayed_write, Size, Delay} with reasonable default values for Size and Delay (roughly some 64 KB, 2 seconds).

If in the above script I use @write_mode [:append, :binary] or @write_mode [:append, :binary, {:delayed_write, 1, 1}], I can immediately read the content of the file after I write to it and see that my last write is persisted on disk. But if I use @write_mode [:append, :binary, :delayed_write] instead, I cannot see the last write in the file unless I wait 2 seconds before reading it.

Just to recap: Mnesia, when set to disc_copies, uses the disk_log Erlang library, which opens the file with {:delayed_write, max_size, max_time}, thus opening the window to lose data.

So, the next step is to find a commit-to-disk time that is as low as possible without hurting throughput too much, favouring consistency over write speed. When I find a good value that I am happy with, I will configure Mnesia with it and see if it can cope well under load in the same way as my script above.
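For now I am thinking of something along these lines, using only :timer.tc from the standard library (a rough sketch; FileIOBench and its labels are just placeholders, and FileIO is the module from my script above):

# Sketch: compare the write modes by timing a burst of appends with each one.
defmodule FileIOBench do
  @modes %{
    "no delayed_write"        => [:append, :binary],
    "delayed_write 1 B/1 ms"  => [:append, :binary, {:delayed_write, 1, 1}],
    "delayed_write defaults"  => [:append, :binary, :delayed_write]
  }

  def run(writes \\ 10_000) do
    for {label, mode} <- @modes do
      path = "bench_#{:erlang.unique_integer([:positive])}.txt"
      fd = FileIO.open!(path, mode)

      {microseconds, _} =
        :timer.tc(fn ->
          for n <- 1..writes, do: FileIO.append!(fd, "row #{n}")
        end)

      FileIO.close(fd)
      File.rm!(path)

      IO.puts("#{label}: #{writes} writes in #{div(microseconds, 1000)} ms")
    end
  end
end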

Any recommendations on how to benchmark my script and Mnesia?

Inspired by some of the discussions on this thread, I decided to play around with different failure modes with mnesia distribution: https://github.com/keathley/breaking_mnesia. I have a large test file that tries to explain my thought process. A cynical person could probably claim that these “failures” are either “working as intended” or “operator error”. My point wasn’t to try to say that mnesia is broken. My goal was to demonstrate the ways that Mnesia might be surprising if you’re expecting it to provide certain guarantees.

6 Likes

Many thanks for this repo. Awesome work :heart:

From your repo:

RabbitMQ seems to be developing an implementation of Mnesia with Raft consensus:

From their intro:

This is an attempt to implement a highly consistent Mnesia transaction implementation on top of the Raft consensus algorithm.

The goal of this project is to remove Mnesia’s clustering and distribution layer due to it’s highly opinionated and flawed approach to recovery from network partitions.

This project is created for RabbitMQ and will mostly target features required by the RabbitMQ.

Have you already tried it out?

Nope. I knew they were working on it. But Mnesia has such a narrow use case that I haven't reached for it in my work. Maybe if you only have one node and know that you'll only ever have one node? Adding consensus seems interesting, but it's worth noting that you'll also give up a lot of performance. Still, it's probably worth it in general.

At work, we run a few different clustered applications, but in each case we’ve been able to get away with ETS and other processes for in-memory operations with postgres and other datastores for persistence. But, like I said, having a less surprising solution to these issues would be interesting.

If someone wanted to try out those same tests with mnevis it should be pretty straightforward.

1 Like

Well, that was my idea until I lost all my data. I was thinking of using only one node and having its folder backed up for disaster recovery.

Another approach that I will take, independent of the database I choose, is to use an append-only log that will be replayed when something goes wrong, like the database getting corrupted. So the append-only log will act as the source of truth for anything that happens in the application, like it's done in Event Sourcing.
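Something along these lines, using :disk_log from the standard library (just a sketch of the idea; the module and function names are made up, and I sync after every event so it hits the file before the call returns):

# Sketch: an append-only event log on top of :disk_log (stdlib).
# log/2 + sync/1 so each event is flushed before we move on;
# replay/2 folds over all chunks to rebuild state.
defmodule EventLog do
  @log :app_event_log

  def open(path) do
    case :disk_log.open(name: @log, file: String.to_charlist(path)) do
      {:ok, @log} -> :ok
      {:repaired, @log, _recovered, _bad_bytes} -> :ok
    end
  end

  def append(event) do
    :ok = :disk_log.log(@log, event)
    :ok = :disk_log.sync(@log)
  end

  def replay(fun, acc) do
    do_replay(:disk_log.chunk(@log, :start), fun, acc)
  end

  defp do_replay(:eof, _fun, acc), do: acc

  defp do_replay({continuation, terms}, fun, acc) do
    acc = Enum.reduce(terms, acc, fun)
    do_replay(:disk_log.chunk(@log, continuation), fun, acc)
  end
end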

For my use case I want consistency over performance. Choosing Mnesia is just about having an app without external dependencies, not about its read and write performance.

Before I found Mnevis I was thinking of just wrapping :ets with persistence into a distributed append-only log, but now I am more inclined to use Mnevis.

I totally second you here :slight_smile:

I will run them with Mnevis when I have a chance to try it out. Do you want a pull request into your repo with the Mnevis tests?

Isn’t this basically what disc_copies does in Mnesia? It stores the database in ets, then builds a disk_log of transactions to restore it on restart.

1 Like

Well, I read that Mnesia tries to use the disk log when it restarts, and in a distributed system it can be the remote one or the local one, but the fact is that Mnesia wiped out my entire log on disk in a single-node configuration. I think this may have occurred when I started it after a ctrl + c + c, and at the time it was killed the application was not being used. So maybe the log files were not closed properly, and when Mnesia started it failed to repair them and just gave me a brand-new empty log :rage:

At this point I lost all my confidence in Mnesia and I have serious doubts I will recover it, unless I am able to replicate my issue and understand exactly what was the cause of it.

Totally fair, of course. I'm wondering if you ran afoul of a failed auto-repair on the next startup. That's my current best guess, but I could be way off.

Mnesia should never drop my files if it couldn't repair them; they should have been renamed so that I could try to recover them.

Whatever happened shouldn’t have occurred, but I am not able to reproduce it :frowning: