Mnesia with Memento lost all records for all tables

No, not in the sense of a traditional RDBMS. If you use sync_transaction you know that multiple nodes have received and handled the transaction, but you still don’t know if they have flushed it to disk. A server room outage could make you lose data here as well. But then again, if you lose enough disks in a RAID you lose data too.
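
For illustration, a minimal sketch of that guarantee (the :user table and record shape are hypothetical): sync_transaction/1 waits until all active replicas have committed and logged the transaction, but not until they have fsynced it.

# Sketch: waits for all replicas to log the transaction, not to fsync it.
{:atomic, :ok} =
  :mnesia.sync_transaction(fn ->
    :mnesia.write({:user, 1, "alice"})
  end)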


Could disabling the Linux write cache help here?

Not all system’s belong to the same “turn-on write-back caching” recommendation group as write-back caching caries a risk of data loss in the event such as power failure etc. In the event of power failure, data residing in the hard drive’s cache do not get a chance to be stored and a lost. This fact is especially important for database system. In order to disable write-back caching set write-caching to 0:

# hdparm -W0 /dev/sda

/dev/sda:
setting drive write-caching to 0 (off)
write-caching =  0 (off)

For modern SSDs it will be either:

  • disastrous performance degradation, or
  • OK performance because the SSD vendor cheats, and it is still not durable

It is better to have an explicit cache flush passed down through the stack.
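
From BEAM code, the closest thing to an explicit flush is :file.sync/1 (or :file.datasync/1); whether the device actually honors it is up to the firmware. A minimal sketch:

# Sketch: write a record and ask the OS to flush its buffers to the device.
{:ok, fd} = :file.open("durable.log", [:append, :raw, :binary])
:ok = :file.write(fd, "important record\n")
:ok = :file.sync(fd)
:ok = :file.close(fd)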


Here I was mentioning the possibility of disabling the write cache in the context of a server, not on our personal computers.

It would be nice if we could do this from our code :wink:

EDIT:

From the Erlang bug report ERL-831 it seems that :mnesia.sync_log() kind of does the trick:

Calling mnesia:sync_log after the transaction does force an fsync and demonstrates the expected durability on the restart of mnesia. The documentation (Erlang -- mnesia) does imply that a window exists where data may not be durable. That detail seems buried in the documentation where someone may only notice it after discovering durability issues the hard way. It also seems to offer a narrower set of cases where data may be lost. In addition to power loss, the above scripts demonstrate that it can also happen when a process terminates normally. Though I have not tested it, I assume there are also crash scenarios where the same behavior would occur.

The link to the sync_log docs also reveals that the dump_log function can be used to trigger a user dump to disc. I don’t understand what the differences between the two are, but I can confirm that both solve the issue I observe in the write.exs and read.exs scripts I shared in post #14.
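
In Elixir that would look roughly like this (the :user table is hypothetical; per the bug report, sync_log/0 fsyncs the transaction log, while dump_log/0 additionally dumps it into the table files):

# Sketch: force the just-committed transaction onto disk.
{:atomic, :ok} =
  :mnesia.transaction(fn -> :mnesia.write({:user, 1, "alice"}) end)

:ok = :mnesia.sync_log()        # fsync the transaction log
# or, alternatively:
:dumped = :mnesia.dump_log()    # user-initiated dump of the log to the table files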

Servers use SSDs too. For any decent database-like workload an SSD is a must. And yes, some vendors do cheat in their usage of write buffers.


Yes, I know they do, and I just pay for the ones that have them. But I am paying for a service, so I am not worried about how long they will last, provided that they are replaced before they cause data corruption in my application :wink:

That’s unfortunate :frowning:

The first thing that comes to mind is that the node name changed, so the schema path changed (sketched below). Like running
iex -S mix
then running
iex --sname test -S mix
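
For context, the default Mnesia directory embeds the node name, so the two invocations would look for the schema in different places (paths are illustrative):

# Sketch: the schema directory depends on the node name.
:mnesia.system_info(:directory)
#=> ~c"/path/to/app/Mnesia.nonode@nohost"   (under iex -S mix)
#=> ~c"/path/to/app/Mnesia.test@hostname"   (under iex --sname test -S mix)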

The idea behind Mnesia is that you MUST run it in at least a 2-node configuration, as all writes go to both nodes; if one node has a power problem / CTRL+C+C, the other node will still receive the writes. (So 3 nodes total: your app + a 2-node Mnesia.) If you trip over the power cord to the rack hosting both nodes, you are screwed and will lose (a lot of) data.

Mnesia has no write-ahead log like (all?) modern databases; it uses a log, but it’s just an in-RAM log that gets periodically flushed to disk. (This is why the rocksdb backend for Mnesia is not a real solution to me: it doesn’t add a WAL.)

There are ways around it: you can set dump_log_time_threshold to 1 millisecond, so Mnesia will dump the log every 1 ms, or set dump_log_write_threshold to 1 write (the default is 1000, I think). But dumping this log is very CPU intensive and your tables will lock up while the dump is done; large tables will be rendered unwritable.

If you want to run Mnesia in a single local node configuration (just to store some persistent state for your app in a simple way), I’d run it with dump_log_write_threshold set to 1 write.
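
For reference, a minimal sketch of how this can be set from Elixir before Mnesia starts, using the documented dump_log_write_threshold / dump_log_time_threshold parameters (values illustrative):

# Sketch: must be set before :mnesia.start/0 reads its application env.
Application.put_env(:mnesia, :dump_log_write_threshold, 1)      # dump after every write
Application.put_env(:mnesia, :dump_log_time_threshold, 1_000)   # or at most every second
:ok = :mnesia.start()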


I am a bit late to the party here, but I think this is a wise idea. We tried to use :mnesia early on at CargoSense and while it isn’t intrinsically bad, it is so very easy to shoot yourself in the foot with it, and that isn’t even touching node split situations. :mnesia just doesn’t really reflect the set of trade offs for a datastore that I think people generally want in production, certainly for any kind of canonical data store.


That was for sure not the case. Nothing at all had changed, not even the code.

If Mnesia cannot be used reliably with one single node, it shouldn’t be allowed to even start, or at least the docs should be very clear about that.

Yes, I saw in my research that doing so will render it unusable :frowning:

I am still playing with some scripts on my laptop, but for now it seems that we have an underlying issue somewhere in the BEAM: each time I use the scripts I included in previous posts to write directly to the filesystem, I lose data, because the BEAM seems to delay around 2 seconds on my laptop before actually writing the data to disk, even when I tell Ubuntu not to cache writes.

Another possibility is that my laptop’s SSD firmware cheats when Ubuntu has the write cache disabled and ignores the setting.

The BEAM indeed has a default of 2 seconds to write data to disk when the file is opened with the defaults for delayed_write, as per the docs:

delayed_write

The same as {delayed_write, Size, Delay} with reasonable default values for Size and Delay (roughly some 64 KB, 2 seconds).

Now I just need to do some tests with a simple Elixir script and see if this “insane” default can be tuned with {delayed_write, Size, Delay} from 2 seconds down to something like 2 or 20 ms, because 2 seconds is an eternity and a lot of data can be lost.
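
For example (a sketch; the 64 KB buffer size is kept from the default, and the 20 ms delay is just an illustrative value):

# Sketch: keep the 64 KB buffer but flush after at most 20 ms.
{:ok, fd} = File.open("test.txt", [:append, :binary, {:delayed_write, 64_000, 20}])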

No idea then.

Yeah, Mnesia needs a makeover for 2020 (kind of how :pg and :ets got a makeover); I think it’s a very powerful part of the Erlang ecosystem.

I was about to answer that it’s near impossible to lose data (if no power cycle occurred, and the filesystem + version is stable and tested) once the write syscall got called and returned successfully. But if you’re using delayed_write, you will lose data if the app dies, yeah. I found 8 ms is pretty sane if spamming; even 2 ms works well. But it’s not a solution either: ideally you don’t want to lose ANY data once the write call returns, because other parts of your code start executing.


Oh, you can still lose data, because the operating system also uses a write cache, unless you disable it and the SSD firmware does not ignore that setting and add its own caching before doing the actual write to disk.


I am not using it directly, but I guess that Mnesia is using it under the hood; I have not looked into the Mnesia core code.

At some point you need to accept that you may lose data in a catastrophic failure, but I don’t want that happening due to my code or due to the way the BEAM works, unless I am explicitly accepting the risk in exchange for write speed. As it stands now, I can lose data because of the default settings of the BEAM plus the defaults of the operating system.

What I need to find is the balance between all the bits involved :wink:

The answer is no, not in practice. Let’s lay out assumptions before going forward.

#1 The drive in question is 100% to spec (meaning it does not claim a feature it really does not have; this is common on consumer drives).
#2 If HDDs are used (spinning rust), they are behind a hardware RAID controller with a battery-backed write cache. I don’t think any dedicated server providers exist these days that don’t provide a hardware RAID controller.
#3 If SSDs or NVMe drives are used, they have supercapacitors, to spec (#1).
#4 XFS filesystem (or EXT4 in data=writeback mode).

The XFS filesystem does not have an OS write cache; it writes directly to the disk cache. In the case of a power outage AND a failed battery (that the IPMI/smart tools did not pick up) / supercapacitor, you can lose data. But the chances that everything fails up to this point are much, much slimmer.

Interesting. It seems DETS uses a delayed_write of

-define(DEFAULT_CACHE, {3000, 14000}). % cache_parms()

by default. So disc_only_copies uses DETS under the hood too, I think.

Yeah, it’s annoying to deal with, and the smart-ass answer is to run a 99990-node cluster so that you won’t lose data, but that’s not a real answer.

I guess it comes down to whether you need a KV store or a full (relational) database. And if your answer is the latter, why do you need a relational database? Most people jump to relational databases without a good reason, simply because “the internet told me to use POSTGRES”.


No, it uses disk_log.

I am with you. I really think that saying in the docs that Mnesia is ACID, when it can only theoretically achieve that in distributed mode and when you don’t have a network split, is stretching the ACID definition beyond what it really is.

People normally go with the flow, be it in software development or in other disciplines.

In my case I am trying to write an app that only uses what is included in the BEAM, so that I have no external dependencies. I want to put it in production and see how it goes. This is more of a challenge for me than a real need to use Mnesia.


disc_copies uses disk_log. disc_only_copies uses dets if I am not mistaken.


Yes, I think you are correct. I keep getting confused by the naming of disc_copies and disc_only_copies :frowning:

So I have been playing with this, using a script that writes directly to the disk, without going through Mnesia:

defmodule FileIO do
  
  @moduledoc """
  iex> fd = FileIO.open! "test.txt"                             
  #PID<0.292.0>
  iex> FileIO.append!(fd, "test it") && FileIO.read!("test.txt")
  "test it\n"
  iex> FileIO.close fd                                          
  :ok
  """

  @write_mode [:append, :binary]
  # @write_mode [:append, :binary, :delayed_write]
  # @write_mode [:append, :binary, {:delayed_write, 1, 1}]

  def open!(path, write_mode \\ @write_mode) do
    File.open!(path, write_mode)
  end

  def close(file_descriptor) do
    File.close(file_descriptor)
  end

  def append!(file_descriptor, data) do
    {:ok, start_position} = :file.position(file_descriptor, :cur)

    :ok = IO.binwrite(file_descriptor, "#{data}\n")

    {:ok, end_position} = :file.position(file_descriptor, :cur)

    %{
      file_descriptor: file_descriptor,
      start_position: start_position,
      end_position: end_position,
      size: byte_size(data)
    }
  end

  def read!(path) do
    {:ok, data} = :file.read_file(path)
    data
  end
end

I can confirm that the delay of 2 seconds is indeed coming from the BEAM when the file is opened for writing with delayed_write, which defaults to a 64 KB max buffer or 2 seconds, as per the Erlang docs:

delayed_write

The same as {delayed_write, Size, Delay} with reasonable default values for Size and Delay (roughly some 64 KB, 2 seconds).

If in the above script I use @write_mode [:append, :binary] or @write_mode [:append, :binary, {:delayed_write, 1, 1}], I can immediately read the content of the file after I write to it and see that my last write is persisted on disk; but if I instead use @write_mode [:append, :binary, :delayed_write], I cannot see the last write in the file unless I wait 2 seconds before reading it.

Just to recap: Mnesia, when set to disc_copies, uses the disk_log Erlang library, which opens the file with {:delayed_write, max_size, max_time}, thus opening a window to lose data.

So, the next step is to find a commit-to-disk time that is as low as possible without hurting throughput too much, favoring consistency over write speed. When I find a good value that I am happy with, I will configure Mnesia with it and see if it copes well under load, in the same way as my script above.

Any recommendations for benchmarking my script and Mnesia?
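
Without external dependencies, a crude first pass could be :timer.tc/1 around a write loop using the FileIO module above (the iteration count and payload are arbitrary):

# Crude sketch: time 10_000 appends for each write mode.
modes = [
  [:append, :binary],
  [:append, :binary, :delayed_write],
  [:append, :binary, {:delayed_write, 64_000, 20}]
]

for mode <- modes do
  fd = FileIO.open!("bench.txt", mode)

  {micros, _} =
    :timer.tc(fn ->
      for _ <- 1..10_000, do: FileIO.append!(fd, "payload")
    end)

  FileIO.close(fd)
  IO.puts("#{inspect(mode)} -> #{micros / 1000} ms")
end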

Inspired by some of the discussions on this thread, I decided to play around with different failure modes with mnesia distribution: https://github.com/keathley/breaking_mnesia. I have a large test file that tries to explain my thought process. A cynical person could probably claim that these “failures” are either “working as intended” or “operator error”. My point wasn’t to try to say that mnesia is broken. My goal was to demonstrate the ways that Mnesia might be surprising if you’re expecting it to provide certain guarantees.


Many thanks for this repo. Awesome work :heart:

From your repo:

RabbitMQ seems to be developing an implementation of Mnesia with Raft consensus:

From their intro:

This is an attempt to implement a highly consistent Mnesia transaction implementation on top of the Raft consensus algorithm.

The goal of this project is to remove Mnesia’s clustering and distribution layer due to its highly opinionated and flawed approach to recovery from network partitions.

This project is created for RabbitMQ and will mostly target features required by RabbitMQ.

Did you already try it out?

Nope. I knew they were working on it. But Mnesia has such a narrow use case that I haven’t reached for it in my work. Maybe if you only have one node and know that you’ll only ever have one node? Adding consensus seems interesting, but it’s worth noting that you’ll also give up a lot of performance. Still, it’s probably worth it in general.

At work, we run a few different clustered applications, but in each case we’ve been able to get away with ETS and other processes for in-memory operations, with Postgres and other datastores for persistence. But, like I said, having a less surprising solution to these issues would be interesting.

If someone wanted to try out those same tests with mnevis it should be pretty straightforward.
