Unexpected behaviour with mnesia - not actually durable?

I’m observing behavior with mnesia that I didn’t expect, and I could use some help understanding what’s going on. I’ve tested this with erlang 21.0.1 and 21.2.2, and elixir 1.7.4.

Given two scripts:

write.exs

IO.puts "Creating schema..."
:ok = :mnesia.create_schema([node()])
IO.puts "Starting mnesia..."
:ok = :mnesia.start()
IO.puts "Creating table..."
{:atomic, :ok} = :mnesia.create_table(:example, [{:disc_only_copies, [node()]}, type: :set, attributes: [:name, :value]])
IO.puts "Setting :foo = 1..."
{:atomic, :ok} = :mnesia.transaction(fn -> :mnesia.write({:example, :foo, 1}) end)
IO.puts "Looking up :foo..."
:mnesia.transaction(fn -> :mnesia.read({:example, :foo}) end) |> IO.inspect
IO.puts "Done."

read.exs

IO.puts "Starting mnesia..."
:ok = :mnesia.start()
IO.puts "Waiting until table is loaded..."
:mnesia.wait_for_tables([:example], 1_000)
IO.puts "Looking up :foo..."
:mnesia.transaction(fn -> :mnesia.read({:example, :foo}) end) |> IO.inspect
IO.puts "Done."

When I run these scripts:

rm -rf Mnesia.nonode@nohost; mix run write.exs; mix run read.exs

I observe the following:

Creating schema...
Starting mnesia...
Creating table...
Setting :foo = 1...
Looking up :foo...
{:atomic, [{:example, :foo, 1}]}
Done.
Starting mnesia...
Waiting until table is loaded...
dets: file "Mnesia.nonode@nohost/example.DAT" not properly closed, repairing ...
Looking up :foo...
{:atomic, []}
Done.

The primary behavior that is unexpected to me is that writing to a disc-only table inside a transaction doesn’t seem to ensure that the record is actually written to disc before the transaction returns, i.e. it’s not actually durable. I do observe the dets warning that the table file was not properly closed, which I presume is because the script exited without stopping mnesia (a normal exit, but abrupt from mnesia’s point of view).

If I explicitly call :mnesia.stop before the end of the write.exs script, the record is actually persisted and successfully read by the read.exs script. However, I expect to be able to trust that once the transaction returns, the record is safely persisted. I can include :mnesia.stop in my shutdown procedures, but that still doesn’t make the writes durable in the case where the application crashes or is halted via kill -9.

If, instead of stopping :mnesia, I sleep for a few seconds at the end of write.exs, the dets warning is still emitted but the record is successfully found. So the record does eventually get written to disc without an explicit :mnesia.stop, just not before the :mnesia.transaction call returns, which seems to me to conflict directly with what is said here: http://erlang.org/doc/apps/mnesia/Mnesia_chap4.html#durability
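For concreteness, the two variants amount to appending one of the following to the end of write.exs (a rough sketch of what I tried; the sleep duration is arbitrary):

# Variant 1: stop mnesia cleanly before the script exits -- the record is then found by read.exs.
:stopped = :mnesia.stop()

# Variant 2: leave mnesia running and just give it a few seconds before the VM exits --
# the dets warning still appears on the next start, but the record is found.
Process.sleep(5_000)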

I’ve observed the same behavior with :disc_copies and with :mnesia.sync_transaction as well, just to be thorough. The following is the output from :mnesia.system_info(:all), in case it’s useful. It’s all default values afaik. (paths have been truncated by me)

[
  access_module: :mnesia,
  auto_repair: true,
  backend_types: [:ram_copies, :disc_copies, :disc_only_copies],
  backup_module: :mnesia_backup,
  checkpoints: [],
  db_nodes: [:nonode@nohost],
  debug: :none,
  directory: 'Mnesia.nonode@nohost',
  dump_log_load_regulation: false,
  dump_log_time_threshold: 180000,
  dump_log_update_in_place: true,
  dump_log_write_threshold: 1000,
  event_module: :mnesia_event,
  extra_db_nodes: [],
  fallback_activated: false,
  held_locks: [],
  ignore_fallback_at_startup: false,
  fallback_error_function: {:mnesia, :lkill},
  is_running: :yes,
  local_tables: [:schema, :example],
  lock_queue: [],
  log_version: '4.3',
  master_node_tables: [],
  max_wait_for_decision: :infinity,
  protocol_version: {8, 3},
  running_db_nodes: [:nonode@nohost],
  schema_location: :opt_disc,
  schema_version: {2, 0},
  subscribers: [#PID<0.141.0>],
  tables: [:schema, :example],
  transaction_commits: 2,
  transaction_failures: 0,
  transaction_log_writes: 0,
  transaction_restarts: 0,
  transactions: [],
  use_dir: true,
  core_dir: false,
  no_table_loaders: 2,
  dc_dump_limit: 4,
  send_compressed: 0,
  version: '4.15.5'
]
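(For reference, the :mnesia.sync_transaction attempt was just the same write wrapped in the sync variant, roughly:)

{:atomic, :ok} = :mnesia.sync_transaction(fn -> :mnesia.write({:example, :foo, 1}) end)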

Like most other databases, Mnesia’s disc_only_copies / disc_copies write to a write-ahead log. The abrupt termination didn’t allow it to sync to disc properly, causing the error you see.

Unfortunately there is not really much you can do to avoid this situation. Even the “robust” Oracle ended up with a badly corrupted database after dripping condensation fried the server.

In general, a kill -15 should be attempted first to give the application a chance to clean up, before resorting to kill -9.
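If you do want to nudge the log towards disc yourself, mnesia exposes :mnesia.sync_log/0 (sync the local transaction log to disc) and :mnesia.dump_log/0 (dump it into the table files). A rough sketch, not a guarantee against a kill -9 at exactly the wrong moment, and it may not help for the dets file behind a disc_only_copies table:

{:atomic, :ok} = :mnesia.transaction(fn -> :mnesia.write({:example, :foo, 1}) end)
# Explicitly sync the local transaction log before relying on the write surviving a crash.
:ok = :mnesia.sync_log()
# Optionally also dump the log into the table files.
:dumped = :mnesia.dump_log()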


While I don’t recall seeing any warning, is it possible that a table that was not properly closed can cause the loss of all the records in it?

I have had this happen to me after forcing the running application to terminate with Ctrl+C Ctrl+C:

Extracted from Mnesia And The Art of Remembering | Learn You Some Erlang for Great Good!

[Screenshot of the quoted passage from the Learn You Some Erlang chapter]

Maybe @ferd can shed some light on why durability isn’t 100% guaranteed in your use case?

So this loss of data when you kill Mnesia is not due to Mnesia itself at all, but instead to how the file is opened for writing to the disk.

Please see this post, which also has a script demonstrating the same issue when writing directly to the disk:

After originally posting here, I dove into the Erlang source code and posted on the Erlang bug tracker: https://bugs.erlang.org/browse/ERL-831

I’m pretty far removed from this issue at the moment, so I may not have much further context to share, but I believe that between this thread and the one in the bug tracker, everything I observed has been captured, in case it’s useful to anyone. Reading the Erlang source code is likely to provide a lot of clarity about where any blocking actually occurs, and, echoing what has been said by a few folks in a few different places, it isn’t always what you might assume or expect with regard to durability, based on the documentation.


The issue isn’t related to any blocking in Erlang per se… As I mentioned in the post I linked previously, it’s because the BEAM defaults to a 2-second delay before writing data to disk when the file is opened with the defaults for delayed_write, as per the docs:

delayed_write

The same as {delayed_write, Size, Delay} with reasonable default values for Size and Delay (roughly some 64 KB, 2 seconds).

You can confirm this with a script that writes directly to the file system:

defmodule FileIO do
  
  @moduledoc """
  iex> fd = FileIO.open! "test.txt"                             
  #PID<0.292.0>
  iex> FileIO.append!(fd, "test it") && FileIO.read!("test.txt")
  "test it\n"
  iex> FileIO.close fd                                          
  :ok
  """

  # Default: plain append, no BEAM-side buffering -- writes reach the file immediately.
  @write_mode [:append, :binary]
  # Alternative: delayed_write with the default Size/Delay (roughly 64 KB / 2 seconds).
  # @write_mode [:append, :binary, :delayed_write]
  # Alternative: delayed_write with a 1-byte buffer and 1 ms delay, i.e. effectively no buffering.
  # @write_mode [:append, :binary, {:delayed_write, 1, 1}]

  def open!(path, write_mode \\ @write_mode) do
    File.open!(path, write_mode)
  end

  def close(file_descriptor) do
    File.close(file_descriptor)
  end

  def append!(file_descriptor, data) do
    {:ok, start_position} = :file.position(file_descriptor, :cur)

    :ok = IO.binwrite(file_descriptor, "#{data}\n")

    {:ok, end_position} = :file.position(file_descriptor, :cur)

    %{
      file_descriptor: file_descriptor,
      start_position: start_position,
      end_position: end_position,
      size: byte_size(data)
    }
  end

  def read!(path) do
    {:ok, data} = :file.read_file(path)
    data
  end
end

So, I can confirm that the delay of 2 seconds is indeed coming from the BEAM when the file is opened for writing with delayed_write. If in the above script I use @write_mode [:append, :binary] or @write_mode [:append, :binary, {:delayed_write, 1, 1}], I can read the content of the file immediately after I write to it and see that my last write is persisted on disk, but if I instead use @write_mode [:append, :binary, :delayed_write], I cannot see the last write in the file unless I wait 2 seconds before reading it.
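A quick way to see this interactively (a hypothetical iex session, assuming the FileIO module above and an arbitrary file name):

fd = FileIO.open!("delayed.txt", [:append, :binary, :delayed_write])
FileIO.append!(fd, "first write")
FileIO.read!("delayed.txt")   # => "" -- the write is still sitting in the delayed_write buffer
Process.sleep(2_500)          # wait past the roughly 2 second default flush delay
FileIO.read!("delayed.txt")   # => "first write\n" -- now visible on disk
FileIO.close(fd)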

I totally agree with you here…

In my opinion, calling Mnesia ACID, and therefore durable, is stretching the definition of ACID and durable a little. But it seems the answer is always replication, which then leads to netsplits, which then leads to data loss when recovering from them, so the definition of ACID gets stretched once again.