Can I batch write a large number of records to Mnesia at the same time?

mnesia
ets

#1

I have some code that looks like this. It takes a series of chunked data, around 10-15 chunks, where each chunk is 5000 items in length and each item has 10 attributes.

This was my attempt at saving the data into Mnesia.

  data
  |> Enum.each(fn row ->
    row_data = [
      transaction_id
      | [event_id | Enum.map(attributes, fn attr -> row[Atom.to_string(attr)] end)]
    ]

    [source | row_data]
    |> List.to_tuple()
    |> :mnesia.dirty_write()
  end)

What I am finding now is that it takes an extremely long time for the entire operation to complete, like 8-12 minutes. At first I suspected it was because Mnesia writes use transactions and all the other complicated bells and whistles.

So I changed the last bit to use :mnesia.ets/2, which bypasses the transaction and all the callbacks and writes directly to ETS. It's slightly faster… but it still finishes in minutes.

  data
  |> Stream.each(fn row ->
    row_data = [
      transaction_id
      | [event_id | Enum.map(attributes, fn attr -> row[Atom.to_string(attr)] end)]
    ]

    processed_data =
      [source | row_data]
      |> List.to_tuple()

    :mnesia.ets(fn ->
      :mnesia.dirty_write(processed_data)
    end)
  end)
  |> Stream.run()

What am I doing wrong here? I thought Mnesia/ETS is supposed to be fast? Could it be because I am iterating through 5000 items one at a time and performing the write, and Mnesia is backlogged on writes?

What's the best course of action for me here?


#2

If you simply write to stdout instead of writing the data to Mnesia, how fast is it then? And if you don't write at all?


#3

It's much faster if I don't do the write to Mnesia. What are you thinking?


#4
  • How many rows are you trying to insert? From the look of it 50_000 - 75_000 records with 10 attributes in each.
  • What is the size of the table?
  • What table copy type do you have? (ram_copies, disc_copies, disc_only_copies)
  • What type of table? (set, ordered_set, bag)
  • Is the table in use by many processes? I.e. are there lots of concurrent inserts happening at the same time?

If the above is accurate, it should not take minutes to complete. I did a quick test and inserted 100_000 rows of random data into Mnesia, and it was quick: from less than a second to 6-7 seconds, depending on the transaction type and table type.

When you are using dirty operations (such as dirty_write) you are not using transactions. It is true that transactions make things slower.

It should be about as fast as inserting into an ETS table. I suspect, though, that your code is not doing that. In particular, the dirty functions perform a dirty operation regardless of what wrapping function (ets, transaction) you are using.

I.e. both :mnesia.ets(fn -> :mnesia.dirty_write(x) end)
and :mnesia.transaction(fn -> :mnesia.dirty_write(x) end) will not do an ets operation or a transaction, but a plain dirty_write.

Instead, I would recommend that you use :mnesia.activity/3. This lets you write your code as you would with transactions and then just change the access context depending on what you need.

For example:

:mnesia.activity(:transaction, fn -> :mnesia.write(record) end, [])
:mnesia.activity(:async_dirty, fn -> :mnesia.write(record) end, [])
:mnesia.activity(:ets, fn -> :mnesia.write(record) end, [])

Also, a word of caution: the ets mode should never be used, especially not for writes. The default should be sync_transaction; it is the safest and also provides the best load handling.

It should still be fast. There is something else going on.
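To make the activity/3 access contexts concrete, here is a minimal, self-contained sketch; the table name :demo and the record shape are made-up assumptions for illustration:

```elixir
# Minimal sketch of :mnesia.activity/3 access contexts.
# The table name :demo and the record shape are illustrative assumptions.
:mnesia.start()

# Without a disc schema this creates a ram_copies table on the local node.
{:atomic, :ok} = :mnesia.create_table(:demo, attributes: [:key, :value])

# The same write under two different access contexts; only the wrapper changes:
:ok = :mnesia.activity(:sync_transaction, fn -> :mnesia.write({:demo, 1, "a"}) end, [])
:ok = :mnesia.activity(:async_dirty, fn -> :mnesia.write({:demo, 2, "b"}) end, [])

# Read back inside a transaction:
[{:demo, 1, "a"}] = :mnesia.activity(:transaction, fn -> :mnesia.read(:demo, 1) end, [])
```

Because only the first argument changes, you can benchmark :sync_transaction against :async_dirty without rewriting the body of your write code.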

If your table is disc_copies you will see log messages like this:

=WARNING REPORT==== 5-Dec-2018::18:22:08 ===
Mnesia(n1@moana): ** WARNING ** Mnesia is overloaded: {dump_log,write_threshold}

If you see this, Mnesia cannot keep up with the transaction load. This is OK for short periods of time, but if you get it constantly, bad things can happen (such as running out of memory, gradually slowing down and dying, and other things).

This can be tweaked to some degree with the dump_log_write_threshold parameter, but it will only get you so far.
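For reference, the parameter can be raised through the mnesia application environment before Mnesia is started. The value 50_000 below is an arbitrary illustration (the default is 1_000), not a tuning recommendation:

```elixir
# Must be set before :mnesia.start/0 is called.
# 50_000 is an arbitrary example value; the default is 1_000.
Application.put_env(:mnesia, :dump_log_write_threshold, 50_000)
:mnesia.start()
```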


#5

thanks for the detailed response!

  • How many rows are you trying to insert? From the look of it 50_000 - 75_000 records with 10 attributes in each.

I am trying to do 10k writes per event. There will be consecutive independent events, each coming in with about 10k records, and I have to persist them into Mnesia.

  • What is the size of the table?

The table is not that big, maybe tens or hundreds of megabytes?

  • What table copy type do you have? (ram_copies, disc_copies, disc_only_copies)

I thought Mnesia assumes ram_copies by default, unless specified otherwise?

  • What type of table? (set, ordered_set, bag)

I am using bag for the records. Does it have a performance hit because it has to check for duplicate tuples?

  • Is the table in use by many processes? I.e are there lots of concurrent inserts happening at the same time?

The table is being used by a supervised process; no other process is actively using it until the write is complete and the consumer is notified, so it can read the data back from Mnesia.

So my understanding is that the dirty_* operations forgo a bunch of checks involving data replication, sync, and checkpoints, and do not use transactions. But it is still a serialized operation? Meaning that if I had 10_000 records to insert and did it in an Enum.each function, each record has to be written and return before the next one can be written.

Even the documentation for async_dirty operations states:

Calls the Fun in a context that is not protected by a transaction. The Mnesia function calls performed in the Fun are mapped to the corresponding dirty functions. This still involves logging, replication, and subscriptions, but there is no locking, local transaction storage, or commit protocols involved. Checkpoint retainers and indexes are updated, but they are updated dirty. As for normal mnesia:dirty_* operations, the operations are performed semi-asynchronously. For details, see mnesia:activity/4 and the User’s Guide.

So it's not truly async in the sense of the write completing on its own. For example, if I were to iterate through the 10k records and do the following:

  spawn(fn -> :mnesia.dirty_write(record) end)

this would effectively spawn 10k processes, each of which would go and write to Mnesia, meaning there could be race conditions that might corrupt the dataset.

I guess what I am asking is what the difference is between these two blocks of code:

  Enum.each(data, fn row -> :mnesia.activity(:async_dirty, fn -> :mnesia.write(row) end, []) end)

vs

  Enum.each(data, fn row -> spawn(fn -> :mnesia.activity(:async_dirty, fn -> :mnesia.write(row) end, []) end) end)

At the end of the day, if I need to persist a large number of records, what's the best way to do that? I'd imagine iterating over them one at a time and performing the write is not the best option. How can I do better? How did you load the 100k records for your test?


#6

Yes

No, there is no performance hit from checking for duplicates, but it can be costly to insert many values under the same key. If you have lots of values per key (thousands), it starts becoming noticeable. Insertion time for bag is linear in the number of values per key.

It is an atomic operation, meaning that the written data will not be garbage, but there are no transactions, and when run in a distributed setting there are no guarantees of replication.

The first one is OK; the second one will cause heaps of concurrent writes into the database, and I would not recommend it. I think the general recommendation is to limit the number of concurrent Mnesia operations to the number of schedulers.
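That bounded-concurrency idea can be sketched with Task.async_stream, capping the number of concurrent writers at the scheduler count. The :events table and the record shape here are assumptions for illustration:

```elixir
# Hedged sketch: concurrent dirty writes, capped at the number of schedulers.
# The :events table and the record shape are illustrative assumptions.
:mnesia.start()
{:atomic, :ok} = :mnesia.create_table(:events, attributes: [:key, :payload])

rows = for k <- 1..10_000, do: {:events, k, :payload}

rows
|> Task.async_stream(
  fn row -> :mnesia.activity(:async_dirty, fn -> :mnesia.write(row) end, []) end,
  # Never run more writers than there are schedulers.
  max_concurrency: System.schedulers_online(),
  ordered: false
)
|> Stream.run()
```

Unlike bare spawn/1, Task.async_stream backpressures the producer, so you never have more than max_concurrency writes in flight at once.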

For the number of values you have I’d do:

Enum.each(rows, fn row -> :mnesia.activity(:sync_transaction, fn -> :mnesia.write(row) end, []) end)
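An alternative worth benchmarking is to batch the rows so each transaction commits many writes, amortizing the per-transaction overhead. This is a hedged sketch, not something I measured; the chunk size of 1_000 and the :batch_demo table are arbitrary assumptions:

```elixir
# Hedged alternative sketch: one sync_transaction per chunk of rows,
# amortizing commit overhead. Chunk size 1_000 and :batch_demo are assumptions.
:mnesia.start()
{:atomic, :ok} = :mnesia.create_table(:batch_demo, attributes: [:key, :val])

rows = for k <- 1..5_000, do: {:batch_demo, k, :val}

rows
|> Enum.chunk_every(1_000)
|> Enum.each(fn chunk ->
  :mnesia.activity(:sync_transaction, fn -> Enum.each(chunk, &:mnesia.write/1) end, [])
end)
```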

I tried to insert 1M entries over 1000 keys (1000 entries per key) into a bag ram_copies table, and it took 66 seconds (54 seconds with async_dirty).

Compared to inserting 1M values with unique keys, which took 18 seconds (8.9 seconds with async_dirty).

So at this stage my best guess is that you have too many values per key. Another thing to consider is whether you are using indexes. They are also of type bag, and if they are not unique enough you run into the same problem as with too many values per key.