PersistentEts

michalmuskala · March 2, 2017, 12:02pm

Another small library today.

Hex: persistent_ets
GitHub: michalmuskala/persistent_ets

Ets table backed by a persistence file.

The table is persisted using the :ets.file2tab/2 and :ets.tab2file/3 functions.

Create the table with PersistentEts.new/3 in place of :ets.new/2. After that, all functions from :ets can be used as with any other table, except :ets.give_away/3 and :ets.delete/1 - replacement functions are provided in this module. Using :ets.setopts/2 to change the heir is not supported, because the heir setting is leveraged by the persistence mechanism.

As with a regular ets table, the table is destroyed once the owning process (the one that called PersistentEts.new/3) dies, but the table data is persisted, so it will be re-read when the table is opened again.

Example

pid = spawn(fn -> 
  :foo = PersistentEts.new(:foo, "table.tab", [:named_table])
  :ets.insert(:foo, [a: 1])
end)
Process.exit(pid, :diediedie)
PersistentEts.new(:foo, "table.tab", [:named_table])
[a: 1] = :ets.tab2list(:foo)

hubertlepicki · March 2, 2017, 12:53pm

Could you highlight for us how it differs from DETS and why would someone choose one against another?

michalmuskala · March 2, 2017, 1:17pm

With Dets, every operation (read or write) hits the disk. For many applications such a performance penalty (compared to ets) is not acceptable. Furthermore, Dets tables are limited to 2GB. Dets doesn’t support the ordered_set table type either.

With PersistentEts, the table remains in memory, so all read and write operations have the same performance they would have with pure ets. The table state is saved to a file only periodically. There’s also no file size limit, besides the memory and disk limitations. Since it’s a regular Ets table, all table types are fully supported.
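To make the mechanism concrete, here is a minimal sketch of the dump-and-reload cycle using only plain :ets calls. It is not the library's code; the table name and file path are made up for illustration.

```elixir
# Sketch only: plain :ets calls, not PersistentEts internals.
# The table name and the file path are hypothetical.
path = String.to_charlist(Path.join(System.tmp_dir!(), "sketch_table.tab"))

# An ordered_set works here because this is an ordinary in-memory ets table.
table = :ets.new(:sketch, [:ordered_set, :public])
true = :ets.insert(table, [{:a, 1}, {:b, 2}])

# Dump the whole table to disk in one go.
:ok = :ets.tab2file(table, path)

# Simulate losing the in-memory table (for example, the owner dying).
true = :ets.delete(table)

# Reload from the file; this creates a fresh table with the saved options.
{:ok, restored} = :ets.file2tab(path)
IO.inspect(:ets.tab2list(restored))
```

Reads and writes between the dump and the reload never touch the disk, which is the point of the approach.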

DanCouper · March 2, 2017, 1:35pm

This is serendipitous; I’m prototyping something at the minute, and this fits the bill exactly. Wanted to have ETS tables that held a specific state for users while they were all connected that could easily be saved for recovery when users came back online (it’s a procedural generation toy, the ETS table provides ‘terrain’ that all users of the toy, and all their controlled processes, can access). Mnesia didn’t quite seem to fit the bill, seemed a bit of a faff, just wanted something brutally simple to get things running quickly, so thanks for this.

hubertlepicki · March 2, 2017, 1:48pm

Copy paste that to readme now.

OvermindDL1 · March 2, 2017, 3:28pm

Is there a benchmark of it compared to Mnesia with duplicate_bag tables using dirty read/writes (basically ETS that is DETS backed at that point) and similar settings for PersistentEts?

Does it only persist to disk ‘on occasion’ or after every write? Does it do it when the owner process is terminated? Given the use of file2tab and tab2file, I’m guessing it serializes the entire ETS table on every write-out instead of only the differences?

michalmuskala · March 3, 2017, 12:50pm

Performance of PersistentEts should be the same as performance of Ets itself - it is Ets.

It persists periodically - the default is every minute. It also triggers when the owner terminates. You can also trigger persistence manually.
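For anyone wondering what "periodically, on termination, or manually" can look like, here is a rough sketch with a GenServer and plain :ets.tab2file/2. This is not the library's implementation: the module name, the flush/1 function and the interval argument are all hypothetical, and the sketch leaves out the heir-based handover that PersistentEts relies on when the owner dies.

```elixir
defmodule PeriodicDumpSketch do
  # Hypothetical module, for illustration only; not part of PersistentEts.
  use GenServer

  def start_link(table, path, interval_ms) do
    GenServer.start_link(__MODULE__, {table, path, interval_ms})
  end

  # Manual trigger: dump the table right now.
  def flush(pid), do: GenServer.call(pid, :flush)

  @impl true
  def init({table, path, interval_ms}) do
    # Trap exits so terminate/2 runs on shutdown and can do a last dump.
    Process.flag(:trap_exit, true)
    schedule(interval_ms)
    {:ok, %{table: table, path: String.to_charlist(path), interval: interval_ms}}
  end

  @impl true
  def handle_call(:flush, _from, state), do: {:reply, dump(state), state}

  @impl true
  def handle_info(:dump, state) do
    dump(state)
    schedule(state.interval)
    {:noreply, state}
  end

  def handle_info(_other, state), do: {:noreply, state}

  @impl true
  def terminate(_reason, state), do: dump(state)

  # The whole table is written every time, not just the changes.
  defp dump(%{table: table, path: path}), do: :ets.tab2file(table, path)

  defp schedule(ms), do: Process.send_after(self(), :dump, ms)
end

table = :ets.new(:periodic_sketch, [:set, :public])
path = Path.join(System.tmp_dir!(), "periodic_sketch.tab")
{:ok, pid} = PeriodicDumpSketch.start_link(table, path, 60_000)

true = :ets.insert(table, {:a, 1})
:ok = PeriodicDumpSketch.flush(pid)
```

Note that the table is read and written directly through :ets; the process only handles the dumps.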

Yes, each time the table is fully dumped. In my quick tests, dumping a 10GB table takes 20-30s. A 100MB table takes ~200ms.
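Numbers like these depend heavily on hardware and row shape, so it is worth measuring on your own machine. A minimal sketch using :timer.tc/1 follows; the table name, path and row count are arbitrary choices for illustration.

```elixir
# Sketch only: times a single full dump of a small table.
# The table name, path and row count are arbitrary assumptions.
table = :ets.new(:timing_sketch, [:set, :public])
path = String.to_charlist(Path.join(System.tmp_dir!(), "timing_sketch.tab"))

# Fill the table with 100_000 small rows.
rows = for i <- 1..100_000, do: {i, i * 2}
true = :ets.insert(table, rows)

# :timer.tc/1 returns {elapsed_microseconds, result_of_the_fun}.
{micros, :ok} = :timer.tc(fn -> :ets.tab2file(table, path) end)

IO.puts("dumped #{:ets.info(table, :size)} rows in #{div(micros, 1000)} ms")
```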

aseigo · March 3, 2017, 1:41pm

Data access, sure. But the performance of PersistentEts cannot be the same as Ets: Ets does not include persistence. So while data access is surely the same via the :ets api, the overall performance of PersistentEts is still interesting imho…

I don’t find it easy to understand what those numbers mean without knowing the hardware involved, and also not overly meaningful without understanding how often persistence occurs and in which circumstances. Looking at the code, by default it writes the table to disk once per minute, regardless of changes to the table. (… as well as on table owner exit) Am I reading that correctly?

What is the intended use case for this? It can’t be for “valuable” data, as any changes to it within the persist timeout will be lost. It can’t be for large tables, since hitting disk for such a time period could be pretty undesirable if there is other I/O happening, not to mention messages to the PersistentEts process which may be piling up behind it?

Looking at the code, I’m also unsure what happens if it so happens that PersistentEts.new is called when another PersistentEts is busy writing out a table … ?

So … perhaps this is intended for “well behaved” applications with “small” datasets in ets tables?

I’m sure there must be a good use case for this, just trying to understand what it is. (So please don’t take the above as too critical … just walking through my thoughts as I look over the code)

JEG2 · March 3, 2017, 2:06pm

This persistence model is what Redis uses, so I would say you could use this in many scenarios where you would otherwise reach for Redis.

JEG2 · March 3, 2017, 2:19pm

The process is just for managing persistence. You don’t need to send messages to the process to interact with the table.

OvermindDL1 · March 4, 2017, 12:31am

That is basically how it is with mnesia as well: using dirty reads/writes is nearly the same as calling ETS directly, except it can still serialize the data out in the background after a write instead of needing dumping periods.

Although could not get to 10gigs with it, at least on a 32-bit system. ^.^

lessless · May 3, 2022, 7:30pm

Does it mean that it can guarantee that an entry will be saved on disk if the insertion function returned ok?

How will it behave in the presence of always active clients?
It’s impossible to postpone shutdown indefinitely.

Exadra37 · May 3, 2022, 8:22pm

As far as I know, nothing in Elixir that writes to the disk can give you that level of guarantee.

As per the Erlang docs:

delayed_write

The same as {delayed_write, Size, Delay} with reasonable default values for Size and Delay (roughly some 64 KB, 2 seconds).

I would love to be proved wrong
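One related knob, for reference: :ets.tab2file/3 accepts a sync option, which the Erlang docs describe as ensuring the file content is written to disk before the call returns. The sketch below uses plain :ets with a made-up table name and path; whether PersistentEts passes this option is not something this thread establishes, and it only covers the moment of the dump, not each individual insert.

```elixir
# Sketch only: plain :ets, hypothetical table name and path.
path = String.to_charlist(Path.join(System.tmp_dir!(), "sync_sketch.tab"))

table = :ets.new(:sync_sketch, [:set])
true = :ets.insert(table, {:key, "value"})

# With sync: true the call should not return until the dump has been
# flushed to disk. Inserts made after this point are still memory-only
# until the next dump.
:ok = :ets.tab2file(table, path, sync: true)

# Read the file back into a second table to check what was saved.
{:ok, copy} = :ets.file2tab(path)
IO.inspect(:ets.tab2list(copy))
```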

lessless · May 5, 2022, 6:09pm

Thank you for the insights!

I tried to look into the sources of tab2file but didn’t have enough patience to dig down into what they’re using there for file manipulation.
I just saw that they’re using message passing, and that was enough for me to understand that there will definitely be overhead.