Question about memory copy

andzdroid · January 11, 2024, 1:18pm

I saw this bit of text on the cubdb repo’s readme (GitHub - lucaong/cubdb: Elixir embedded key/value database):

Snapshots come at no cost: nothing is actually copied or written on disk or in memory, apart from some small internal bookkeeping.

In relation to this code:

{x, y} = CubDB.with_snapshot(db, fn snap ->
  x = CubDB.Snapshot.get(snap, :x)
  y = CubDB.Snapshot.get(snap, x)

  {x, y}
end)

I looked at the with_snapshot function and all it does is make a genserver call. Can somebody explain how this doesn’t copy anything in memory?

  def snapshot(db, timeout \\ 5000) do
    GenServer.call(db, {:snapshot, timeout}, :infinity)
  end

My understanding of elixir was that messages across processes are always copied. Is my understanding wrong? Am I missing something?

LostKobrakai · January 11, 2024, 1:27pm

You’re correct. The message {:snapshot, timeout} is copied when being sent. That doesn’t mean the database itself (as in the data your stored in there) is copied when you do a snapshot though – which is what the documentation is trying to say.

andzdroid · January 11, 2024, 1:28pm

Wouldn’t the whole db be copied in the reply? The function is trying to get a snapshot of the whole database, which is sent back by the genserver call.

LostKobrakai · January 11, 2024, 1:41pm

That expects that “the snapshot” is “the data you store in the db”. It probably isn’t. It’s likely just a pointer or reference to the actual data stored elsewhere.

joey_the_snake · January 11, 2024, 1:44pm

If it was a pointer or a reference then it wouldn’t really be a snapshot because then you could see other people’s modifications. Otherwise it would have to be a pointer or a reference to a copy.

I believe the docs are implying no copying outside of data transfer between processes.

LostKobrakai · January 11, 2024, 1:53pm

That depends on the datastructure used and what information makes up the pointer. E.g. if the datastructure is an append only log you can create a “snapshot” just by having the id of the last item in the log, which is still part of the snapshot. CubDB afaik uses some kind of tree structure to store data.

joey_the_snake · January 11, 2024, 2:35pm

Looking at the source, it seems like a tree structure is copied to the caller. And this tree structure is walked when trying to read particular keys.

LostKobrakai · January 11, 2024, 2:38pm

Yeah, but that tree structure does in turn refer to a data store and that data store (file by default) holds the actual data. So yes, some stuff is copied (the internal bookkeeping mentioned), but it’s not the database nor the values it stores.

joey_the_snake · January 11, 2024, 2:48pm

Yeah true. I do see the OP’s point though. I wouldn’t necessarily expect the full tree structure, even if it doesn’t have all the data, based on the description.

LostKobrakai · January 11, 2024, 2:51pm

I’d imagine that in the grand scheme of things the tree is probably tiny compared to what would need to be moved around if values would be affected – especially when the db reaches sizes where you actually start caring for it not being copied.

lucaong · January 11, 2024, 8:46pm

@LostKobrakai is essentially right.

CubDB uses an append-only Btree: changes are appended to the database file, like in a log file. Basically, every new change causes a new Btree root to be appended to the database file. The root points at the old nodes as well as to some new ones containing the data that changed. Existing Btree nodes are never updated in place. It works very similarly to an immutable data structure, like the ones used by Elixir. This means that if one has a reference to an old Btree root, they can traverse the tree and see the database exactly as it was when that root was the current one.

Obtaining a snapshot then is as simple as holding a reference to a Btree root: new changes can be written to the database, but they won’t change what is pointed by that old root. Therefore, there is no database copying, as in CubDB does not have to make a copy of the database to give you an immutable snapshot. The Btree root is a small data structure, only holding a reference to other nodes down the tree: it doesn’t contain any data. Taking a snapshot is a very fast O(1) operation. In fact, it is internally used on every read operation, to ensure consistency in presence of concurrent writes.

The bookkeeping mentioned is simply because occasionally, in order to avoid growing the database file indefinitely as more changes accumulate, CubDB performs a compaction, which discards the data which is old and not referenced anymore. CubDB has to keep track of what’s referenced by the existing snapshots to avoid removing it, a bit like a garbage collector.

The details of compaction are a bit more involved, as compaction can proceed in parallel with reads and writes, but in essence when you take a snapshot CubDB does not copy any data: it merely remembers which Btree root your snapshot points to.

When you read from the snapshot, of course data will be copied from disk into memory. But that’s just the data you read, not the whole database. And, again, that is when you read from the snapshots, not when you take a snapshot.

I hope this clarifies it

andzdroid · January 11, 2024, 9:04pm

Thank you for the detailed explanation!

lucaong · January 12, 2024, 9:26am

Also, more clarifications regarding @joey_the_snake replies:

If it was a pointer or a reference then it wouldn’t really be a snapshot because then you could see other people’s modifications. Otherwise it would have to be a pointer or a reference to a copy.

No: it is a reference to the original data, a simple pointer, but it is isolated from other updates. The trick is that the original data is immutable, it cannot be changed by other updates. Instead, updates create a new Btree root that points to all the data that did not change, plus the part that changed. It is the same principle underlying immutable data structures with structural sharing (like those in Elixir, Clojure, etc.)

Looking at the source, it seems like a tree structure is copied to the caller. And this tree structure is walked when trying to read particular keys.

Not the whole tree, just the root, which is a small data structure. And yes, the tree is walked when accessed, but it is not a copy: it is the original (immutable) tree, but accessed from the snapshot root.

Yeah true. I do see the OP’s point though. I wouldn’t necessarily expect the full tree structure, even if it doesn’t have all the data, based on the description.

It’s not the full tree. You can think about it with an analogy with Elixir immutable data structures:

# Say we have a map, which in our analogy is the database
m = %{ foo: 1, bar: 2, baz: [1, 2, 3] }

# I take a "snapshot" by saving a reference to the current map
s = m

# Note that nothing was copied above. Simply, a new reference to the
# same (immutable) data structure is created

# This "snapshot" is isolated from changes:
m = Map.put(m, :bar, 3)
m # => %{ foo: 1, bar: 3, baz: [1, 2, 3] }
s # => %{ foo: 1, bar: 2, baz: [1, 2, 3] }

Note the part that is in common between s and m was not copied: both s and m point to the same values (for example the key :baz points to the same list in m and s, not to two identical copies). Yet, because maps are immutable in Elixir, s does not see changes. Simply, m tracks the latest updates, while s keeps pointing to a certain version.

As soon as s is not in scope anymore, the garbage collector can then clean up the old data that is not reachable anymore from m.

CubDB works in much the same way, just on disk instead than in memory, and therefore with some different implementation details to account for the difference performance characteristics of disk vs. memory.

I just found this nice explanation of how an append-only Btree works, which corresponds to how CubDB works internally. On top of that, CubDB uses a compaction process to “garbage collect” disk space and avoid growing indefinitely with each update.

I hope this clarifies it further

joey_the_snake · January 12, 2024, 9:32am

Thank you for the explanations you’ve provided in this thread. What you’ve created is really interesting.

I think the part I was getting tripped up over is that it seems like the btree is returned from a genserver to the caller. And it looked to me like the btree contains the root, which in return has a list of references to children. I thought all of these would have to be deep copied to the caller from the genserver.

lucaong · January 12, 2024, 3:23pm

And it looked to me like the btree contains the root, which in return has a list of references to children. I thought all of these would have to be deep copied to the caller from the genserver.

Good point. The reason why the subtree referenced by the root does not get copied to the GenServer caller is that those references are not standard memory references, but rather pointers to an offset in the database file. CubDB knows how to follow such pointers on demand to access the file in the correct locations when reading and traversing the tree.