Ecto_foundationdb - An Ecto Adapter for FoundationDB

This is still possible with EctoFDB, just not with Versionstamps. Using :binary_id works the same as traditional Ecto.

The async is only there to match libfdb_c’s futures, allowing the caller to only wait on network when they deem necessary.

Repo.transaction(fn ->
  f1 = Repo.async_get_by!(User, name: "Alice")
  f2 = Repo.async_get_by!(User, name: "Bob")
  [alice, bob] = Repo.await([f1, f2])

  Repo.insert!(%Team{members: [alice.id, bob.id]})
end,
prefix: tenant)

Functionally equivalent, but now with 2 network waits:

Repo.transaction(fn ->
  alice = Repo.get_by!(User, name: "Alice")
  bob = Repo.get_by!(User, name: "Bob")

  Repo.insert!(%Team{members: [alice.id, bob.id]})
end,
prefix: tenant)

That first one doesn’t seem like a big deal to me (who does that), but performance could be an issue yeah. Worst case I can think of is someone inserting one versionstamped row and then selecting 1000 rows of something else and returning them. The question is whether that happens often enough to be a problem or if you can just throw an escape hatch in the API for the edge cases. I’ll have to think about it more.

This remains an open question. I haven’t started work on any high-level APIs yet as I am still busy reinventing the universe, but I am at a stage where I need to be thinking about these things. I don’t like dependencies so I’ll probably lean towards reimplementing them from scratch.

Phoenix integration could be an issue but I think the form handling is probably extensible enough to make something work. Honestly I don’t really like forms anyway, especially complex forms. It’s usually better to mutate the DB state directly rather than try to shove a complex interaction through the bottleneck of a single POST. The changeset relation APIs terrify me, I never touch them.

Not to spoil my own (eventual) announcement, but I have been slowly cloning FDB from scratch (this is probably somewhat obvious from my posts lol). The goal is to have an (ahem) foundation on which to build higher-level databases like a relational DB, taking inspiration from record layer and so on. I was very sad to see record layer has become an SQL DB, there were some good ideas in there that they are now discouraging (polymorphic relations). Protobufs are a terrible idea, though.

Honestly I just want a database I can trust. A while back my plan was to use Postgres and migrate to CockroachDB in the future; but they rugpulled it. When I started I never thought I would get as far as I have, but I could not have chosen a better database to learn distsys from. FDB is a masterpiece. Its only sin was choosing C++.


Nice! This looks really good. Any idea what kind of scale you can get with this sync approach? It reminds me a bit of Mongo change streams, which last time I checked had some limitations, to the point where I think they'd need to be paired with a pubsub system.

Hmm, what do you think of something like this as a way to deviate less? Not sure if it’s possible, but if it is then maybe the async_* functions wouldn’t be needed.

Repo.transact(fn ->
  futures = Repo.async(fn -> [
    Repo.get_by!(User, name: "Alice"), 
    Repo.get_by!(User, name: "Bob")
  ] end)

  [alice, bob] = Repo.await(futures)

  Repo.insert!(%Team{members: [alice.id, bob.id]})
end,
prefix: tenant)

Short answer is I don’t know yet, but here’s some info from the FDB docs that might steer a future benchmark / load test.

The first limitation we’d hit at scale is the client-side default limit of 10,000 pending watches. This can be increased with a config change.

By default, each database connection can have no more than 10,000 watches that have not yet reported a change. When this number is exceeded, an attempt to create a watch will raise a too_many_watches exception. This limit can be changed using Database.options.set_max_watches(). Because a watch outlives the transaction that creates it, any watch that is no longer needed should be cancelled by calling Future.cancel() on its returned future.

Also,

Storage servers have their own limits, and when they are exceeded the result is that the client falls back to automatic polling. This all happens transparently.

I’m less familiar with Mongo, but I have used RethinkDB’s changefeeds, with frustrating results. I believe the key difference with FDB is that the watch itself does not distribute data to the listener. It’s only a signal that something has changed. With this approach, the watch is conceptually merely an optimization to polling.


Interesting idea! I think that would be a fairly straightforward change. I’ll put it in the GitHub as a potential enhancement. I would definitely welcome a PR, but also it’s something that I could get around to fairly quickly. Thanks for the idea!


Honestly FDB watches are a tragedy, such wasted potential. If you want to implement live queries properly it is extremely important that you don’t miss any changes. You always want to observe a consistent prefix of the database state (I have seen this referred to as “internal consistency”) so that you don’t get torn updates or just straight up wrong data (missing a row in a collection).

In order to properly stream changes to a keyspace you need to read at a given version and subscribe at that version atomically so that you don’t miss any changes. You also need to guarantee that all mutations in the range are received. Watches make neither of these guarantees!

Also, they don’t even let you watch a range at all, only individual keys.

What hurts is FDB is architecturally so amenable to doing this properly. Every mutation is versioned and the MVCC architecture ensures you can read+subscribe at a given version and immediately receive all mutations since that version since they are already in memory. All you need to do is scan the in-memory MVCC store for that range, return those mutations immediately, and then add an entry in an interval tree or similar to check future mutations against when they are pulled from tlogs.

You can even do this across storage servers because FDB versions are global, so this approach can scale out. In most databases this would be a nightmare to do correctly but FDB’s underlying architecture is so strong the implementation practically writes itself.
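As a toy model (not FDB code, just the shape of the idea), reading the backlog at a version and registering the range for future matching might look like:

```elixir
# Toy model of "read + subscribe at a version atomically". The mutation
# list stands in for FDB's in-memory MVCC store; nothing here is real
# FDB API.
defmodule ToyLiveRange do
  # mutations: list of {version, key, value} tuples, oldest first
  def subscribe(mutations, range_start, range_end, from_version) do
    # 1. Scan the versioned store for the backlog since from_version...
    backlog =
      for {v, k, val} <- mutations,
          v >= from_version,
          k >= range_start and k < range_end,
          do: {v, k, val}

    # 2. ...then register the range (a real implementation would use an
    # interval tree) so future mutations pulled from the tlogs can be
    # matched against it.
    {backlog, {range_start, range_end}}
  end
end
```

Because both steps happen against the same versioned snapshot, no mutation can slip through between the "read" and the "subscribe".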

But they never built it.

I have been thinking about the async APIs a lot and I’m unable to come up with anything better. There is something about passing around futures in Elixir that just bothers me, but I think it would have to be corrected with coding style rather than a different API, i.e. trying to inline them as often as possible, like:

[alice, bob] = Repo.await([get("alice"), get("bob")])

Rather than:

alice_future = get("alice")
bob_future = get("bob")
[alice, bob] = Repo.await([alice_future, bob_future])

Obviously real queries will be messier than this so it will be interesting to see if that style holds up. I did have one neat idea, though: one of my favorite features of Ecto is how select() returns data matching the shape you ask for. Similarly, you could do something like this:

{alice, bob} = Repo.await({get("alice"), get("bob")})
%{alice: alice, bob: bob} = Repo.await(%{alice: get("alice"), bob: get("bob")})
%{users: %{alice: alice, bob: bob}} = Repo.await(%{users: %{...}})

…and so on. I wonder if it would be more composable that way.
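If that's appealing, a shape-preserving await is only a small recursive walk. Here's a hypothetical sketch using `Task` as a stand-in for the real futures (none of this is actual EctoFDB API):

```elixir
# Hypothetical deep-await: resolves futures wherever they appear in a
# nested structure and returns data in the same shape. Task stands in
# for the real future type.
defmodule DeepAwait do
  def await(%Task{} = future), do: Task.await(future)

  def await(map) when is_map(map) and not is_struct(map) do
    Map.new(map, fn {k, v} -> {k, await(v)} end)
  end

  def await(tuple) when is_tuple(tuple) do
    tuple |> Tuple.to_list() |> Enum.map(&await/1) |> List.to_tuple()
  end

  def await(list) when is_list(list), do: Enum.map(list, &await/1)

  def await(other), do: other
end

get = fn name -> Task.async(fn -> name end) end
%{users: {alice, bob}} = DeepAwait.await(%{users: {get.("alice"), get.("bob")}})
```

A real version would also want to collect all the futures first and wait on them as a batch, rather than awaiting one leaf at a time.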


Honestly, me too, even though I’m the one that is responsible for it in this case. A BEAM-friendly way is to write a GenServer for each query, and call them refs instead of futures :wink: . Maybe there’s something here, but it’s a lot more typing.

defmodule CreateTeam do
  # the TxGenServer would handle create, commit, retry, on the tx
  use EctoFoundationDB.TxGenServer

  # ...

  def init(_tx) do
    alice_ref = Repo.get_by(User, name: "Alice")
    bob_ref = Repo.get_by(User, name: "Bob")
    {:ok, %{alice: alice_ref, bob: bob_ref}}
  end

  def handle_ready(alice_ref, state=%{alice: alice_ref}) do
    check_finished(%{ state | alice: Repo.await(alice_ref) })
  end

  # .. same for bob ..

  def check_finished(state=%{alice: alice, bob: bob})
  when not is_reference(alice) and not is_reference(bob) do
    Repo.insert!(%Team{members: [alice.id, bob.id]})
    {:stop, :normal, state}
  end

  def check_finished(state), do: {:noreply, state}
end

I don’t think process shenanigans save us here, unfortunately. Stepping back to the KV layer for a minute, the problem I’m worried about is that you really want to batch multiple requests into a single message to the storage server. If I’m querying 10 keys at once there’s a pretty good chance they’re all going to the same server (or a couple) so batching is a really good idea. But in order to do that there’s really no way to avoid an API like this:

%{"key1" => v1, ...} = get(["key1", "key2", "key3"])

I don’t actually know what the FDB client does here. I would think they want to batch too, but they use individual futures for the keys. Are the reads internally delayed a short time? That just sounds messy to me.

Anyway, a high-level query API (like yours) will still have to compile down to that multi-key get interface under the hood, and so I really don’t see any way out here. I guess I can cope by telling myself that Task.await_many() exists :slight_smile:

What’s funny is when using Ecto/SQL we generally get around this by just eating tons of round-trips and not caring, but that doesn’t seem like something to aspire to!
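For illustration, a toy multi-get that groups keys by their (pretend) owning server so each server receives one batched round-trip. The routing and fetch here are stubs, not real FDB calls:

```elixir
defmodule ToyMultiGet do
  # Stand-in for real shard routing: pretend there are 3 storage servers
  # and keys are assigned by hash.
  defp owner(key), do: :erlang.phash2(key, 3)

  # Stand-in for one batched network request to a single server.
  defp fetch_batch(_server, keys), do: Map.new(keys, fn k -> {k, "value-" <> k} end)

  def multi_get(keys) do
    keys
    |> Enum.group_by(&owner/1)
    |> Enum.map(fn {server, server_keys} -> fetch_batch(server, server_keys) end)
    |> Enum.reduce(%{}, &Map.merge/2)
  end
end
```

`ToyMultiGet.multi_get(["key1", "key2", "key3"])` returns a map keyed by the requested keys, matching the interface above; a real version would issue the per-server fetches concurrently.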


You’re surely already aware, but for the sake of others: unfortunately the fdbclient network protocol is not documented publicly, and all client implementations are reliant on libfdb_c. The FDB developers view the client as part of the cluster with some important functions that could be missed with a faulty implementation. This makes it a bit challenging to find answers to behavioral questions.

However, this quote implies that the client does batch Get requests, but doesn’t go into detail.

Layers can easily make multiple concurrent get calls and wait on the returned futures. Internally, client library is free to combine multiple get calls destined to same storage server, into a single network request.

Related: get-read-versions (GRVs) are definitely batched. This is widely documented and is a key part of achieving high read throughput.

Of course, the best throughput always comes from GetRange requests, so those are always preferred if possible.


Probably what happens is the get()s are queued up on the client thread and then once you trigger an await on any of them they are all sent out as a batch. For some reason in my head I thought they would be sent on get() rather than await() but obviously that makes no sense.

The implicit awaiting going on in some of the language bindings seems confusing to me, though. I prefer the explicit Repo.await() in your API as it mirrors what is intuitively a multi-get operation even if the FDB client is still trying to be clever under the hood :slight_smile:

It’s funny, if you look at an Ecto query the APIs are not actually as divergent as they first appear:

bob_future = from u in User, where: u.name == "bob", limit: 1, select: u
bob = Repo.one!(bob_future)

bob_future = Repo.get_async("/users/bob")
bob = Repo.await(bob_future)

It’s actually the same API! The only difference is that Ecto does not support executing multiple queries in one round-trip. AFAIK it actually is valid to send multiple SQL statements in one request, but it seems like nobody ever does that. I wonder why that is.

No, the await is simply a receive block, nothing more:

The get() batching, however it exists, would be a dynamic decision by libfdb_c. Non-batching behavior is to send the get() on the wire immediately. The receive block in the wait is satisfied by the data returning on the network thread, which dispatches the message to the calling process.
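To make that concrete, here's a rough sketch (my reading of the mechanism, not erlfdb's actual code) of an await as a plain receive block:

```elixir
# Sketch: the future is just a ref. The network thread delivers the
# result to the calling process as a {ref, result} message, and await
# is nothing more than a selective receive on that ref.
defmodule AwaitSketch do
  def await(ref) when is_reference(ref) do
    receive do
      {^ref, result} -> result
    after
      5_000 -> exit(:timeout)
    end
  end
end
```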


It’s actually the same API! The only difference is that Ecto does not support executing multiple queries in one round-trip. AFAIK it actually is valid to send multiple SQL statements in one request, but it seems like nobody ever does that. I wonder why that is.

It’s common to execute multiple statements in a query. Ecto and Ecto.Query only support that within a transaction. Doing it outside of a transaction, which is also common, would require you to write raw SQL, and only myxql supports returning multiple result sets.


Do you know if this is actually what happens? It seems like the C API allows you to create get() futures and then block on any (one) of them (or register a callback). The point at which you block seems like a great time to send out a batch, though of course it could also batch early if a lot of futures are created. There is a classic batching throughput/latency tradeoff here.

If all you do is register callbacks I don’t see how the client would know you want to block, though. So I guess it must just be a timeout after all? I feel like providing explicit control to the developer here is preferable, so I’m curious why they did it that way.
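One way to picture the tradeoff: a toy batcher (pure speculation about libfdb_c; the names and numbers here are made up) that flushes when the buffer fills or after a short linger timeout, whichever comes first:

```elixir
defmodule ToyBatcher do
  use GenServer

  @max_batch 16   # flush immediately once this many gets are buffered
  @linger_ms 5    # otherwise flush this long after the first buffered get

  def start_link(sink), do: GenServer.start_link(__MODULE__, sink)
  def get(pid, key), do: GenServer.cast(pid, {:get, key})

  @impl true
  def init(sink), do: {:ok, %{sink: sink, buf: []}}

  @impl true
  def handle_cast({:get, key}, %{buf: []} = state) do
    # first request of a batch: arm the linger timer
    Process.send_after(self(), :flush, @linger_ms)
    {:noreply, %{state | buf: [key]}}
  end

  def handle_cast({:get, key}, %{buf: buf} = state) when length(buf) + 1 >= @max_batch do
    flush([key | buf], state.sink)
    {:noreply, %{state | buf: []}}
  end

  def handle_cast({:get, key}, state), do: {:noreply, %{state | buf: [key | state.buf]}}

  @impl true
  def handle_info(:flush, %{buf: []} = state), do: {:noreply, state}

  def handle_info(:flush, state) do
    flush(state.buf, state.sink)
    {:noreply, %{state | buf: []}}
  end

  defp flush(keys, sink), do: send(sink, {:batch, Enum.reverse(keys)})
end
```

Raising `@max_batch` trades latency for throughput; shortening `@linger_ms` does the opposite. Explicit await-driven flushing would replace the timer entirely.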

I should say for the record that I have spent almost zero time reverse-engineering the client because unlike the server side the client implementation is pretty obvious (sans the specifics, clearly).

Wait, it does? What is the syntax for that?

When I execute multiple queries in a transaction, I use Ecto.Multi; however, that has become frowned upon, for reasons I don’t find compelling. So today you would probably have to wrap them all in an anonymous function if you were to use Ecto. But for me, in the future I will be using SQL, which removes those limitations and abstractions.


But Ecto.Multi still doesn’t execute multiple statements in one request, right? I meant something like this:

select * from users where name = 'bob';
select * from users where name = 'alice';
# ... and so on

That is, sending multiple (semicolon-separated) SQL statements in one round-trip to the server. Obviously there are ways you could constrain this particular example into one statement, but there are cases where that gets messy (if the queries were more complicated).

I am not aware of any way to do this with Ecto short of dropping down to raw SQL, and even then I’m not sure if multiple result sets are supported. TBH I had some trouble determining if Postgres even supports them, though I did not look that hard.

FDB is a little different than an SQL database because the idea is you’re supposed to write the high-level query planner stuff as a stateless client to the underlying datastore, so unlike with SQL we need to think a bit harder about round-trips. The batching is also difficult because there are actually many servers and choosing which ones to send the requests to is a little more complicated than you might think, plus you might want to make requests to multiple servers concurrently to save latency.

But in a parallel universe where things went differently, you might expect a syntax like this:

query1 = one(from u in User, where: u.name == "alice")
query2 = one(from u in User, where: u.name == "bob")
{alice, bob} = Repo.execute({query1, query2})

And as you can see this is essentially what @jstimps has ended up with, except he has to fight Ecto a bit to do it because it was not designed for this.


MyXQL supports multiple results (see Ecto.Adapters.SQL in Ecto SQL v3.13.2), but yeah, you would have to write raw SQL. My memory might be off, but I could swear the transaction would happen in one go. If not, then there’s another limitation to Ecto I wasn’t aware of.


Exactly. Ecto doesn’t support it through its API but it does have Repo.query where the adapter can do what it wants (including multiple queries). This is a feature we could add at the high level API if required though and contributions are welcome.


Hi folks,

An update that’s not strictly EctoFDB related – I converted (/copied) the official FoundationDB Class Scheduling Tutorial to an Elixir-focused Livebook:

In the tutorial, we develop a simple data layer (using key-values in subspaces) that an application could use to sign up students for classes and drop those classes, adding some interesting business logic along the way.

The tutorial could be useful for anyone who wants to

  • Get started with :erlfdb itself
  • Understand the essentials of the EctoFDB implementation, which takes these same ideas and fits them to a subset of the Ecto abstractions

Nice guide! Reading through it reminded me of something.

I had been considering for some time whether it would be a good idea to make the tuple encodings “follow” Erlang term order. Not exactly, but closer than the FDB tuples. The first step would be to rearrange the typecodes to match term order, which is easy enough.

But the problem is that Erlang tuples are sorted length-first rather than lexicographically, which is very bad behavior for something like FDB. However, during a conversation with @Asd in the Bedrock thread I realized that if you just avoid tuples and use lists instead you don’t have this problem because lists are ordered correctly (by their elements). Which makes perfect sense, because computing the length of a list would be very expensive. (It remains a mystery why tuples are compared in such an unhelpful manner, though.)
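A quick illustration of the difference (plain Erlang term order, nothing EctoFDB-specific):

```elixir
# Tuples compare length-first: a 3-tuple sorts after every 2-tuple,
# regardless of contents.
true = {:user, "1", "Alice"} > {:zzz, "9"}

# Lists compare element by element, which is the lexicographic behavior
# you want to mirror in an ordered key encoding.
false = [:user, "1", "Alice"] > [:zzz, "9"]

# Sorting mixed-arity tuples groups them by length first:
[{:b, 1}, {:a, 1, 2}] = Enum.sort([{:a, 1, 2}, {:b, 1}])
```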

The irony is that I was implementing tuple encodings a couple of weeks ago following the :erlfdb_tuple implementation (out of laziness) and I noticed that the encoder/decoder uses lists internally and converts from/to tuples at the start/end, which makes perfect sense because you want to build up the list incrementally as you parse. And I thought “why bother, the lists will generally be short anyway”, so I just used lists in the API instead of tuples.

So as it turns out, I have actually already done this. By accident!

With erlfdb this is probably not something you want to change at this point (and you wouldn’t want to break compatibility with FDB tuples either), but I’m curious what you think about trying to follow term order with the encodings. It’s not something that actually matters in reality, but I find it oddly satisfying.

The performance of lists vs. tuples is an interesting question. In practice, I assume tuples are slightly faster:

[a, b, [c, d]] = Tuple.unpack(bin)
# vs.
{a, b, {c, d}} = Tuple.unpack(bin)

But for short lists I’m doubtful there is a meaningful difference. And using tuples costs some performance too because there is an extra conversion (:erlfdb_tuple builds up a list first). Is there any record of why erlfdb uses tuples over lists?

The integer/float comparison behavior seems like a bad path to go down, though, so I would still deviate from term order there I think.

I haven’t seen one. Now that you mention it, I do agree that using lists in Erlang/Elixir (I’ll use “erts” for shorthand) would be more ergonomic. As you say, the FDB Tuple layer encourages the practice of “building up” a fdb-tuple, which is awkward with an erts-tuple, since they’re fixed length data structures.

Also, pattern matching on erts-lists is even more powerful than erts-tuples, so it sounds like very good idea indeed!

{"user", user_id, _, _, _, _, _, _}
  = :erlfdb_tuple.unpack(key) # ugh!

["user", user_id | _]
  = :erlfdb_tuple_v2.unpack(key) # yay!

I have a bad track record of predicting reality in micro-benchmarks like this, but I wonder if using erts-lists would actually be faster in the best case than erts-tuples due to the conversion you mention. Either way, it’s likely to be negligible compared to I/O :person_shrugging: .

I can see why this would be useful for your database server, likewise for Bedrock, since you’re more likely to want to do key comparisons with both the binary representation and the data structure. On the client, this has never been a pain point for me, though, since the server always returns keys in the correct sort order, and I can’t remember ever needing to compare them myself. That being said, I’m a big fan of design simplicity, so :+1: from me!

On the question of whether or not an ERTS-friendly Tuple V2 would be a good idea to store in actual FDB – I’m a maybe on this. On the one hand, FDB is supposed to allow the client to be entirely in control of the Layer.

On the other hand, GetMappedRange exists. As you’re aware, FDB implements this feature with assumptions about the key and value encoding, specifically that they are Tuple encoded. If you were to only change the type codes, it would probably still work because FDB would have no reason to decode the types. AFAIK their only assumption is regarding the boundaries between fields. But GetMappedRange is so finicky that I worry there is some dragon lurking there.

Keeping in mind how important records were (and still are) in Erlang, the tuple ordering makes sense. Sorting fixed-length tuples together means that different versions of your ets and mnesia records would be nicely grouped. I don’t know if this is the reason, but it seems like a real benefit.

(This also illustrates why records can be tricky :smiley: )

iex(1)> :ets.new(:tab, [:named_table, {:keypos, 2}, :bag])
:tab
iex(2)> :ets.insert(:tab, {:user, "1"})
true
iex(3)> :ets.insert(:tab, {:user, "2"})
true
iex(4)> :ets.insert(:tab, {:user, "1", "Alice"})
true
iex(5)> :ets.insert(:tab, {:user, "2", "Bob"})
true
iex(6)> :ets.tab2list(:tab) |> Enum.sort()
[{:user, "1"}, {:user, "2"}, {:user, "1", "Alice"}, {:user, "2", "Bob"}]

Wdym? If you’re going for a stable sort order in both binary and erts, you must, no?
