How to do get_or_insert_by in Ecto without a race condition?

Hi there. I am trying to implement logic that does get_by(..) and, if the record does not exist, inserts it and returns it.

The first implementation is like this. (We are using Postgres.)

def get_or_insert(account_id, type) do
  case Repo.get_by(Storage, account_id: account_id, type: type) do
    nil -> Repo.insert(%Storage{account_id: account_id, type: type})
    storage -> {:ok, storage}
  end
end

The storage table has a composite unique index on account_id and type, so this works fine until two identical requests arrive at once. I expected both requests to get the same storage back, but one of them gets a "type already exists" error: a race condition where the Repo.get_by calls from both requests are evaluated at the same time, before either insert happens.

Coming from Rails, I would solve this there with account.lock do ... end to prevent reading the record at the same time, but I am not sure how to implement this in Ecto. Any ideas on how to implement this while avoiding the race condition?


So far, this one is the most similar to what I am looking for, but it still only answers the case of get(id).

Yes! Either by pessimistic locking (which you can do by wrapping your code in a Repo.transaction, assuming that you use Postgres or another database that supports them), which locks the whole table, or by optimistic locking, which requires an extra field in your DB table but is much faster in cases where conflicting writes are uncommon, since it only 'locks' one record in the table.

There is also Ecto.Query.lock, but it is more low-level and DB-specific.
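For illustration, a minimal sketch of the optimistic variant, assuming the schema has an integer lock_version field (the extra DB column mentioned above) and a changeset function; all names here are placeholders:

def update_storage(storage, params) do
  storage
  |> Storage.changeset(params)
  |> Ecto.Changeset.optimistic_lock(:lock_version)
  # Raises Ecto.StaleEntryError if a concurrent write bumped lock_version first.
  |> Repo.update()
end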

1 Like

Do transactions in Postgres actually lock the whole table? I think it might depend on the isolation level, which is read committed by default (I think), and which I don't believe locks the table at all; it relies on MVCC instead.

Depending on the problem, you can also use upserts. If you do insert(data, on_conflict: :nothing, returning: true), it will do nothing if the data exists and return all fields from the database.
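For example, a minimal sketch of the upsert approach against the schema from the original post (the conflict_target and field names are assumptions):

def get_or_insert(account_id, type) do
  %Storage{account_id: account_id, type: type}
  |> Repo.insert(
    on_conflict: :nothing,
    conflict_target: [:account_id, :type],
    returning: true
  )
  # Note: as clarified further down the thread, RETURNING only yields fields
  # for rows actually inserted (or updated), not for skipped conflicts.
end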

We also talk about it on the What’s New in Ecto 2.0 ebook (95% of it still applies to Ecto 3). http://pages.plataformatec.com.br/ebook-whats-new-in-ecto-2-0

10 Likes

No. Doing an insert does not prevent other transactions from doing inserts. You have to use an explicit lock command (in Postgres) to lock the whole table. Doing an update will block other updates on those rows, and it may block reads depending on the isolation level of the reader (but not by default in Ecto).

2 Likes

You are correct; it depends on the isolation level, which, as long as it isn't SERIALIZABLE, will not lock the whole table.

2 Likes

This is not correct. The isolation level will never cause a table to be locked by an insert or any other DML command. Locks and isolation levels are separate concepts. According to the Postgres documentation, these are the only commands that lock a full table: ALTER TABLE, DROP TABLE, TRUNCATE, REINDEX, CLUSTER, and VACUUM FULL.

DML commands like insert/update obtain ROW EXCLUSIVE locks.

Isolation levels are concerned with isolation between transactions: preventing dirty reads and phantom reads. SERIALIZABLE simply means the transactions must have the same semantics running in parallel as they would have running one after the other. Yes, this requires read locks and not reading new rows from another transaction, but that does not lock the whole table against inserts.

8 Likes

@jeremyjh Thank you for clarifying :slight_smile:!

3 Likes

To make sure I understand this correctly (because TFM seems ambiguous on this - actually more so than the quote above):

When there’s conflict (like a UNIQUE constraint being violated), does the insert/2 function return the conflicting record from the database, discarding the struct or changeset passed to it? What if multiple records conflicted with the passed data? Like when violating constraints on several fields?

Update:

Did some tests (using Postgres), and it seems that, given a changeset whose insertion would violate a unique constraint, Repo.insert(changeset, returning: true, on_conflict: :nothing) returns the data passed to it rather than the record from the database. Is this the expected behaviour? If so, what would be the way to obtain the existing record without performing a separate query for that purpose?

2 Likes

As clarified in a separate thread, this is not actually true. Since the function only translates syntax between Elixir and SQL, only the equivalent SQL behaviour remains valid. The PostgreSQL documentation says:

The optional RETURNING clause causes INSERT to compute and return value(s) based on each row actually inserted (or updated, if an ON CONFLICT DO UPDATE clause was used)

This is consistent with the results of my quick tests above: on_conflict: :nothing prevents returning fields from the database because no rows are ā€œactually inserted (or updated)ā€. I'll see if I can improve this section of the docs :wink:
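One commonly used workaround (a sketch, not necessarily the canonical approach; the conflict_target and field names are assumptions) is to force a no-op DO UPDATE so the conflicting row counts as "actually updated" and RETURNING yields it:

Repo.insert(changeset,
  # Overwrite a harmless column with the incoming value so the conflicting
  # row is updated rather than skipped, which makes RETURNING apply to it.
  on_conflict: {:replace, [:updated_at]},
  conflict_target: [:account_id, :type],
  returning: true
)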

2 Likes

Still not free of the race condition, but I really needed the basic find_or_create functionality found in Rails, and I liked this neat trick initially shared on SO: define such a function in the MyApp.Repo module in lib/my_app/repo.ex:

# Defined inside MyApp.Repo, so get_by/2 and insert/1 are the repo's own functions.
def get_or_insert(schema, params) do
  case get_by(schema, params) do
    nil -> insert(schema.changeset(schema.__struct__, params))
    struct -> {:ok, struct}
  end
end

This allows you to call MyApp.Repo.get_or_insert(MyItem, %{var1: "some value"}) for any schema in your app.

Think lots of people searching for this functionality will also land here, so re-sharing.

Upsert doesn't cut it for my use case, as I often need to search against multiple columns, including jsonb ones, and storing huge unique indexes on those columns doesn't make sense.

1 Like

One notable gotcha with this approach in Ecto vs ActiveRecord: ActiveRecord's find_by takes the ā€œfirstā€ matching record and silently ignores anything else, while Ecto's get_by enforces that there is at most one matching record and raises if there are more.

That means that if two racing calls to MyApp.Repo.get_or_insert(MyItem, %{var1: "some value"}) ever manage to both do an insert, every subsequent call to MyApp.Repo.get_or_insert(MyItem, %{var1: "some value"}) will crash.

2 Likes

Which is why the function's body should be wrapped in Repo.transaction, possibly with a lock type specified (depending on the scenario).
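As a sketch of that idea (the table and schema names are assumptions): take an explicit table lock inside the transaction, since a row-level FOR UPDATE lock cannot cover a row that does not exist yet:

def locked_get_or_insert(account_id, type) do
  Repo.transaction(fn ->
    # SHARE ROW EXCLUSIVE conflicts with itself, so concurrent calls serialize
    # here, while plain reads elsewhere proceed unblocked.
    Repo.query!("LOCK TABLE storages IN SHARE ROW EXCLUSIVE MODE")

    case Repo.get_by(Storage, account_id: account_id, type: type) do
      nil -> Repo.insert!(%Storage{account_id: account_id, type: type})
      storage -> storage
    end
  end)
end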

2 Likes

Strong agree. Most use cases for this kind of function should come with a unique constraint so the on_conflict option works perfectly well (I’ve never run into any gotchas around the behavior of returning). If it’s really not possible to use a constraint I would think it’s probably worth it to use a transaction and a lock, since it doesn’t add that much more complexity.

1 Like

Good shout. I'd noticed this and planned to catch the error, but realised that in my use cases it was essentially not possible to have two copies of the same record in the tables I'm using this on, so I went with the original suggestion from SO that doesn't catch.

Can you elaborate on what ā€œessentially not possibleā€ means in your use case if it isn't enforced via a unique constraint? In my experience, if the database allows it, it will eventually happen.

1 Like

On default Postgres, even a transaction and an explicit lock are not enough to prevent a race condition where multiple rows get inserted. You can't take an explicit lock on a row that doesn't exist - for that you need a predicate lock (in practice a range lock), which you get if you run under SERIALIZABLE isolation.

Even under snapshot isolation (I believe this is ā€œrepeatable readā€ in Postgres) you will still see this anomaly: two concurrent transactions will both see that the row is missing in their snapshots, and so both will insert a new row.

So your options here are either to run under serializable isolation (which has a performance cost, as I understand it) or to just use a unique constraint. Frankly, I am struggling to think of a scenario where you want get_or_insert without a unique constraint - the get_or_insert itself essentially is a unique constraint, albeit a buggy one.

Also, even if you run under serializable isolation, I'm pretty sure you still won't get an atomic get_or_insert, because Postgres's ā€œserializable snapshot isolationā€ (SSI) is optimistic and I believe it will just abort one of the transactions. I think the only way to get a truly atomic operation is to do what JosĆ© suggested above and use an insert with on_conflict/returning and a unique constraint.
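To illustrate what "abort one of the transactions" means in practice, here is a hedged sketch (the function name and retry policy are my own assumptions) that runs under SERIALIZABLE and retries the losing transaction:

def serializable_get_or_insert(account_id, type, retries \\ 3) do
  Repo.transaction(fn ->
    # Must run before any other statement in the transaction.
    Repo.query!("SET TRANSACTION ISOLATION LEVEL SERIALIZABLE")

    case Repo.get_by(Storage, account_id: account_id, type: type) do
      nil -> Repo.insert!(%Storage{account_id: account_id, type: type})
      storage -> storage
    end
  end)
rescue
  e in Postgrex.Error ->
    # The aborted transaction fails with SQLSTATE 40001; retry it.
    if retries > 0 and e.postgres[:code] == :serialization_failure do
      serializable_get_or_insert(account_id, type, retries - 1)
    else
      reraise e, __STACKTRACE__
    end
end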

Finally, there is actually another option - you can ā€œmaterialize the conflictā€. In this case that would mean locking another related row which you know exists. For example, imagine you had a table users and another table user_profiles, and you wanted to enforce that there is only one row in the latter for each user, and you didn’t want to use unique constraints or serializable for some reason. You could instead lock the user’s row in users before doing your get_or_insert, which would effectively serialize the operations and protect you from the race condition. This works because the user row already exists, so there’s always a row to lock on.
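A minimal sketch of that users/user_profiles example (the schema names are as described above, everything else is assumed):

import Ecto.Query

def get_or_insert_profile(user_id) do
  Repo.transaction(fn ->
    # Lock the parent row, which is guaranteed to exist; this serializes all
    # concurrent get_or_insert_profile calls for the same user.
    Repo.one!(from u in User, where: u.id == ^user_id, lock: "FOR UPDATE")

    case Repo.get_by(UserProfile, user_id: user_id) do
      nil -> Repo.insert!(%UserProfile{user_id: user_id})
      profile -> profile
    end
  end)
end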

I’ve actually used this last trick for a product I’m working on, which involves some rather nightmarish logic for implementing nested drag/drop reordering with undo/redo on top of postgres rows. Clearly I need to write a blog post about this…

P.S. For those interested in this sort of thing, there is actually a really cool paper about Postgres's SSI implementation.

It’s a great read. Serializable snapshot isolation is a neat trick and is used in some distributed databases too (e.g. CockroachDB and FoundationDB). FoundationDB’s distributed SSI in particular is a really simple and beautiful algorithm. Cockroach’s on the other hand is kinda cursed but I digress.

3 Likes

In my case I only insert records into the table through get_or_insert, so unless the race condition happens there (possible, of course), there is nowhere else a duplicate record can come from.

I'm more after the simpler Rails find_or_create behaviour that doesn't require a DB constraint and quietly returns the first matching record. To match that behaviour more closely, I've updated my code:


  import Ecto.Query, only: [where: 2, first: 1], warn: false

  def get_or_insert(schema, params) do
    case find_first_by(schema, params) do
      nil -> insert(schema.changeset(schema.__struct__, params))
      struct -> {:ok, struct}
    end
  end

  # Quietly returns the first matching record (or nil) instead of raising
  # when more than one row matches.
  def find_first_by(schema, params) do
    params = Map.to_list(params)
    schema |> where(^params) |> first() |> one()
  end

  def get_or_insert!(schema, params) do
    {:ok, record} = get_or_insert(schema, params)
    record
  end

In my use case duplicates aren't the end of the world, but if they become an issue, I'll look into adding constraints. I have bulky jsonb columns on one of the tables where I'm using get_or_insert, and I'm worried that creating unique indexes on them (Postgres) might take too much storage. It would be nice to test whether that's indeed the case, but my time is limited, and this solution does the trick for now.

I don’t know what Rails does, but the reason Ecto lacks this feature is probably due to what I mentioned above: it’s unfortunately just not possible to do this safely without serializable isolation or a constraint.

Obviously you gotta do what you gotta do, but the race will bite you eventually, and probably when it’s least convenient (or however the saying goes…)

To be clear, the json columns would only inflate your index if they’re actually in the constraint, i.e. they are the things that you are checking for uniqueness. That’s rather unusual, but if that’s really the case then you could hash the json (like just dump it into sha256) and then store the hash along with the data and put the constraint on that instead. I’ve actually done this exact thing (except on multiple columns) to diff RSS entries for the very same app I mentioned above.
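As a rough sketch of the hashing trick (the field names, Jason, and the changeset helper are all assumptions on my part):

defp put_content_hash(changeset) do
  # Hash the encoded JSON; assumes the encoding is stable for your data.
  # The unique index then goes on :content_hash instead of the jsonb column.
  json = changeset |> Ecto.Changeset.get_field(:content) |> Jason.encode!()
  hash = :crypto.hash(:sha256, json) |> Base.encode16(case: :lower)
  Ecto.Changeset.put_change(changeset, :content_hash, hash)
end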

Also, if the JSON is what you are checking for uniqueness, that would mean you’re doing your lookup against a column with large JSON blobs without an index. Once that gets slow (it will), you’re going to need an index. At which point, you’re back to square one…

3 Likes

I’ve never had much cause to dig too deep into postgres’ full feature set around this sort of thing, precisely because a constraint is almost always called for and they are what I rely on 95% of the time.

For the other times, however, I have used explicit locking. I am wondering how to square what you're describing with the point @jeremyjh makes above, which specifically distinguishes isolation level from locking. My understanding is that two connections requesting an exclusive lock on a table in order to perform this kind of ā€œfind or insertā€ will be protected against race conditions.

1 Like