Hi there. I am trying to implement logic that does a get_by(..) and, if the record does not exist, inserts it and returns it.
My first implementation looks like this (we are using Postgres):
def get_or_insert(account_id, type) do
  case Repo.get_by(Storage, account_id: account_id, type: type) do
    nil -> Repo.insert(%Storage{account_id: account_id, type: type})
    storage -> {:ok, storage}
  end
end
The storage table has a composite unique index on account_id and type, so this works fine until two identical requests arrive at once. I expected both requests to get back the same storage record, but one of them gets a unique-constraint error because of a race condition: the Repo.get_by calls from both requests are evaluated before either insert happens.
Coming from Rails, I would solve this there with account.lock do ... end to prevent reading the record at the same time, but I am not sure how to do this in Ecto. Any ideas on how to implement this while avoiding the race condition?
So far, this one is the most similar to what I am looking for, but it still only answers the get(id) case.
Yes! Either by pessimistic locking (which you can do by wrapping your code in a Repo.transaction, assuming that you use Postgres or another database that supports them), which locks the whole table,
or by optimistic locking, which requires an extra field in your DB table but is much faster in cases where conflicting writes are uncommon, since it only "locks" one record in the table.
There is also Ecto.Query.lock, but it is more low-level and DB-specific.
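For concreteness, a minimal sketch of both (the :lock_version column, the Storage schema, and the function names are assumptions for illustration):

import Ecto.Query

# Optimistic locking: requires an integer column such as :lock_version,
# defaulting to 0. Repo.update raises Ecto.StaleEntryError if the row
# changed underneath you.
def update_storage(storage, new_type) do
  storage
  |> Ecto.Changeset.change(type: new_type)
  |> Ecto.Changeset.optimistic_lock(:lock_version)
  |> Repo.update()
end

# Pessimistic row-level locking via Ecto.Query.lock (Postgres syntax);
# must run inside Repo.transaction/1 to hold the lock.
def get_storage_for_update(account_id) do
  from(s in Storage, where: s.account_id == ^account_id, lock: "FOR UPDATE")
  |> Repo.one()
end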
Do transactions in Postgres actually lock whole tables? I think it might depend on the isolation level, which is READ COMMITTED by default (I think), and which I don't think locks tables at all; it relies on MVCC instead.
Depending on the problem, you can also use upserts. If you do insert(data, on_conflict: :nothing, returning: true), it will do nothing if the data exists and return all fields from the database.
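In Repo terms that would be something like this (a sketch; the Storage schema and fields are taken from the question above, and the function name is made up):

def upsert_storage(account_id, type) do
  # ON CONFLICT DO NOTHING: no error if a matching row already exists.
  %Storage{account_id: account_id, type: type}
  |> Repo.insert(on_conflict: :nothing, returning: true)
end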
No. Doing an insert does not prevent other transactions doing inserts. You have to use an explicit lock command (in Postgres) to lock the whole table. Doing an update will block other updates on those rows, and may block reads depending on the isolation level of the reader (but not by default in Ecto).
This is not correct. Isolation level will never cause a table to be locked by an insert or any other DML command. Locks and isolation level are separate concepts. According to the Postgres documentation, these are the only commands that lock a full table: ALTER TABLE, DROP TABLE, TRUNCATE, REINDEX, CLUSTER, and VACUUM.
DML commands like insert/update obtain ROW EXCLUSIVE locks.
Isolation level is concerned with isolation between transactions, and with preventing dirty reads/phantom reads. SERIALIZABLE simply means that transactions running in parallel must have the same semantics they would have running one after the other. Yes, this requires read locks and not reading new rows from other transactions, but it does not lock the whole table against inserts.
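To illustrate the explicit lock command mentioned above, a sketch of taking a whole-table lock from Ecto (the "storages" table name is an assumption; SHARE ROW EXCLUSIVE conflicts with itself and with the ROW EXCLUSIVE lock taken by INSERT, so concurrent get-or-insert calls serialize):

def get_or_insert_with_table_lock(account_id, type) do
  Repo.transaction(fn ->
    # Held until the transaction commits or rolls back.
    Repo.query!("LOCK TABLE storages IN SHARE ROW EXCLUSIVE MODE")

    case Repo.get_by(Storage, account_id: account_id, type: type) do
      nil -> Repo.insert!(%Storage{account_id: account_id, type: type})
      storage -> storage
    end
  end)
end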
To make sure I understand this correctly (because TFM seems ambiguous on this, actually more so than the quote above):
When there's a conflict (like a UNIQUE constraint being violated), does the insert/2 function return the conflicting record from the database, discarding the struct or changeset passed to it? And what if multiple records conflict with the passed data, for example when violating constraints on several fields?
Update:
Did some tests (using Postgres), and it seems that given a changeset whose insertion would violate a unique constraint, Repo.insert(changeset, returning: true, on_conflict: :nothing) returns the data passed to it rather than the record from the database. Is this the expected behaviour? If so, what would be the way to obtain the existing record without performing a separate query for that purpose?
As clarified in a separate thread, this is not actually true. Since the function only translates syntax between Elixir and SQL, only the equivalent SQL behaviour remains valid. The PostgreSQL documentation says:
The optional RETURNING clause causes INSERT to compute and return value(s) based on each row actually inserted (or updated, if an ON CONFLICT DO UPDATE clause was used)
This is consistent with the results of my quick tests above: on_conflict: :nothing prevents returning fields from the database, as no rows are "actually inserted (or updated)". I shall see if I can improve this section of the docs.
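One workaround I've seen (a sketch; it assumes a composite unique index on :account_id and :type and a schema with timestamps) is to make the conflict clause perform a harmless update, so the row counts as "actually updated" and RETURNING yields the existing record:

# On conflict, overwrite :updated_at with the incoming value; the row is
# then "updated", so returning: true gives back the DB row.
Repo.insert(changeset,
  on_conflict: {:replace, [:updated_at]},
  conflict_target: [:account_id, :type],
  returning: true
)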
Still not without a race condition, but I really needed the basic find_or_create functionality found in Rails, and I liked this neat trick initially shared on SO: define such a function in the MyApp.Repo module in lib/my_app/repo.ex:
def get_or_insert(schema, params) do
  case get_by(schema, params) do
    nil -> insert(schema.changeset(schema.__struct__, params))
    struct -> {:ok, struct}
  end
end
This then allows you to do MyApp.Repo.get_or_insert(MyItem, %{var1: "some value"}) for any model in your app.
Think lots of people searching for this functionality will also land here, so re-sharing.
Upsert doesn't cut it for my use case, as I often need to search against multiple columns including jsonb, and storing huge unique indexes on these columns doesn't make sense.
One notable gotcha with this approach in Ecto vs ActiveRecord: ActiveRecord's find_by takes the "first" matching record and silently ignores anything else, while Ecto's get_by enforces that there is exactly one matching record and raises if there is not.
That means that if two racing calls to MyApp.Repo.get_or_insert(MyItem, %{var1: "some value"}) ever manage to both do an insert, every subsequent call to MyApp.Repo.get_or_insert(MyItem, %{var1: "some value"}) will crash.
Strong agree. Most use cases for this kind of function should come with a unique constraint so the on_conflict option works perfectly well (I've never run into any gotchas around the behavior of returning). If it's really not possible to use a constraint, I would think it's probably worth it to use a transaction and a lock, since it doesn't add that much more complexity.
Good shout. I've noticed this and planned to catch the error, but realised that in my use cases it was essentially not possible to have two copies of the same record in the tables I'm using this on, so I went with the original suggestion from SO that doesn't catch.
Can you elaborate on what "essentially not possible" means in your use case if it isn't enforced via a unique constraint? In my experience, if the database allows it, it will eventually happen.
On default Postgres settings, even a transaction and an explicit lock are not enough to prevent a race condition where multiple rows get inserted. You can't take an explicit lock on a row that doesn't exist; for that you need a predicate lock (in practice a range lock), which you will get if you run under SERIALIZABLE isolation.
Even under snapshot isolation (I believe this is "repeatable read" in Postgres) you will still have this anomaly. If you have two concurrent transactions, they will both see that the row is missing in their snapshots, and so they will both insert a new row.
So your options here are either to run under serializable isolation (which has a performance hit, as I understand it) or to just use a unique constraint. Frankly, I am struggling to think of a scenario where you want a get_or_insert without a unique constraint; the get_or_insert itself essentially is a unique constraint, albeit a buggy one.
Also, even if you run under serializable isolation I'm pretty sure you still won't get an atomic get_or_insert, because Postgres's "serializable snapshot isolation" (SSI) is optimistic and I believe it will just abort one of the transactions. I think the only way to get a truly atomic operation is to do what Jose suggested above and use an insert/on_conflict/returning with a unique constraint.
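If you do go the serializable route, you have to be prepared to retry the aborted side. A sketch (the helper name is hypothetical; assumes Postgres via Postgrex, which reports the abort as a serialization_failure error):

def serializable(fun, retries \\ 3) do
  Repo.transaction(fn ->
    # Must run before any other statement in the transaction.
    Repo.query!("SET TRANSACTION ISOLATION LEVEL SERIALIZABLE")
    fun.()
  end)
rescue
  e in Postgrex.Error ->
    if retries > 0 and match?(%Postgrex.Error{postgres: %{code: :serialization_failure}}, e) do
      serializable(fun, retries - 1)
    else
      reraise e, __STACKTRACE__
    end
end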
Finally, there is actually another option: you can "materialize the conflict". In this case that would mean locking another related row which you know exists. For example, imagine you had a table users and another table user_profiles, and you wanted to enforce that there is only one row in the latter for each user, and you didn't want to use unique constraints or SERIALIZABLE for some reason. You could instead lock the user's row in users before doing your get_or_insert, which would effectively serialize the operations and protect you from the race condition. This works because the user row already exists, so there's always a row to lock on.
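In Ecto that might look something like this (a sketch using the users/user_profiles example above; schema and field names are assumed):

import Ecto.Query

def get_or_insert_profile(user_id) do
  Repo.transaction(fn ->
    # Row lock on the parent; a concurrent transaction blocks here until
    # the first one commits, serializing the get-or-insert below.
    _user = Repo.one!(from u in User, where: u.id == ^user_id, lock: "FOR UPDATE")

    case Repo.get_by(UserProfile, user_id: user_id) do
      nil -> Repo.insert!(%UserProfile{user_id: user_id})
      profile -> profile
    end
  end)
end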
I've actually used this last trick for a product I'm working on, which involves some rather nightmarish logic for implementing nested drag/drop reordering with undo/redo on top of Postgres rows. Clearly I need to write a blog post about this…
P.S. For those interested in this sort of thing, there is actually a really cool paper about Postgres's SSI implementation:
It's a great read. Serializable snapshot isolation is a neat trick and is used in some distributed databases too (e.g. CockroachDB and FoundationDB). FoundationDB's distributed SSI in particular is a really simple and beautiful algorithm. Cockroach's, on the other hand, is kinda cursed, but I digress.
In my case I only insert records into the given table through get_or_insert, so unless the race condition happens there (possible, of course), there is no other place a duplicate record could be inserted.
I'm more after the simpler Rails find_or_create behaviour that doesn't require DB constraints and quietly returns the first matching record. To match that behaviour more closely, I've updated my code:
def get_or_insert(schema, params) do
  case find_first_by(schema, params) do
    nil -> insert(schema.changeset(schema.__struct__, params))
    struct -> {:ok, struct}
  end
end

import Ecto.Query, only: [where: 2, first: 1], warn: false

def find_first_by(schema, params) do
  params = Map.to_list(params)
  where(schema, ^params) |> first |> one
end

def get_or_insert!(schema, params) do
  {:ok, record} = get_or_insert(schema, params)
  record
end
In my use case duplicates aren't the end of the world, but if they become an issue, I'll look into adding constraints. I have bulky jsonb columns on one of the tables where I'm using get_or_insert, and I'm worried that creating unique indexes (Postgres) might take too much storage. It would be nice to test whether that's indeed the case, but my time is limited, and this solution does the trick for now.
I don't know what Rails does, but the reason Ecto lacks this feature is probably due to what I mentioned above: it's unfortunately just not possible to do this safely without serializable isolation or a constraint.
Obviously you gotta do what you gotta do, but the race will bite you eventually, and probably when it's least convenient (or however the saying goes…)
To be clear, the json columns would only inflate your index if they're actually in the constraint, i.e. they are the things that you are checking for uniqueness. That's rather unusual, but if that's really the case then you could hash the json (like just dump it into sha256) and then store the hash along with the data and put the constraint on that instead. I've actually done this exact thing (except on multiple columns) to diff RSS entries for the very same app I mentioned above.
Also, if the JSON is what you are checking for uniqueness, that would mean you're doing your lookup against a column with large JSON blobs without an index. Once that gets slow (it will), you're going to need an index. At which point, you're back to square one…
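A sketch of the hashing trick (the Item schema, :data_hash column, and Jason encoder are all assumptions; note it also relies on a stable JSON encoding, since key order changes the hash):

def insert_deduped(data) do
  # Digest of the blob; the unique index lives on :data_hash, not the
  # jsonb column itself.
  digest =
    :crypto.hash(:sha256, Jason.encode!(data))
    |> Base.encode16(case: :lower)

  %Item{data: data, data_hash: digest}
  |> Repo.insert(on_conflict: :nothing, conflict_target: :data_hash)
end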
I've never had much cause to dig too deep into Postgres' full feature set around this sort of thing, precisely because a constraint is almost always called for, and they are what I rely on 95% of the time.
For the other times, however, I have used explicit locking. I am wondering how to square what you're describing with the point @jeremyjh makes above, which specifically distinguishes isolation level from locking. My understanding is that two connections requesting an exclusive lock on a table in order to perform this kind of "find or insert" will protect against race conditions.