Ecto Changeset and Primary key ID being skipped

Hello, I am new to Ecto and am running into a scenario I haven’t encountered.

I have a User Ecto Schema that belongs_to an Organization. I have a unique_constraint set on the User's email field. Here is the gist of what I am doing:

defmodule MyApp.User do
  use Ecto.Schema
  import Ecto.Changeset
  
  schema "users" do
    field :email, :string
    belongs_to :organization, MyApp.Organization
    ...
  end

  
  @doc false
  def changeset(user, attrs) do
    user
    |> cast(attrs, [:email, :password, :organization_id])
    |> cast_assoc(:organization)
    ....
  end

  ...
end

defmodule MyApp.Organization do
  use Ecto.Schema
  import Ecto.Changeset
  
  schema "organizations" do
    field :name, :string
    field :subdomain, :string
    has_many :users, MyApp.User
    ...
  end

  
  @doc false
  def changeset(org, attrs) do
    org
    |> cast(attrs, [:name, :subdomain])
    |> unique_constraint(:subdomain)
    ....
  end

  ...
end

If I go to my UI and try to create a record with an existing email or organization subdomain, I get an invalid changeset and my form is re-rendered. But if I then correct the email and resubmit the form, the record gets inserted, but it skips the next primary key for the users and organizations tables. For example, if the last user has an ID of 5 and the validation fails, the next user is inserted with an ID of 7, skipping 6.

I am sure I’m doing something inconsistent, but any help would be appreciated. Thank you.

Unfortunately, you cannot guarantee that IDs are going to be gapless.

Not sure what you need help with – do you want to generate gapless IDs in the DB (which you have no real control over), or to correct your changeset / form somehow?

You didn’t specify which DB, but this is definitely known behavior for PostgreSQL.

The cause has two parts:

  • the unique_constraint, which works by attempting the INSERT and converting the duplicate-key error the database raises into a changeset error
  • the database’s ID generation (in PG, it’s sequences)

In Postgres, getting the next value from a sequence is immediately visible outside the current transaction - otherwise, every other attempt to get the next value would have to block waiting to see if the first transaction commits.

This also applies if the transaction executes a ROLLBACK; the sequence values will stay updated.
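To see it from the Ecto side, here is a minimal sketch assuming a MyApp.Repo module alongside the changesets above; the IDs and attribute values are purely illustrative:

# First insert succeeds, taking the next value from the sequence, say id 5.
{:ok, _user} =
  %MyApp.User{}
  |> MyApp.User.changeset(%{"email" => "a@example.com", "password" => "secret"})
  |> MyApp.Repo.insert()

# Duplicate email: Postgres evaluates the id DEFAULT first (the sequence
# advances to 6), then the INSERT fails on the unique index, and the
# unique_constraint in the changeset turns that error into {:error, changeset}
# instead of raising.
{:error, _changeset} =
  %MyApp.User{}
  |> MyApp.User.changeset(%{"email" => "a@example.com", "password" => "secret"})
  |> MyApp.Repo.insert()

# Corrected email: the row is inserted with the next sequence value, id 7,
# leaving a gap at 6.
{:ok, _user} =
  %MyApp.User{}
  |> MyApp.User.changeset(%{"email" => "b@example.com", "password" => "secret"})
  |> MyApp.Repo.insert()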

If gapless sequential numbering of a set of rows is important, you should represent that numbering explicitly in your schema and changeset; you’ll need to be careful with transactions etc.
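As an illustration only (not something from this thread), here is a rough sketch of one way to do that with Ecto. It assumes a hypothetical :member_number integer column on users that should be gapless within an organization, and a hypothetical MyApp.Accounts context module:

defmodule MyApp.Accounts do
  import Ecto.Query

  # Hypothetical helper: :member_number is an assumed extra column, numbered
  # per organization without gaps.
  def create_user(org_id, attrs) do
    MyApp.Repo.transaction(fn ->
      # Lock the parent row so concurrent inserts for the same organization
      # serialize here instead of computing the same next number.
      _org =
        from(o in MyApp.Organization, where: o.id == ^org_id, lock: "FOR UPDATE")
        |> MyApp.Repo.one!()

      # Next gapless number = highest existing number within the org + 1.
      next_number =
        from(u in MyApp.User,
          where: u.organization_id == ^org_id,
          select: coalesce(max(u.member_number), 0) + 1
        )
        |> MyApp.Repo.one()

      %MyApp.User{organization_id: org_id}
      |> MyApp.User.changeset(attrs)
      |> Ecto.Changeset.put_change(:member_number, next_number)
      |> MyApp.Repo.insert()
      |> case do
        {:ok, user} -> user
        # Rolling back aborts the transaction and releases the row lock;
        # nothing was committed, so the numbering stays gapless.
        {:error, changeset} -> MyApp.Repo.rollback(changeset)
      end
    end)
  end
end

Note that this trades insert concurrency for gapless numbering: inserts for the same organization are serialized by the row lock. The auto-generated id column still has gaps; the explicit column is what stays sequential.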

Okay, thank you for the responses. I am using PostgreSQL, and it seems I was concerned over nothing.

Don’t worry, you are not doing anything wrong, that is expected behavior.

Disclaimer: I am not claiming to be an expert, and I am happy to learn, but please don’t give me some super deep explanations of “how this actually works on a blahblahblah level” that I won’t understand anyways.

Here is my take on this issue

Think of it like this:
User A submits their form at the exact same time (that pretty much never happens, but for the sake of simplification let's say they all submit within one detectable “instant” for our system) as User B and User C. It looks like this:

User A:
username: iamuser
password: 123password

User B: 
username: iamuser
password: oikjlkjuhjikh

User C:
username: lolololololo
password: asdfasdfasdf

Databases are very complex in how they query things, but the writing operations are (potentially) even more complex. There might be indexing, constraints, unique constraints, etc. However, we are human and we want things to go fast, so we want write operations to be as fast as possible - but also as safe as possible. The only way of guaranteeing a sequence is by restricting the write action to be sequential and non-parallel.
In order to improve write speed, the database might try to split the write operations evenly across your CPU cores (again, not a literal description of what happens, but more than good enough to picture the process) - and that’s where literal chaos ensues at the lowest level. It is so fast that even a tiny variation in how far apart certain “circuits” are in the CPU itself can make a difference - or some kind of OS process kicks in and delays certain CPU executions, etc. Everything here happens extremely fast, but things still have to go through a procedure: tick 1, tick 2, tick 3, and so on. Nothing can go faster than the ticks.

We want IDs to be sequential and to never break sequence, so we are looking to combine parallel processes (which is possible with separate cores) with the absolute minimum of sequential processing.

Writing IDs is, of course, slower than one tick, but you can set a restriction: “don’t do anything until this operation is done, no matter how many ticks it takes.” Naturally, you don’t know how long the rest of the write operation will take, so you want to make the blocking operations as fast as possible and restrict them to an absolute minimum, in order to avoid blocking anything else that might want to use your write operation’s processing power. So let's restrict it to counting up a number, try to never break this sequence (the IDs), and tick the number up every time a write request comes in.

Now what might happen is that User A actually finishes writing after User B. If we hadn’t reserved IDs, we would be in a pickle now, because User C is also about to finish. Would we try to roll back User C’s ID and User B’s ID? What about everything that has been queued afterwards?

TL;DR: It is much safer to just count the IDs up every time an insert is queued and leave them alone afterwards, than to try to fix the operations that follow.


Thank you, this does help me feel better about my setup :grinning: