Oban unique constraint crashes the queue

I recently upgraded Oban and Oban Pro to the latest versions, 2.19.2 and 1.5.2 respectively.

I ran into a situation related to the unique constraint that crashed the queue and caused hundreds of producers to be created for the same queue.

There is a worker with the following unique configuration:

[
  fields: [:queue, :worker, :args],
  keys: [:conversation_id],
  states: [:scheduled, :executing, :retryable]
]

As you can see, the available state was not included in the states list.
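
For context, the worker looks roughly like this (module name, queue, and the perform/1 body are simplified placeholders, not the real code):

defmodule MyApp.Workers.ConversationWorker do
  use Oban.Worker,
    queue: :my_queue,
    unique: [
      fields: [:queue, :worker, :args],
      keys: [:conversation_id],
      states: [:scheduled, :executing, :retryable]
    ]

  @impl Oban.Worker
  def perform(%Oban.Job{args: %{"conversation_id" => _conversation_id}}) do
    # actual work elided
    :ok
  end
end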

There was a job in the available state and another job was scheduled with the same args. The insert worked, but I think it crashed everything when Oban tried to update the scheduled job to executing. The fix was to manually delete the two jobs. The error was:

GenServer {Oban.Registry, {Oban, {:producer, "my_queue"}}} terminating
** (Postgrex.Error) ERROR 23505 (unique_violation) duplicate key value violates unique constraint "oban_jobs_unique_index"

    table: oban_jobs
    constraint: oban_jobs_unique_index

Key (uniq_key)=(KcFMKL8Lc5Yhu9w58TM27eZNdPcftdhRYWHIXYNaygM) already exists.
    (ecto_sql 3.12.1) lib/ecto/adapters/sql.ex:1096: Ecto.Adapters.SQL.raise_sql_call_error/1
    (ecto_sql 3.12.1) lib/ecto/adapters/sql.ex:994: Ecto.Adapters.SQL.execute/6
    (ecto 3.12.5) lib/ecto/repo/queryable.ex:232: Ecto.Repo.Queryable.execute/4
    (oban_pro 1.5.2) lib/oban/pro/engines/smart.ex:1113: Oban.Pro.Engines.Smart.fetch_jobs/2
    (ecto 3.12.5) lib/ecto/multi.ex:897: Ecto.Multi.apply_operation/5
    (elixir 1.18.2) lib/enum.ex:2546: Enum."-reduce/3-lists^foldl/2-0-"/3
    (ecto 3.12.5) lib/ecto/multi.ex:870: anonymous fn/5 in Ecto.Multi.apply_operations/5
    (ecto_sql 3.12.1) lib/ecto/adapters/sql.ex:1400: anonymous fn/3 in Ecto.Adapters.SQL.checkout_or_transaction/4
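
Reconstructing the sequence, it was roughly this (a simplified sketch; the worker module and args are placeholders, and the real inserts happen elsewhere in our code):

args = %{conversation_id: 123}

# One job with these args was already sitting in the available state. Since the
# uniqueness check only looks at scheduled/executing/retryable, it ignored that
# job and the second, scheduled insert was accepted.
{:ok, _available_job} = Oban.insert(MyApp.Workers.ConversationWorker.new(args))
{:ok, _scheduled_job} = Oban.insert(MyApp.Workers.ConversationWorker.new(args, schedule_in: 60))

# Later, when the producer tried to move a job to executing, its uniq_key
# collided with the other job's key and the partial unique index raised the
# 23505 error above.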

I believe the same could happen if there was an executing job and I tried to re-run a canceled/discarded job with the same args.

Is that expected?

For now, I changed my workers to include the available state in the unique configuration.
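
In other words, the configuration is now:

[
  fields: [:queue, :worker, :args],
  keys: [:conversation_id],
  states: [:available, :scheduled, :executing, :retryable]
]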

Checking the unique index, it is on uniq_key with the condition (uniq_key IS NOT NULL). Interestingly, of the two jobs in question only one had a uniq_key; the other’s column was empty, but both had it in the meta.

This particular crash will only happen in the fetch_jobs/3 callback, because it doesn’t catch Postgrex errors. All of the other state transition functions catch that exception and correct the unique violation. That may cause a slowdown, as they have to fix issues one at a time, but it won’t crash the producer.

The upcoming Pro v1.5.3 patch will handle this issue.

The situation itself is expected, because it’s a partial set of unique states. The fact that it crashes the producer isn’t expected and certainly not desirable.

On Slack I suggested that there are only three safe/desirable unique state configurations:

  • successful - ~w(available scheduled executing retryable completed)a (the default)
  • incomplete - ~w(available scheduled executing retryable)a
  • comprehensive - ~w(available scheduled executing retryable completed cancelled discarded)a

Arguably, a setup that only uses scheduled for debouncing can also work, as long as you refrain from using snooze, because snoozing makes the scheduled state reentrant.
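
To illustrate the snooze caveat with a contrived example (names are placeholders): every snooze puts the job back into scheduled, so a scheduled-only unique configuration sees that state become reentrant and the debounce guarantee breaks.

defmodule MyApp.DebouncedWorker do
  # Contrived example: unique on :scheduled only, intended for debouncing.
  use Oban.Worker, unique: [states: [:scheduled]]

  @impl Oban.Worker
  def perform(%Oban.Job{args: args}) do
    if ready?(args) do
      :ok
    else
      # Snoozing moves the job back into the scheduled state, re-entering the
      # only state the uniqueness check covers.
      {:snooze, 60}
    end
  end

  defp ready?(_args), do: true
end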

Two jobs can’t have the same uniq_key; the index prevents that. The collision happens when the updated job is about to enter one of the unique-tracked states. You can identify the conflict by checking the job’s meta, as you noted.
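
For example, something along these lines will surface both jobs by the key stored in meta (the repo name is a placeholder and the key value is taken from the error above):

import Ecto.Query

uniq_key = "KcFMKL8Lc5Yhu9w58TM27eZNdPcftdhRYWHIXYNaygM"

MyApp.Repo.all(
  from j in Oban.Job,
    where: fragment("? ->> 'uniq_key' = ?", j.meta, ^uniq_key),
    select: map(j, [:id, :state, :meta])
)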
