Oban Pro 1.5.0 migration locked our production table

We’re proud and happy users of Oban.Pro , however in the most recent upgrade to v1.5.0, the migrations ended up locking the oban_jobs table for over 10 mins. This caused our production app to fail on incoming requests.
Our Oban.Pruner deletes jobs older than 1 month for auditing reasons, and we therefore have ~2 million jobs in completed state at any given time.We identified the problematic bit of the migration as a generated column.

This SQL needs a table lock and a sequential scan of the table in order to calcule the value of the GENERATED uniq_key column.
We have sucessfully migrated to v1.5.0, since we were able to back up the table and trim it down to 50_000 rows, which ended up taking ~10 seconds for the snippet above to run.
Sharing this as it might be a pitfall for somebody else upgrading to Oban Pro v1.5.0 and to get a conversation started if you have some thoughts or ideas.

Tags

oban-chat

Thread Titles:

For anyone attempting to migrate to Oban Pro v1.5.0 with a huge oban_jobs table, please be aware of a possible table lock

Hi @mrnovalles :surfing_woman:Thank you for posting this in the forum. We’re sure it will help others. We’ll be sure to add some warnings to the documentation.

2 Likes

Yes, with this particular Oban DB Migration, you should be careful or take some steps before proceeding.

Essentially, we decided to split the Oban Pro Ecto migration into multiple steps, and mostly do manual migrations, following the pattern suggested in the article Automatic and manual Ecto migrations written by the folks from Dashbit.

But, first, you have to understand what these DB schema migrations entail, I will not post the code here out of respect to the Oban authors, and also because you can easily find it at lib/oban/pro/migrations/v1_5_0.ex. But, in general lines, there are 5 big steps:

  1. Creates a UDF or user-defined function that essentially translates Oban states into bits. For example, the scheduled state is 0, the available state is 1, and so on. This is safe to apply even as an automatic Ecto migration.
  2. An ALTER TABLE in the oban_jobs table that will add a generated column uniq_key; this column is special because its value will always be computed from other columns or expressions. This step must be applied in a manual migration after reducing the number of rows as much as possible because it will involve a sequential operation and Access Exclusive lock (the strictest one), as much of the ALTER TABLE operations. If you want to inspect all the locks for this query, check the wonderful Safe Ecto Migrations series from David Bernheisel.
  3. It will add a new partial UNIQUE INDEX against the previous column uniq_key, this index is not concurrent.
  4. It will add a new partial index for batches, this index is not concurrent.
  5. It will add a new partial index for chains, this index is not concurrent.

Once we understand these 5 steps and the problems that could arise when you have millions of rows in the oban_jobs table, you can take some actions. The first thing we did was to take advantage of the DynamicPruner to reduce our policy retention and be more aggressive, as you, we have to keep the records around, thankfully, the DynamicPruner provides an option to set up a callback before deleting the records, in that callback, you can perfectly copy the jobs that are about to be deleted into cold storage or another table, your choice.

But while we were on this pruning crusade, we discovered, and then confirmed with the Oban authors, that the DynamicPruner plugin uses the scheduled_at field as a limit, which is fine, the “problem” is that we use a lot of Oban Workflows, and there is a workaround that sets the scheduled_at field way ahead in the future as a way to put “on hold” the workflow dependencies without introducing a new state like available, scheduled, etc. So, you can opt to create your own plugin that considers this case too, allowing you to prune even more records. This bug in the DynamicPruner is fixed in the most recent stable (v1.5.2) release, but at this point, we were using v1.4.14.

Once you have reduced the size of the oban_jobs table, it should be safe to apply the manual migration for step 2, this preferably should be applied at a time when you don’t have too much load in your system, probably during a weekend would be a good idea.

While I understand why the Oban authors decided to avoid creating these indexes concurrently, you must understand that creating these 3 indexes will lock the table for writes (not reads), preventing concurrent inserts, updates, or deletes until the index it’s done. So, we decided to wrap these index creations into their own manual migration and set the option concurrently: true, considering that you must set the module attributes to disable DDL transactions and the migration lock.

Keep in mind that applying the UNIQUE INDEX in the uniq_key column could fail. This means you have duplicate entries for this column, and you have to solve that issue first, and try again.

Once you have applied all these migrations, it should be safe to proceed with the Oban upgrade to v1.5.

3 Likes

The amazing community

2 Likes