Unexplained rollback after 2.21 upgrade

Soon after updating our Oban mix spec to 2.21, an error report came in about a unexpected return from a Multi transaction involving an Oban insert. None of the changes had any errors so there was no immediately explanation what was causing the error in the App/Ecto logs, just the rollback. After dropping down to pg logs the error was clearly a bad/missing migration to Oban v14 schema, causing an outdated enum definition that was breaking the insert. Is that expected? Our expectation was that class of error would be surfaced in Oban’s query execution or insert return value.

It’s expected because the changeset is valid, and it doesn’t know there’s anything wrong until it attempts to insert and a value is missing from the database enum. That’s unexpected, and something you should be able to identify in the test environment.

We’ve just added a migration verification step to help catch this automatically in tests/CI: Verify migrations at startup in testing mode · oban-bg/oban@6dbea1f · GitHub

1 Like

Unfortunately I don’t think tests would have helped in our case. On further investigate, the migration actually did run, but we are running a multi-tenant system, and forgot to provide the prefix option. As a result, the public schema was at the correct version. It was only when the insert was executed with a non-public prefix did we run into a problem. Definitely not an Oban thing, we run into this issue in various aspects of the system, just part of the maintenance burden of trying to keep multiple schemas in sync. What would have helped us though is if the specific reason for the error had bubbled up to the transaction. That’s what usually happens when a particular schema is out of sync with public, causing an otherwise valid query to fail.

Did some digging, and I’m wondering if the issue could be related to the retry logic in Oban.Repo.transaction? It looks like that was added a while ago but we actually just upgraded from 2.17 all the way to 2.20 so that would explain why we only ran into this now. It seems like what might be happening here is that the original error (about the outdated schema) is being lost when the retries are exhausted, possibly related to the nested transaction?

The change doesn’t require you to run any tests, it automatically checks the migration status when the app starts with oban in a test mode. If you start oban for other prefixes, then it would catch the lack of a migration.

That’s possible, but the original error should bubble up after retry attempts are exhausted.

Not quite following you here. Change references testing mode, but in test env we load the db from a structure file which is based on public, so it wouldn’t have caught a migration issue specific to a tenant, as this was.

the original error should bubble up after retry attempts are exhausted.

If this is the case, then it wouldn’t have caused our issue at least. But in preliminary testing, it seems that in specific conditions this does not happen (inside Multi transaction): Addresses nested transaction losing error information by tfwright · Pull Request #1432 · oban-bg/oban · GitHub

What seems to be happening there is the expected Postgrex error (with specific message about error condition) mutates into a DBConnection error (without context)

No, I suppose it wouldn’t have caught the issue then.

This adds to the evidence that it’s better to use Repo.transact/1 rather than a multi, which have plenty of little edge-cases that can trip you up (not using the prefix passed to the transaction it’s ran in, for example).

If I am understanding the problem correctly (and I’m honestly not at 100% there) I don’t think Multi is at fault here. Repo.transact (and Repo.transaction) would have the same problem since they all share the same SQL transaction handling. It appears to mark the connection as failed and that is what triggers the rollback on next (or possibly final?) retry: db_connection/lib/db_connection.ex at 165ce62b9636ae56d4a0f0810c2aa3803b3cff1e · elixir-ecto/db_connection · GitHub

run/3 and transaction/3 can be nested multiple times. If a transaction is
rolled back or a nested transaction fun raises the transaction is marked as
failed. All calls except run/3, transaction/3, rollback/2, close/3 and
close!/3 will raise an exception inside a failed transaction until the outer
transaction call returns. All transaction/3 calls will return
{:error, :rollback} if the transaction failed or connection closed and
rollback/2 is not called for that transaction/3.

And some logs visualizing the transformation of the error. Looks like it does pass through until the final retry (also a short circuit circuit would be preferable)

[Oban.Repo.transaction] attempt=1
[Oban.Repo.transaction] rescued error=Postgrex.Error message=ERROR 22P02 (invalid_text_representation) invalid input value for enum oban_job_state: “bogus”
*[Oban.Repo.transaction] attempt=2
[Oban.Repo.transaction] rescued error=DBConnection.ConnectionError message=transaction rolling back
[Oban.Repo.transaction] attempt=3
[Oban.Repo.transaction] rescued error=DBConnection.ConnectionError message=connection is closed because of an error, disconnect or timeout
[Oban.Repo.transaction] attempt=4
[Oban.Repo.transaction] rescued error=DBConnection.ConnectionError message=connection is closed because of an error, disconnect or timeout
[Oban.Repo.transaction] attempt=5
[Oban.Repo.transaction] rescued error=DBConnection.ConnectionError message=connection is closed because of an error, disconnect or timeout
[Oban.Repo.transaction] max retries reached, reraising
Expected exception Postgrex.Error but got DBConnection.ConnectionError (connection is closed because of an error, disconnect or timeout)