Oban producer doesn't start

thiagogsr · September 18, 2024, 1:14pm

Hi there,

After enqueuing an Oban workflow with ~950k jobs, the queue disappeared from the web dashboard. I’ve checked the database and there isn’t a single producer for this queue. Is there a runbook for troubleshooting this?

The workflow has 1 job in the available state and all the others are scheduled (on hold), because they all depend on the available one. That job has attempt equal to 2, but no errors, and its attempted_by contains the node name. It seems attempt is 2 because the job was somehow rescued.

sorentwo · September 18, 2024, 2:07pm

That’s a lot of jobs for a single workflow, roughly an order of magnitude more than the maximum I’ve heard of before.

I imagine the query that progresses the workflow is failing and that’s what’s crashing the queue. As of Oban v2.18.3 and Pro v1.5.0-rc.3 the queue should keep restarting, but that’s not going to help you if the query keeps crashing.

What’s the shape of the workflow? Is it all linear, where each job runs in sequence? If so, chaining might perform better.

It sounds like the job was definitely rescued. There will also be a rescued count in the job’s meta that can verify that.
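
As a rough sketch of how to check that, the query below loads the available job and prints its attempt count and meta. `MyApp.Repo` and the queue name `"my_queue"` are placeholders, and the `"rescued"` meta key is assumed from the description above rather than confirmed against a specific Pro version:

```elixir
import Ecto.Query

# MyApp.Repo and "my_queue" are hypothetical placeholders for your own
# repo module and queue name.
jobs =
  Oban.Job
  |> where([j], j.queue == "my_queue" and j.state == "available")
  |> select([j], map(j, [:id, :attempt, :attempted_by, :errors, :meta]))
  |> MyApp.Repo.all()

for job <- jobs do
  # The "rescued" key is an assumption; inspect the whole meta map if it
  # turns out to be stored under a different name.
  IO.inspect(%{
    id: job.id,
    attempt: job.attempt,
    attempted_by: job.attempted_by,
    errors: job.errors,
    rescued: Map.get(job.meta, "rescued")
  })
end
```

An attempt of 2 with an empty errors list and a non-zero rescued count would be consistent with the job having been rescued rather than retried after a failure.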

thiagogsr · September 18, 2024, 2:24pm

The shape is:

The first job depends on nothing;
All the remaining jobs, except the last two, depend on the first;
The last two jobs depend on all previous jobs.
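
For readers following along, a scaled-down sketch of that shape might look like the code below. It assumes the `Oban.Pro.Workflow` builder functions from Pro v1.5 (`new`, `add`) and insertion through `Oban.insert_all/1`; `FirstWorker`, `ItemWorker`, and `FinalWorker` are hypothetical worker modules, and 5 middle jobs stand in for the real ~950k:

```elixir
alias Oban.Pro.Workflow

# FirstWorker, ItemWorker, and FinalWorker are hypothetical worker modules.
item_ids = 1..5
item_names = Enum.map(item_ids, &"item-#{&1}")

# The first job has no dependencies.
workflow =
  Workflow.new()
  |> Workflow.add("first", FirstWorker.new(%{}))

# Every middle job depends only on the first job.
workflow =
  Enum.reduce(item_ids, workflow, fn id, acc ->
    Workflow.add(acc, "item-#{id}", ItemWorker.new(%{id: id}), deps: ["first"])
  end)

# The last two jobs depend on all previous jobs.
all_previous = ["first" | item_names]

workflow
|> Workflow.add("final-a", FinalWorker.new(%{step: "a"}), deps: all_previous)
|> Workflow.add("final-b", FinalWorker.new(%{step: "b"}), deps: all_previous)
|> Oban.insert_all()
```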

We’ve already considered all of the Oban worker types, and workflow was the only one that met all of our needs. It’s a solid, well-tested design, and it’s very rare for a workflow to have this many jobs.

We are running the latest Oban versions.

Is there a way I can confirm that the query that progresses the workflow is failing?

sorentwo · September 18, 2024, 4:41pm

The design makes a lot of sense normally. A workflow of 950k jobs is just beyond the intended use case for a single workflow. I think you’ll need to break it up somehow to get them to process. Otherwise, that first single job is responsible for updating 949,997 other jobs in a single transaction.
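
To make the numbers concrete, here is a small plain-Elixir sketch of the arithmetic. It takes the total as exactly 950,000, and the chunk size of 10,000 is purely a hypothetical value for illustration, not a recommended limit:

```elixir
defmodule WorkflowShape do
  # Jobs that depend directly on the first job: everything except the
  # first job itself and the two final jobs.
  def fan_out(total_jobs) when total_jobs >= 3, do: total_jobs - 3

  # Number of smaller workflows needed if the fan-out jobs are split into
  # groups of at most `chunk_size` (a hypothetical tuning knob).
  def chunk_count(total_jobs, chunk_size) when chunk_size > 0 do
    total_jobs |> fan_out() |> ceil_div(chunk_size)
  end

  # Integer division rounded up.
  defp ceil_div(n, d), do: div(n + d - 1, d)
end

IO.inspect(WorkflowShape.fan_out(950_000))
IO.inspect(WorkflowShape.chunk_count(950_000, 10_000))
```

With those assumptions the first call gives 949,997 dependents, and splitting them into groups of 10,000 would give 95 smaller workflows, each updating far fewer rows per transaction.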

If you’re running Pro v1.5.0-rc.3, one other thing you can try is to increase the xact_timeout from the default of 30 seconds to something long enough to handle that many updates:

my_queue: [local_limit: 10, xact_timeout: :timer.minutes(5)]
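
In context, that queue option would sit inside the Oban config, roughly like the sketch below. The `:my_app` application name, `MyApp.Repo`, and the queue name are placeholders, and the Smart engine is assumed because `local_limit` and `xact_timeout` are Pro queue options:

```elixir
# config/config.exs
import Config

# :my_app and MyApp.Repo are hypothetical names for your own application.
config :my_app, Oban,
  engine: Oban.Pro.Engines.Smart,
  repo: MyApp.Repo,
  queues: [
    # Raise the transaction timeout from the 30 second default to 5 minutes
    # for this queue only.
    my_queue: [local_limit: 10, xact_timeout: :timer.minutes(5)]
  ]
```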

And yes, you can confirm it: there will be an exception raised when the query fails.

thiagogsr · September 18, 2024, 5:03pm

Understood, thank you.

There is indeed a DBConnection.ConnectionError exception happening. I will delete the jobs and find another strategy to run them.