Oban.Web crashes Oban queue

We had a little accident with a production system, which usually manages to process the jobs as they come in, with no problems.

We noticed that a bunch of jobs (20k) were scheduled that shouldn’t have been scheduled. Me, and another developer, we opened Oban Web, clicked around, and then I cancelled the 20,000 jobs using iex session with Oban.cancel_all/1. So far so good.

The jobs seem to have disappeared and we kept observing what’s going on using the Oban Web panel.

Quickly I noticed nothing happens. The jobs were not being processed. In the UI, the queue in question was toggling from “paused” to “running” state but there were barely any jobs in “executing” state.

Finally, the queue disappeared from the list of queues entirely in the Oban.Web.

I tried restarting the app, Oban itself, the database, nothing helped - the issue was very quickly reappearing.

When trying to query for the queue status using iex I was getting:

Oban.check_queue(queue: :my_queue)
** (exit) exited in: GenServer.call({:via, Registry, {Oban.Registry, {Oban, {:producer, "my_queue"}}}}, :check, 5000)
    ** (EXIT) time out
    (elixir 1.13.4) lib/gen_server.ex:1030: GenServer.call/3

At the same time the CPU usage on our Google Cloud SQL PostgreSQL instance was at 100% and we were getting a lot of disconnected errors from Postgrex - also from other parts of the app.

I finally closed the tab with Oban Web, and asked the other developer to do the same, and the issue resolved itself almost immediately, and the jobs started being executed. Once the queue emptied, so did our CPU usage.

I think what was happening is that Oban.Queue.Producer was getting stuck/timeouts/crashing, as a result of something hammering either database or it from Oban Web. At no point in time we had more than 25k jobs in the system, so even with the moderately small database instance we have on production (6GB one) that should not be the case I think.

Are we missing some indexes that Oban.Web needs to use to query the system effectively? @sorentwo any ideas?

I think canceling so many jobs at once may have flooded the Oban.Queue.Producer processes with PubSub notifications. That would saturate each producer’s mailbox, which would then cause all of the gossip checks to fail like you noticed because it is effectively using check_queue. Without the gossip messages, the Web interface wouldn’t keep updating, and the producer crashes could have cascaded and spiked your CPU usage.

Web searches are the only query that isn’t always indexed, which may impact performance. There is an open issue about search improvements (though I doubt that’s the underlying issue here).

This makes sense… but needs fixing, right?

wt., 13 wrz 2022, 20:02 użytkownik Sorentwo via Elixir Programming Language Forum <noreply@elixirforum.com> napisał:

Sounds like it. Any recreation you can provide would help a lot toward that effort.

@sorentwo we had another accident of similar type and this time I was observing it more closely. I think it may nothing to do with the Web UI.

Again, someone generated a bunch of jobs, that I deleted. Overall, 30k jobs were deleted.

The remaining 12k jobs accumulated overnight that should be processed.

What happened is that the system was not keeping up with processing these jobs now, as the new jobs arrived, so after a couple of hours we were up at 13k jobs and growing.

Investigating, I found that PID of Oban.Producers were indeed crashing, with their message queues indeed growing, but I don’t think I ever saw a message queue longer than 300, and they were actually stuck at this SQL query:

UPDATE "public"."oban_jobs" AS o0 SET "state" = $1, "attempted_at" = $2, "attempted_by" = $3, "attempt" = o0."attempt" + $4 WHERE (o0."id" IN (SELECT ss0."id" FROM (SELECT sso0."id" AS "id", sso0."priority" AS "priority", sso0."scheduled_at" AS "scheduled_at", sso0."worker" AS "worker", sso0."args" AS "args", dense_rank() OVER "partition" AS "rank" FROM "public"."oban_jobs" AS sso0 WHERE ((sso0."state" = 'available') AND (sso0."queue" = $5)) WINDOW "partition" AS (PARTITION BY sso0."args"->>$6 ORDER BY sso0."priority", sso0."scheduled_at", sso0."id")) AS ss0 WHERE (((ss0."rank" <= $7) AND ss0."args" @> $8) OR (((ss0."rank" <= $9) AND ss0."args" @> $10) OR (((ss0."rank" <= $11) AND ss0."args" @> $12) OR (((ss0."rank" <= $13) AND ss0."args" @> $14) OR (((ss0."rank" <= $15) AND ss0."args" @> $16) OR (((ss0."rank" <= $17) AND ss0."args" @> $18) OR (((ss0."rank" <= $19) AND ss0."args" @> $20) OR (((ss0."rank" <= $21) AND ss0."args" @> $22) OR (((ss0."rank" <= $23) AND ss0."args" @> $24) OR (((ss0."rank" <= $25) AND ss0."args" @> $26) OR (((ss0."rank" <= $27) AND ss0."args" @> $28) OR (((ss0."rank" <= $29) AND ss0."args" @> $30) OR (((ss0."rank" <= $31) AND ss0."args" @> $32) OR (((ss0."rank" <= $33) AND ss0."args" @> $34) OR (((ss0."rank" <= $35) AND ss0."args" @> $36) OR (((ss0."rank" <= $37) AND ss0."args" @> $38) OR (((ss0."rank" <= $39) AND ss0."args" @> $40) OR (((ss0."rank" <= $41) AND ss0."args" @> $42) OR (((ss0."rank" <= $43) AND ss0."args" @> $44) OR (((ss0."rank" <= $45) AND ss0."args" @> $46) OR (NOT (ss0."args" @> $47) AND (NOT (ss0."args" @> $48) AND (NOT (ss0."args" @> $49) AND (NOT (ss0."args" @> $50) AND (NOT (ss0."args" @> $51) AND (NOT (ss0."args" @> $52) AND (NOT (ss0."args" @> $53) AND (NOT (ss0."args" @> $54) AND (NOT (ss0."args" @> $55) AND (NOT (ss0."args" @> $56) AND (NOT (ss0."args" @> $57) AND (NOT (ss0."args" @> $58) AND (NOT (ss0."args" @> $59) AND (NOT (ss0."args" @> $60) AND (NOT (ss0."args" @> $61) AND (NOT (ss0."args" @> $62) AND (NOT (ss0."args" @> $63) AND (NOT (ss0."args" @> $64) AND (NOT (ss0."args" @> $65) AND (NOT (ss0."args" @> $66) AND (ss0."rank" <= $67))))))))))))))))))))))))))))))))))))))))) ORDER BY ss0."priority", ss0."scheduled_at", ss0."id" LIMIT $68)) RETURNING o0."id", o0."state", o0."queue", o0."worker", o0."args", o0."meta", o0."tags", o0."errors", o0."attempt", o0."attempted_by", o0."max_attempts", o0."priority", o0."attempted_at", o0."cancelled_at", o0."completed_at", o0."discarded_at", o0."inserted_at", o0."scheduled_at"

This SQL query was taking like 12s until got canceled / Producer crashed & restarted. The thing repeated over and over again until I removed the rate-limiting from our Oban config.

The rate limit config in question is:

    my_queue: [    
      global_limit: 50,    
      local_limit: 20,    
      rate_limit: [    
        allowed: 30,
        period: {1, :minute},
        partition: [fields: [:args], keys: [:account_id]]
      ]
    ]

After removing the rate limiter the jobs were processed properly.

It looks to me that rate limiter SQL query is very very slow if the queue size becomes large. It works well if the queues are being emptied, but if they grow, it becomes a problem.

2 Likes

Thank you for the investigation and diagnosis. That is extremely helpful!

I’ll see what can be done to improve that query’s performance or avoid such an expensive query when the number of available jobs is so high.

Yeah… unfortunately we rely on Oban/queue being the buffer where we do expect the jobs to accumulate from time to time - in a different conditions than above. The system talks to external APIs and the queue is there to buffer up the jobs in case these APIs have intermittent issues, we don’t want to drop anything and just queue it up and postpone the job execution if we detect the failure of the API.

Back to the possible solution: I think we don’t have any indices on the partition field in the “args” JSONB field, and I think we can add one. I will run some tests on how this improves the query performance, maybe it’s all we need.

@sorentwo I spent some time trying to optimize the query with some success. Creating an index on the argument we’re partitioning by (tenant_id) allows me to schedule slightly more jobs before things go south, by about 50%.

This is not a solution, however, as it still crashes - just slightly later.

For now I will stop using the rate limit feature as it’s doing more damage than good in our use case.

I think the query, for our case, could be simplified if the rate tracking was was done in an Elixir process, and when we detect that something exceeds limit - we give it to the Oban.Producer to exclude certain things from query. In fact, we have all the building blocks to do that except for dynamically telling producer how to alter it’s query to fetch next jobs (I think it’s not possible).

1 Like

@hubertlepicki Understood, sorry for the trouble.

I’ve replicated your situation and I’m working through a few solutions currently. I’ll follow up when I have some benchmarks or query plan improvements.

4 Likes

Wow this is awesome. Let me know if there’s something I can test.

Good news! An accurate but partial query variant is 114x faster (from 77.551ms down to 0.653ms) on my development machine. Here’s the explain.depesz for reference. Granted, this is on a local DB instance without any other traffic, but the query plan should hold :crossed_fingers: .

We’re aiming for a new OSS/Web/Pro release tomorrow or maybe over the weekend depending on how quickly I can translate the raw query and cover edge cases.

4 Likes

Today I learned about explain.depesz.com – nice nugget, thank you.

2 Likes

Yep. Me too. I wish to see the SQL that got optimized, but @sorentwo didn’t upload. I’ll have to investigate when this gets released as it’s super interesting.

You’re correct, I didn’t upload a revision. I’ll share the final query (privately) along with the release. Speaking of; working through the edge cases and a few other little bugs have delayed the next Pro release until this Friday.

3 Likes