hubertlepicki

Oban.Web crashes Oban queue

We had a little accident with a production system, which usually manages to process the jobs as they come in, with no problems.

We noticed that a bunch of jobs (20k) were scheduled that shouldn’t have been scheduled. Me, and another developer, we opened Oban Web, clicked around, and then I cancelled the 20,000 jobs using iex session with Oban.cancel_all/1. So far so good.

The jobs seem to have disappeared and we kept observing what’s going on using the Oban Web panel.

Quickly I noticed nothing happens. The jobs were not being processed. In the UI, the queue in question was toggling from “paused” to “running” state but there were barely any jobs in “executing” state.

Finally, the queue disappeared from the list of queues entirely in the Oban.Web.

I tried restarting the app, Oban itself, the database, nothing helped - the issue was very quickly reappearing.

When trying to query for the queue status using iex I was getting:

Oban.check_queue(queue: :my_queue)
** (exit) exited in: GenServer.call({:via, Registry, {Oban.Registry, {Oban, {:producer, "my_queue"}}}}, :check, 5000)
    ** (EXIT) time out
    (elixir 1.13.4) lib/gen_server.ex:1030: GenServer.call/3

At the same time the CPU usage on our Google Cloud SQL PostgreSQL instance was at 100% and we were getting a lot of disconnected errors from Postgrex - also from other parts of the app.

I finally closed the tab with Oban Web, and asked the other developer to do the same, and the issue resolved itself almost immediately, and the jobs started being executed. Once the queue emptied, so did our CPU usage.

I think what was happening is that Oban.Queue.Producer was getting stuck/timeouts/crashing, as a result of something hammering either database or it from Oban Web. At no point in time we had more than 25k jobs in the system, so even with the moderately small database instance we have on production (6GB one) that should not be the case I think.

Are we missing some indexes that Oban.Web needs to use to query the system effectively? @sorentwo any ideas?

14 comments

/oban #oban-pro

23 1737 14

2022-10-14 08:36:28 UTC

Most Liked

sorentwo

Oban Core Team

@hubertlepicki Understood, sorry for the trouble.

I’ve replicated your situation and I’m working through a few solutions currently. I’ll follow up when I have some benchmarks or query plan improvements.

Post #9

sorentwo

Oban Core Team

Good news! An accurate but partial query variant is 114x faster (from 77.551ms down to 0.653ms) on my development machine. Here’s the explain.depesz for reference. Granted, this is on a local DB instance without any other traffic, but the query plan should hold .

We’re aiming for a new OSS/Web/Pro release tomorrow or maybe over the weekend depending on how quickly I can translate the raw query and cover edge cases.

Post #11

hubertlepicki

@sorentwo we had another accident of similar type and this time I was observing it more closely. I think it may nothing to do with the Web UI.

Again, someone generated a bunch of jobs, that I deleted. Overall, 30k jobs were deleted.

The remaining 12k jobs accumulated overnight that should be processed.

What happened is that the system was not keeping up with processing these jobs now, as the new jobs arrived, so after a couple of hours we were up at 13k jobs and growing.

Investigating, I found that PID of Oban.Producers were indeed crashing, with their message queues indeed growing, but I don’t think I ever saw a message queue longer than 300, and they were actually stuck at this SQL query:

UPDATE "public"."oban_jobs" AS o0 SET "state" = $1, "attempted_at" = $2, "attempted_by" = $3, "attempt" = o0."attempt" + $4 WHERE (o0."id" IN (SELECT ss0."id" FROM (SELECT sso0."id" AS "id", sso0."priority" AS "priority", sso0."scheduled_at" AS "scheduled_at", sso0."worker" AS "worker", sso0."args" AS "args", dense_rank() OVER "partition" AS "rank" FROM "public"."oban_jobs" AS sso0 WHERE ((sso0."state" = 'available') AND (sso0."queue" = $5)) WINDOW "partition" AS (PARTITION BY sso0."args"->>$6 ORDER BY sso0."priority", sso0."scheduled_at", sso0."id")) AS ss0 WHERE (((ss0."rank" <= $7) AND ss0."args" @> $8) OR (((ss0."rank" <= $9) AND ss0."args" @> $10) OR (((ss0."rank" <= $11) AND ss0."args" @> $12) OR (((ss0."rank" <= $13) AND ss0."args" @> $14) OR (((ss0."rank" <= $15) AND ss0."args" @> $16) OR (((ss0."rank" <= $17) AND ss0."args" @> $18) OR (((ss0."rank" <= $19) AND ss0."args" @> $20) OR (((ss0."rank" <= $21) AND ss0."args" @> $22) OR (((ss0."rank" <= $23) AND ss0."args" @> $24) OR (((ss0."rank" <= $25) AND ss0."args" @> $26) OR (((ss0."rank" <= $27) AND ss0."args" @> $28) OR (((ss0."rank" <= $29) AND ss0."args" @> $30) OR (((ss0."rank" <= $31) AND ss0."args" @> $32) OR (((ss0."rank" <= $33) AND ss0."args" @> $34) OR (((ss0."rank" <= $35) AND ss0."args" @> $36) OR (((ss0."rank" <= $37) AND ss0."args" @> $38) OR (((ss0."rank" <= $39) AND ss0."args" @> $40) OR (((ss0."rank" <= $41) AND ss0."args" @> $42) OR (((ss0."rank" <= $43) AND ss0."args" @> $44) OR (((ss0."rank" <= $45) AND ss0."args" @> $46) OR (NOT (ss0."args" @> $47) AND (NOT (ss0."args" @> $48) AND (NOT (ss0."args" @> $49) AND (NOT (ss0."args" @> $50) AND (NOT (ss0."args" @> $51) AND (NOT (ss0."args" @> $52) AND (NOT (ss0."args" @> $53) AND (NOT (ss0."args" @> $54) AND (NOT (ss0."args" @> $55) AND (NOT (ss0."args" @> $56) AND (NOT (ss0."args" @> $57) AND (NOT (ss0."args" @> $58) AND (NOT (ss0."args" @> $59) AND (NOT (ss0."args" @> $60) AND (NOT (ss0."args" @> $61) AND (NOT (ss0."args" @> $62) AND (NOT (ss0."args" @> $63) AND (NOT (ss0."args" @> $64) AND (NOT (ss0."args" @> $65) AND (NOT (ss0."args" @> $66) AND (ss0."rank" <= $67))))))))))))))))))))))))))))))))))))))))) ORDER BY ss0."priority", ss0."scheduled_at", ss0."id" LIMIT $68)) RETURNING o0."id", o0."state", o0."queue", o0."worker", o0."args", o0."meta", o0."tags", o0."errors", o0."attempt", o0."attempted_by", o0."max_attempts", o0."priority", o0."attempted_at", o0."cancelled_at", o0."completed_at", o0."discarded_at", o0."inserted_at", o0."scheduled_at"

This SQL query was taking like 12s until got canceled / Producer crashed & restarted. The thing repeated over and over again until I removed the rate-limiting from our Oban config.

The rate limit config in question is:

    my_queue: [    
      global_limit: 50,    
      local_limit: 20,    
      rate_limit: [    
        allowed: 30,
        period: {1, :minute},
        partition: [fields: [:args], keys: [:account_id]]
      ]
    ]

After removing the rate limiter the jobs were processed properly.

It looks to me that rate limiter SQL query is very very slow if the queue size becomes large. It works well if the queues are being emptied, but if they grow, it becomes a problem.

Post #5