hubertlepicki

hubertlepicki

Oban.Web crashes Oban queue

We had a little accident with a production system, which usually manages to process the jobs as they come in, with no problems.

We noticed that a bunch of jobs (20k) were scheduled that shouldn’t have been scheduled. Me, and another developer, we opened Oban Web, clicked around, and then I cancelled the 20,000 jobs using iex session with Oban.cancel_all/1. So far so good.

The jobs seem to have disappeared and we kept observing what’s going on using the Oban Web panel.

Quickly I noticed nothing happens. The jobs were not being processed. In the UI, the queue in question was toggling from “paused” to “running” state but there were barely any jobs in “executing” state.

Finally, the queue disappeared from the list of queues entirely in the Oban.Web.

I tried restarting the app, Oban itself, the database, nothing helped - the issue was very quickly reappearing.

When trying to query for the queue status using iex I was getting:

Oban.check_queue(queue: :my_queue)
** (exit) exited in: GenServer.call({:via, Registry, {Oban.Registry, {Oban, {:producer, "my_queue"}}}}, :check, 5000)
    ** (EXIT) time out
    (elixir 1.13.4) lib/gen_server.ex:1030: GenServer.call/3

At the same time the CPU usage on our Google Cloud SQL PostgreSQL instance was at 100% and we were getting a lot of disconnected errors from Postgrex - also from other parts of the app.

I finally closed the tab with Oban Web, and asked the other developer to do the same, and the issue resolved itself almost immediately, and the jobs started being executed. Once the queue emptied, so did our CPU usage.

I think what was happening is that Oban.Queue.Producer was getting stuck/timeouts/crashing, as a result of something hammering either database or it from Oban Web. At no point in time we had more than 25k jobs in the system, so even with the moderately small database instance we have on production (6GB one) that should not be the case I think.

Are we missing some indexes that Oban.Web needs to use to query the system effectively? @sorentwo any ideas?

Most Liked

sorentwo

sorentwo

Oban Core Team

@hubertlepicki Understood, sorry for the trouble.

I’ve replicated your situation and I’m working through a few solutions currently. I’ll follow up when I have some benchmarks or query plan improvements.

sorentwo

sorentwo

Oban Core Team

Good news! An accurate but partial query variant is 114x faster (from 77.551ms down to 0.653ms) on my development machine. Here’s the explain.depesz for reference. Granted, this is on a local DB instance without any other traffic, but the query plan should hold :crossed_fingers: .

We’re aiming for a new OSS/Web/Pro release tomorrow or maybe over the weekend depending on how quickly I can translate the raw query and cover edge cases.

hubertlepicki

hubertlepicki

@sorentwo we had another accident of similar type and this time I was observing it more closely. I think it may nothing to do with the Web UI.

Again, someone generated a bunch of jobs, that I deleted. Overall, 30k jobs were deleted.

The remaining 12k jobs accumulated overnight that should be processed.

What happened is that the system was not keeping up with processing these jobs now, as the new jobs arrived, so after a couple of hours we were up at 13k jobs and growing.

Investigating, I found that PID of Oban.Producers were indeed crashing, with their message queues indeed growing, but I don’t think I ever saw a message queue longer than 300, and they were actually stuck at this SQL query:

UPDATE "public"."oban_jobs" AS o0 SET "state" = $1, "attempted_at" = $2, "attempted_by" = $3, "attempt" = o0."attempt" + $4 WHERE (o0."id" IN (SELECT ss0."id" FROM (SELECT sso0."id" AS "id", sso0."priority" AS "priority", sso0."scheduled_at" AS "scheduled_at", sso0."worker" AS "worker", sso0."args" AS "args", dense_rank() OVER "partition" AS "rank" FROM "public"."oban_jobs" AS sso0 WHERE ((sso0."state" = 'available') AND (sso0."queue" = $5)) WINDOW "partition" AS (PARTITION BY sso0."args"->>$6 ORDER BY sso0."priority", sso0."scheduled_at", sso0."id")) AS ss0 WHERE (((ss0."rank" <= $7) AND ss0."args" @> $8) OR (((ss0."rank" <= $9) AND ss0."args" @> $10) OR (((ss0."rank" <= $11) AND ss0."args" @> $12) OR (((ss0."rank" <= $13) AND ss0."args" @> $14) OR (((ss0."rank" <= $15) AND ss0."args" @> $16) OR (((ss0."rank" <= $17) AND ss0."args" @> $18) OR (((ss0."rank" <= $19) AND ss0."args" @> $20) OR (((ss0."rank" <= $21) AND ss0."args" @> $22) OR (((ss0."rank" <= $23) AND ss0."args" @> $24) OR (((ss0."rank" <= $25) AND ss0."args" @> $26) OR (((ss0."rank" <= $27) AND ss0."args" @> $28) OR (((ss0."rank" <= $29) AND ss0."args" @> $30) OR (((ss0."rank" <= $31) AND ss0."args" @> $32) OR (((ss0."rank" <= $33) AND ss0."args" @> $34) OR (((ss0."rank" <= $35) AND ss0."args" @> $36) OR (((ss0."rank" <= $37) AND ss0."args" @> $38) OR (((ss0."rank" <= $39) AND ss0."args" @> $40) OR (((ss0."rank" <= $41) AND ss0."args" @> $42) OR (((ss0."rank" <= $43) AND ss0."args" @> $44) OR (((ss0."rank" <= $45) AND ss0."args" @> $46) OR (NOT (ss0."args" @> $47) AND (NOT (ss0."args" @> $48) AND (NOT (ss0."args" @> $49) AND (NOT (ss0."args" @> $50) AND (NOT (ss0."args" @> $51) AND (NOT (ss0."args" @> $52) AND (NOT (ss0."args" @> $53) AND (NOT (ss0."args" @> $54) AND (NOT (ss0."args" @> $55) AND (NOT (ss0."args" @> $56) AND (NOT (ss0."args" @> $57) AND (NOT (ss0."args" @> $58) AND (NOT (ss0."args" @> $59) AND (NOT (ss0."args" @> $60) AND (NOT (ss0."args" @> $61) AND (NOT (ss0."args" @> $62) AND (NOT (ss0."args" @> $63) AND (NOT (ss0."args" @> $64) AND (NOT (ss0."args" @> $65) AND (NOT (ss0."args" @> $66) AND (ss0."rank" <= $67))))))))))))))))))))))))))))))))))))))))) ORDER BY ss0."priority", ss0."scheduled_at", ss0."id" LIMIT $68)) RETURNING o0."id", o0."state", o0."queue", o0."worker", o0."args", o0."meta", o0."tags", o0."errors", o0."attempt", o0."attempted_by", o0."max_attempts", o0."priority", o0."attempted_at", o0."cancelled_at", o0."completed_at", o0."discarded_at", o0."inserted_at", o0."scheduled_at"

This SQL query was taking like 12s until got canceled / Producer crashed & restarted. The thing repeated over and over again until I removed the rate-limiting from our Oban config.

The rate limit config in question is:

    my_queue: [    
      global_limit: 50,    
      local_limit: 20,    
      rate_limit: [    
        allowed: 30,
        period: {1, :minute},
        partition: [fields: [:args], keys: [:account_id]]
      ]
    ]

After removing the rate limiter the jobs were processed properly.

It looks to me that rate limiter SQL query is very very slow if the queue size becomes large. It works well if the queues are being emptied, but if they grow, it becomes a problem.

Where Next?

Popular in Questions Top

rms.mrcs
Hi, I need to transform a list of numbers into a map where the keys are the indexes and the values are the original values of the list. ...
New
lucidguppy
I have a super simple question about elixir - how would I take a file like this foo bar baz and output a new file that enumerates th...
New
Qqwy
Original source of discussion: This topic on the Pragmatic Programmers’ Functional Web Development with Elixir, OTP, and Phoenix forum. ...
New
jononomo
For some reason my phoenix channels are working for me in my local dev environment, but as soon as I deploy via Docker, I get a 403 error...
New
lanycrost
Hi everyone! I need implement if…else if…else condition from my elixir code, and anymore of this control flow structures not work proper...
New
lessless
I believe there are people here who are dealing with CSV files import on the daily basis, and since Excel is a really popular tool there ...
New
jay1
Why is it that the mnesia database isn’t the most preferred database for use in Elixir/Phoenix?
New
earth10
Hi, I’m just starting to build a side-project with Elixir and Phoenix and doing some basic test with Elixir alone. What strikes me is th...
New
shahryarjb
Hello, I have map which I want to convert it to string like this: the map: %{last_name: "tavakkoli", name: "shahryar"} the string I ne...
New
JorisKok
I have a server on AWS, and was running a load test using artillery. When looking at the Phoenix dashboard I see the Ports going to 100% ...
New

Other popular topics Top

vegabook
I’m brand new to Phoenix and I have stripped one of the demo applications to the bone. I just want to get an svg up on the screen. Here i...
New
New
Brian
What is the proper way to load a module from a file in to IEX? In the python world, doing something like this pretty standard: from ....
New
ashish173
I am using Ecto timestamps with postgres, I can see the timestamps() use the :naive_dateime but for my use case I wanted to store the ti...
New
dokuzbir
I want to highlight html closing tags when i click a html tag. That works in .html files but doesnt work for html.eex templates. How can...
New
chrismccord
Phoenix 1.4.0 released Phoenix 1.4 is out! This release ships with exciting new features, most notably with HTTP2 support, improved deve...
688 30877 112
New
shijith.k
I am trying to start a new phoenix project with elixir 1.9, but mix phx.new does not work. It says that ** (Mix) The task "phx.new" could...
New
aalberti333
As the title describes, I’m trying to run Enum.map() over a list of key/value pairs, where the value is a map. My data looks like this: ...
New
alice
Hey, Just curious what are the main benefits of Elixir compared to Clojure? When is Elixir more useful than Clojure and vice versa? Th...
New
siddhant3030
Hi, I have to write a raw query for one of my project. But till now I have used ecto queries and don’t have much experience writing raw ...
New

We're in Beta

About us Mission Statement