Performance issues when setting scheduled/retryable -> available

Graborg · August 21, 2024, 1:50pm

Hi!

We’ve been analyzing the Oban queries hitting our Postgres database. We’ve found that approximately every 5s an UPDATE is run to set ‘scheduled’/‘retryable’ jobs to ‘available’. However, the query:

UPDATE
  "public"."oban_jobs" AS o0
SET
  "state" = $1
FROM (
  SELECT
    so0."id" AS "id",
    so0."state" AS "state"
  FROM
    "public"."oban_jobs" AS so0
  WHERE
    (so0."state" IN ($4, $5))
    AND (NOT (so0."queue" IS NULL))
    AND (so0."scheduled_at" <= $2)
  LIMIT
    $3) AS s1
WHERE
  (o0."id" = s1."id")
RETURNING
  o0."id",
  o0."queue",
  s1."state"

takes, on average, 4 seconds to execute. Somehow, no index seems to be used in the subquery (which seems to be the culprit here). However, executing the subquery on its own does use oban_jobs_state_queue_priority_scheduled_at_id_index.
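
For reference, one way to capture the plan of the full statement is EXPLAIN. This is only a sketch: the literal values stand in for $1..$5, which are not shown above, so the states, the timestamp expression, and the limit of 5000 are assumptions. Because the app runs this as a prepared statement, the plan chosen for literals may differ from the one the app gets.

```sql
-- Placeholder literals replace the bind parameters $1..$5 (assumed values).
BEGIN;

EXPLAIN (ANALYZE, BUFFERS)
UPDATE "public"."oban_jobs" AS o0
SET "state" = 'available'
FROM (
  SELECT so0."id" AS "id", so0."state" AS "state"
  FROM "public"."oban_jobs" AS so0
  WHERE so0."state" IN ('scheduled', 'retryable')
    AND so0."queue" IS NOT NULL
    AND so0."scheduled_at" <= (now() AT TIME ZONE 'utc')
  LIMIT 5000
) AS s1
WHERE o0."id" = s1."id"
RETURNING o0."id", o0."queue", s1."state";

-- EXPLAIN ANALYZE really executes the UPDATE, so roll it back.
ROLLBACK;
```

Running the inner SELECT alone under the same EXPLAIN options makes it possible to compare the two plans side by side.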

We’re not so time-dependent that a few seconds really matter to us, but we were curious whether this is a known issue or something that we’re doing wrong.

Thank you!

sorenone · August 21, 2024, 2:43pm

Hello!

Would you provide some info about the environment?
Which version of Oban are you using?
How many jobs are in the db?
Are you using many scheduled and retryable jobs?
What’s the size of your db?

It shouldn’t take 4 seconds. It would help to know more.

Graborg · August 22, 2024, 8:09am

Hi! Sure.

We’re running everything in GCP, so PostgreSQL 15.7 in Cloud SQL:

vCPUs: 2
Memory: 7.5 GB
SSD storage: 20 GB

On the Elixir side we’re running on Cloud Run with Elixir 1.17, Erlang 27.0.1, and Oban 2.17.

scheduled: 7_692_357 
executing: 6_815
retryable: 25
completed: 3_809_248
discarded: 20

Yes, scheduled jobs in particular; there are a lot of them.
The database is around 12 GB, and Oban data is the only thing we store there. These are the sizes of our indexes:

oban_jobs_pkey: 604 MB
oban_jobs_state_queue_priority_scheduled_at_id_index: 1591 MB
oban_jobs_args_index: 5717 MB
oban_jobs_meta_index: 40 MB
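
Numbers like these can be read from the statistics views. A minimal sketch, assuming the jobs table lives in the default public schema:

```sql
-- List every index on oban_jobs with its on-disk size and scan count.
SELECT
  indexrelname AS index_name,
  pg_size_pretty(pg_relation_size(indexrelid)) AS index_size,
  idx_scan AS scans
FROM pg_stat_user_indexes
WHERE schemaname = 'public'
  AND relname = 'oban_jobs'
ORDER BY pg_relation_size(indexrelid) DESC;
```

The scans column also gives a rough hint about which indexes the planner actually uses.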

Thank you!

sorenone · August 22, 2024, 1:44pm

This helps!

One last question and two suggestions:

Do you always have this amount of scheduled jobs?

Looks like the args index is really large.

Two things can help:

Consider putting less data in the job args.
You may want to run the Reindexer: Oban.Plugins.Reindexer — Oban v2.17.0
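
As far as I know, the Reindexer plugin works by periodically issuing REINDEX INDEX CONCURRENTLY for the Oban indexes. A one-off manual equivalent might look like the sketch below, assuming the default public schema and PostgreSQL 12 or later:

```sql
-- Rebuild the bloated GIN index without blocking writes to the table.
-- REINDEX ... CONCURRENTLY cannot run inside a transaction block.
REINDEX INDEX CONCURRENTLY "public"."oban_jobs_args_index";
```

The rebuild needs enough free disk space for a second copy of the index while it runs.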

Graborg · August 26, 2024, 1:58pm

Yes. We will have a lot of scheduled jobs.

Yeah. It seems the reindexer might work. We seem to be able to shrink the index to around 1 GB. Does that sound more reasonable?

Graborg · August 28, 2024, 12:00pm

The reindexer reduced the size of our index to around 1 GB. But the performance issues remain. We currently have a retention of one week. Is it recommended to have a low retention period? Or would that not impact the performance in any meaningful way?

Graborg · August 30, 2024, 8:13am

About 4 hours after reindexing, the query plan seems to have changed. The query is now using the right index and the queries are visibly faster. Thank you!

Edit: I think I’ll start a new thread for this specific problem.

sorenone · September 2, 2024, 2:21pm

Sorry that it took a bit to respond! We were at ElixirConf in Orlando.

A lower retention period can really help performance. It comes down to how many jobs you’re running, more than the total time.

You are most welcome. Great to hear that it worked, that’s marvelous!

New indexes won’t impact the planner until the next vacuum analyze. Ecto also caches prepared statements, and those won’t change until the connections that own them are closed (or the app restarts).
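
If waiting for autovacuum isn’t an option, the statistics can be refreshed by hand. A minimal sketch, assuming the default public schema:

```sql
-- Refresh planner statistics for the jobs table.
ANALYZE "public"."oban_jobs";

-- Check when the table was last analyzed or vacuumed, manually or automatically.
SELECT relname, last_analyze, last_autoanalyze, last_vacuum, last_autovacuum
FROM pg_stat_user_tables
WHERE schemaname = 'public'
  AND relname = 'oban_jobs';
```

Connections that already hold a prepared statement may keep their old plan, so an app restart may still be needed before the change is visible.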

Watching for this^.