Running Quantum jobs from a single node

We have an Elixir application deployed on two servers. The application has been configured with some Quantum jobs, which currently execute on both servers at the same time. We want to limit execution of the Quantum jobs to a single server. From the documentation, we could see the following options:

  1. Add global?: true to the job configuration
  2. run_strategy: {Quantum.RunStrategy.Random, :cluster}

We tried option (1), but it didn't work. The job still executed on both servers. Do both (1) and (2) need to be set at the same time for a job, or is there something else missing in the configuration?

config :titan, Titan.Scheduler,
  timeout: 20_000,
  global?: true,
  jobs: [
    job_schedule: [
      schedule: "20 6 * * *",
      task: {Titan.ContentSchedule, :schedule_updates, []}
    ]
  ]

How does this clustering work? Even without giving the list of nodes, how do the two servers communicate among themselves to co-ordinate the job execution?


Global has been removed in 3.0.0 (https://github.com/quantum-elixir/quantum-core/blob/0ac628950b8de0a03ed914f1ca0c77c9ad93c4ac/CHANGELOG.md), which kind of solves the problem, because the implementation wasn't great and was causing us a lot of trouble with unexpected behavior.

You probably want to use Oban, which guarantees scheduled job uniqueness at the PostgreSQL level.
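For a cron-style job that would look roughly like this (the worker module is hypothetical and the repo name is just a guess based on the question, but the schedule is the one from your config); since every insert and execution goes through the database, only one node picks up each scheduled run:

config :titan, Oban,
  repo: Titan.Repo,
  plugins: [
    {Oban.Plugins.Cron,
     crontab: [
       # Hypothetical worker module, running daily at 06:20 like the Quantum job.
       {"20 6 * * *", Titan.Workers.ScheduleUpdates}
     ]}
  ]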

You can also use something like https://github.com/SchedEx/SchedEx, but you will have to wrap it with your own code to make sure only one process per job is started in the cluster. I have been doing that with Horde.DynamicSupervisor. But honestly, if asked to implement it today, I would just use Oban.
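A rough sketch of that combination (module and supervisor names are made up, and the exact Horde options depend on your version and cluster setup):

# In the application supervision tree: one cluster-wide dynamic supervisor.
children = [
  {Horde.DynamicSupervisor,
   name: Titan.DistributedSupervisor,
   strategy: :one_for_one,
   members: :auto}
]

# After startup, register one SchedEx process per job. Horde keeps the
# process alive on exactly one node of the cluster at a time.
Horde.DynamicSupervisor.start_child(Titan.DistributedSupervisor, %{
  id: :schedule_updates,
  start: {SchedEx, :run_every,
          [Titan.ContentSchedule, :schedule_updates, [], "20 6 * * *"]},
  restart: :transient
})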

Thanks. I forgot to mention that the project uses Quantum 2.2.
Right now we don't have the option to go with another solution; we have to make this work with the current versions of the libraries if possible.
Would setting a run strategy work?
run_strategy: {Quantum.RunStrategy.Random, :cluster}
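If I read the docs correctly, it is a per-job option, so the config above would become something like:

config :titan, Titan.Scheduler,
  timeout: 20_000,
  global?: true,
  jobs: [
    job_schedule: [
      schedule: "20 6 * * *",
      task: {Titan.ContentSchedule, :schedule_updates, []},
      # Pick one random node of the cluster for each run.
      run_strategy: {Quantum.RunStrategy.Random, :cluster}
    ]
  ]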

I had this same issue, where I had to move off Oban because it locked a table and took our entire application down. I moved to Quantum 3 and used highlander to have Quantum run in only one process:
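The wiring is small; in the application's supervision tree the scheduler gets wrapped by Highlander instead of being started directly (module names here are illustrative, not from our app):

children = [
  # Highlander registers the child via :global, so the Quantum scheduler
  # runs on only one connected node at a time; if that node goes down,
  # another node starts it.
  {Highlander, MyApp.Scheduler}
]

Supervisor.start_link(children, strategy: :one_for_one, name: MyApp.Supervisor)

Note that this relies on the nodes being connected over distributed Erlang.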


Ah, that's quite the opposite of what I did. I used Horde.DynamicSupervisor to distribute sched_ex processes across the cluster. If I understand correctly, in your solution there's always going to be a single node that executes these scheduled tasks, all on the same node.

Interesting to learn about Oban locking the scheduled jobs table; I never experienced such an issue.

I would like to hear more about this as well. What version of Oban were you using? How many jobs were you scheduling?

Scheduling only uses an advisory lock, which isn’t associated with a table at all. I’m very curious!

Any suggestions on how to get the RunStrategy working so that the jobs run on a single node in the cluster? Somehow, setting the RunStrategy to Random did not have any effect on the jobs; they were still executing from both nodes.

Correct, we didn’t need the crons to be distributed. Just needed to ensure only one process was running them.

We were using oban 2.0.0, oban_pro 0.3.0, and oban_web 2.0.0. It was a recent upgrade to those versions, with the following plugins:

 Oban.Plugins.Pruner,
 Oban.Pro.Plugins.Lifeline,
 Oban.Web.Plugins.Stats

Database CPU was pinned at 100% and our DBA said it was related to some queries to the oban_jobs table.
I don’t have exact numbers as the table eventually got pruned, but I believe there were hundreds of thousands of completed job records. My guess was that the pruning plugin locked things up, but I haven’t had a chance to look into it.

Hey! I recently solved this problem of running globally unique jobs in a clustered Elixir application. The project was written using Quantum 2, which had built-in support for clustering via the global: true mode. In our experience the implementation of global jobs was unreliable and we found that many jobs would stop executing entirely.

After doing some research, we ended up moving from Quantum to periodic for the job scheduler, combined with highlander to ensure uniqueness in the cluster.
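The shape of it, roughly (the job module and interval are placeholders, not the real jobs):

children = [
  # Periodic triggers the work on a fixed interval; Highlander ensures
  # only one copy of that Periodic process exists in the cluster.
  {Highlander,
   {Periodic,
    run: fn -> MyApp.ContentSchedule.schedule_updates() end,
    every: :timer.hours(24)}}
]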

I have a proof-of-concept here that shows the basic functionality.


It’s a 404 link, maybe the repo is private?

Fixed! Thanks for pointing that out :smile:


I haven't had any other reports of that; I would like to have seen the queries.

That is entirely normal. Even the demo at Jobs • Oban has 600,000 completed jobs sitting around.

Any idea if you were using per-worker or per-state dynamic pruning? The pruner deletes 10k records every minute by default, which is extremely fast.
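For reference, the stock pruner can be throttled through its options; something like this (values purely illustrative, and option availability depends on your Oban version) spreads the deletes out:

plugins: [
  # Keep finished jobs for 5 minutes and delete at most 5_000 rows per pass.
  {Oban.Plugins.Pruner, max_age: 300, limit: 5_000}
]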

Here are the 2 queries our DBA sent me:

UPDATE "public"."oban_jobs" AS o0 SET "state" = $1 WHERE (o0."id" IN (SELECT so0."id" AS "id" FROM "public"."oban_jobs" AS so0 WHERE (so0."state" IN (?,?)) AND (so0."queue" = $2) AND (so0."scheduled_at" <= $3) FOR UPDATE SKIP LOCKED))

UPDATE "public"."oban_jobs" AS o0 SET "state" = $1, "attempted_at" = $2, "attempted_by" = $3, "attempt" = o0."attempt" + $4 WHERE (o0."id" IN (SELECT so0."id" AS "id" FROM "public"."oban_jobs" AS so0 WHERE (so0."state" = ?) AND (so0."queue" = $5) ORDER BY so0."priority", so0."scheduled_at", so0."id" LIMIT $6 FOR UPDATE SKIP LOCKED)) RETURNING o0."id", o0."state", o0."queue", o0."worker", o0."args", o0."errors", o0."tags", o0."attempt", o0."attempted_by", o0."max_attempts", o0."priority", o0."att

I’m not sure, I didn’t originally set it up. Where would that be configured?

Those are the primary queries used to stage scheduled jobs and to fetch them for execution. They’re fully indexed and should be very fast, < 1ms under normal load. How many queues are/were you running? I’d love to know more to help you all, or other people in a similar situation.

It would be configured in the plugins section of your config, using the Dynamic Pruning Plugin. That is all moot though, since the queries you shared aren’t from pruning.


Sure, there were 25 queues running. I also ran SELECT * FROM oban_jobs_id_seq to see the last ID used, and it was 2915796. So we had close to 3 million jobs, although I don’t know how many were in there at the time of the spike.

Ah ok. I assumed it was the pruner because it happened right after enabling it. It could’ve been a coincidence. These were the only three plugins running at the time:

plugins: [
  Oban.Plugins.Pruner,
  Oban.Pro.Plugins.Lifeline,
  Oban.Web.Plugins.Stats
]

There is also the Citrine package, which aims to solve the cron clustering problem.