Global_limit regression in 1.6 Pro with workflows?

In 1.5 we used the following pipeline configuration:
pipeline: [local_limit: 4, global_limit: [allowed: 1, partition: [args: [:some_identifier]]]]

This worked as expected: only four workflows ran at a time, partitioned by the specified args.

Essentially, we just build a workflow consisting of multiple steps:

Workflow.new(workflow_name: "Workflow")
|> Workflow.add(:instance, StartInstance.new(%{some_identifier: "abc"}))
|> Workflow.add(:do_stuff, DoStuff.new(%{some_identifier: "abc"}), deps: [:instance])
|> Workflow.add(:do_more_stuff, DoMoreStuff.new(%{some_identifier: "abc"}), deps: [:do_stuff])
|> Workflow.add(:do_extra_stuff, DoExtraStuff.new(%{some_identifier: "abc"}), deps: [:do_more_stuff])

So it’s a linear flow. The :instance job has a priority of 9 (this is the problem we’re solving for: we can only run four instances at a time) and the others use the default of 0. However, we still see new instance jobs being fired off before the workflow has completed. This was not the case in 1.5, where all jobs in a workflow completed before a new instance job could start. If we look at the available jobs, we see all the StartInstance jobs queued up, which is as expected(?), and in available we also see all of the follow-up jobs that are supposed to happen after the StartInstance jobs.

I’m at a bit of a loss here because I don’t see how this could happen given priorities and limits.

This is probably because partitioning in v1.6 is correct (i.e. actually fair), and in v1.5 it was subtly broken in a way that happened to work for your situation.

Setting the :instance job with a lower priority and the other jobs with a higher priority won’t have much effect if there are multiple workflows enqueued at once, as every job after :instance is placed on hold. Since you have a local limit of 4 (and possibly higher, with multiple nodes), you don’t have a guaranteed processing sequence.

A possible alternative is to partition by the workflow name and set the global limit to four:

pipeline: [local_limit: 4, global_limit: [allowed: 4, partition: [meta: [:workflow_name]]]]

Then only four jobs of that workflow type can run simultaneously, and the workflow job’s dependencies will naturally sequence them.
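
For context, here is a rough sketch of where that queue definition might sit in an application's config. The app name (:my_app) and repo module (MyApp.Repo) are placeholders, not taken from this thread, and the sketch assumes the Smart engine is in use, since global limits are a Pro engine feature:

```elixir
# config/config.exs
import Config

# :my_app and MyApp.Repo are hypothetical names; substitute your own.
config :my_app, Oban,
  # Assumption: the Pro Smart engine is configured, as global limits rely on it.
  engine: Oban.Pro.Engines.Smart,
  repo: MyApp.Repo,
  queues: [
    # Up to four jobs per node, and at most four jobs cluster-wide
    # for each distinct workflow_name found in the job's meta.
    pipeline: [
      local_limit: 4,
      global_limit: [allowed: 4, partition: [meta: [:workflow_name]]]
    ]
  ]
```

The workflow above is built with workflow_name: "Workflow", so all of its jobs should land in the same partition and share the limit of four.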

Thanks for the pointer. We got a new batch in today and everything is now running as expected using the suggested alternative.