Processing jobs during deploy

I'm looking for feedback on how we solved some situations we've experienced with exq:

Situation 1.
On start-up of the node, exq will move any entries in the “backup” queue back to the main queue. The issue is that when we deploy a new version of the app, we have a new instance running, but with the same node id. So exq removes the job from the “backup” queue and it gets performed by the new instance, while the older node is still running the same job, which completes successfully. So we have the job executed twice.
Solution:
As suggested in the docs, we can implement a unique node identifier, so that on each deployment we don't touch jobs belonging to the previous deployment's node.
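
For reference, a minimal sketch of what that could look like. The node identifier config is the one mentioned in the Exq docs, but the exact config key and behaviour module names below are from memory, so double-check them against the Exq version in use; `RELEASE_SHA` is just a stand-in for whatever per-deployment value is available:

```elixir
# config/config.exs -- point Exq at a custom node identifier
config :exq, node_identifier: MyApp.DeploymentNodeIdentifier

# lib/my_app/deployment_node_identifier.ex
defmodule MyApp.DeploymentNodeIdentifier do
  # Behaviour name as described in the Exq docs; verify against your version.
  @behaviour Exq.NodeIdentifier.Behaviour

  # Combine the hostname with a per-deployment value so every deployment
  # gets its own backup queues and never touches the previous node's jobs.
  def node_id do
    {:ok, hostname} = :inet.gethostname()
    "#{hostname}-#{System.get_env("RELEASE_SHA", "dev")}"
  end
end
```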

That leads us to situation 2.
Node 1 (the older node) got terminated without finishing all its jobs :astonished:
Solution:
Implement a backup queue cleaner for the previous node. The backup cleaner is a GenServer process which, after a configurable period of time, wakes up, finds the previous node_id, and moves all jobs belonging to that node_id back into the main queue.
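
Roughly, the cleaner could look like the sketch below. It assumes Redix as the Redis client, the backup/main queue key names are guesses at Exq's internal layout (verify them against the Exq version in use), and the previous deployment's node_id is simply passed in as an option:

```elixir
defmodule MyApp.BackupQueueCleaner do
  @moduledoc """
  Periodically moves jobs left in the previous node's backup queue back onto
  the main queue. Key names below are assumptions about Exq's Redis layout.
  """
  use GenServer

  @cleanup_interval :timer.minutes(10)

  def start_link(opts) do
    GenServer.start_link(__MODULE__, opts, name: __MODULE__)
  end

  @impl true
  def init(opts) do
    schedule_cleanup()
    {:ok, Map.new(opts)}
  end

  @impl true
  def handle_info(:cleanup, state) do
    {:ok, conn} = Redix.start_link(state.redis_url)

    backup_key = "exq:queue:backup::#{state.previous_node_id}:#{state.queue}"
    main_key = "exq:queue:#{state.queue}"

    requeue_all(conn, backup_key, main_key)

    Redix.stop(conn)
    schedule_cleanup()
    {:noreply, state}
  end

  # Atomically pop each job off the backup list and push it onto the main queue.
  defp requeue_all(conn, backup_key, main_key) do
    case Redix.command!(conn, ["RPOPLPUSH", backup_key, main_key]) do
      nil -> :done
      _job -> requeue_all(conn, backup_key, main_key)
    end
  end

  defp schedule_cleanup do
    Process.send_after(self(), :cleanup, @cleanup_interval)
  end
end
```

Started under the supervision tree with something like `{MyApp.BackupQueueCleaner, redis_url: "redis://localhost:6379", previous_node_id: "web-1-abc123", queue: "default"}`.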

The question is: is there a better way to make sure we don't re-queue in-progress jobs during deployment?

Why not have it not shut down until its jobs finish processing, and instead just have it stop taking in new jobs?

Exq does that already. We can specify a shutdown timeout, and the older version of the app will use that time to finish jobs and not pick up any new ones. Once all jobs are finished it will shut down. But if a job hasn't finished by the time we reach the timeout, it will shut down anyway and leave the job unfinished. Maybe we are overthinking it… I just see that scenario as possible.
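
For context, that's this kind of setting; the `shutdown_timeout` key name (in milliseconds) is from memory of the Exq docs, so verify it against the version in use:

```elixir
# config/config.exs -- give in-flight jobs up to 30s to finish before shutdown
config :exq, shutdown_timeout: 30_000
```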

Pardon the uninformed question, but why do you need the backup queue at all? What’s so radical about having to restart a node that you need a separate queue?

If you give us some more context, we can come up with more ideas.

Well, that’s how exq with Redis works. It uses backup queues to manage in-progress jobs. When exq performs a job, it takes the job off the queue and places it into a “backup” queue tagged with the identifier of the node that will process it. If the job completes successfully, it is removed from the “backup” queue. If the node crashes for some reason, the job remains in the “backup” queue. On start-up of the node, any entries in the “backup” queue are moved back to the main queue, where they will be processed again. The issue is that when we deploy, we have a new instance running, but with the same node id. So exq assumes there has been a crash, moves the job out of the “backup” queue, and the job is performed again, while the older node is still running it and completes successfully. So we end up processing the job twice. Above I've described how we go about that. I think I'm pretty happy with the solution we came up with. Just wanted to see if that's an idiomatic way to solve this.
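
For anyone unfamiliar with it, that's essentially Redis's “reliable queue” idiom. The sketch below (using Redix, with made-up key names rather than Exq's real ones) shows the lifecycle in isolation:

```elixir
{:ok, conn} = Redix.start_link("redis://localhost:6379")

queue_key  = "exq:queue:default"
backup_key = "exq:queue:backup::node-1:default"

# Picking up a job: atomically move it from the main queue to this node's
# backup queue, then run it.
case Redix.command!(conn, ["RPOPLPUSH", queue_key, backup_key]) do
  nil ->
    :no_job

  job ->
    # ... perform the job ...
    # On success, drop it from the backup queue.
    Redix.command!(conn, ["LREM", backup_key, "1", job])
end

# On start-up, anything still sitting in the backup queue is assumed to
# belong to a crashed run and is moved back onto the main queue.
Redix.command!(conn, ["RPOPLPUSH", backup_key, queue_key])
```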

So why not combine the node ID with a deployment or instance ID (which should change on every start-up) to achieve a unique ID? Or would that break backwards compatibility with Resque and Sidekiq?

This is likely what I would do.

Is in-order processing important? If it is, then the cleaner would be insufficient. In that case you'd probably want the node name to stay the same and make your jobs idempotent, or implement an exactly-once guarantee.
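
As a sketch of the idempotency route: claim a per-job marker key before doing the work, so a duplicate that was re-queued during a deploy becomes a no-op. The key name, TTL and the idea of passing a unique id in the job args are all assumptions for the example:

```elixir
defmodule MyApp.Workers.IdempotentWorker do
  # Sketch of an idempotent worker: the first run claims a marker key, any
  # re-enqueued duplicate sees the key and skips the work. Note that marking
  # before the work trades retry-on-failure for no-duplicates.
  def perform(job_id, args) do
    {:ok, conn} = Redix.start_link("redis://localhost:6379")

    # SET ... NX only succeeds once: "OK" means this run owns the job.
    case Redix.command!(conn, ["SET", "jobs:done:#{job_id}", "1", "NX", "EX", "86400"]) do
      "OK" -> do_work(args)
      nil -> :already_processed
    end
  end

  defp do_work(_args) do
    # ... actual job logic ...
    :ok
  end
end
```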