Que - Elixir Job Processing with Mnesia

sheharyarn · April 30, 2017, 11:40pm

A few months back I created this thread asking everyone’s opinion on different background job processing libraries available for Elixir and was encouraged to go with a simple GenServer at the time.

I eventually needed to persist job state and was reminded of this table from @sasajuric’s Elixir in Action (and his tweet):

So I decided to stay in BEAM-land and used Mnesia in my project for a while before eventually releasing it as (yet another) background job processing library.

I would really appreciate it if I could get the community’s feedback on the project implementation and some pointers, especially on:

- The way child processes are currently handled
  - Supervisor
  - Task
- Testing the Supervisor and GenServer
  - What’s the right way of testing something like this?
- Adding Delayed Jobs
  - Should I create another GenServer for Delayed Jobs or modify the current one?

Thank you!

You can check out the project on Github:

benwilson512 · May 1, 2017, 8:06am

This definitely seems interesting. Just a few comments:

It seems like the workers are configured under the Que application and not the user application. This can be an issue if the tasks use the user’s database for example.

Que starts, grabs a persisted job, and tries to run it. If the job talks to the database immediately the user’s Repo process may not yet be up.

Unfortunately your Queue module isn’t a queue, it’s a stack. https://github.com/sheharyarn/que/blob/master/lib/que/queue.ex#L72-L89. You’re placing new jobs on the front of a list, and then pulling jobs off the front of the same list, which treats it as a stack, not a queue.

There seems to be no real isolation between queues. This is problematic for a variety of reasons. From a performance perspective everything is serialized through your Que.Server GenServer, which won’t scale as you have more queues or as particular queues become very busy.

It’s also an issue from a fault tolerance perspective. Any issue with one queue will nuke every other queue.

Isolation between workers is limited. https://github.com/sheharyarn/que/blob/master/lib/que/queue_set.ex#L102 goes through each queue completely synchronously and then just runs some N jobs from each queue (https://github.com/sheharyarn/que/blob/master/lib/que/queue.ex#L43) under the single main task supervisor.

Each job is executed in its own process, which is good, but by putting them all under the same supervisor with the same settings, all you need is 5 job failures in 3 seconds to take down all the other jobs. You can of course just adjust the supervisor settings, but even so, different supervisors for different queues would be better.
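As a rough illustration of the "different supervisors for different queues" idea, here is a minimal sketch that gives each queue its own Task.Supervisor. The module name `IsolationSketch` and the supervisor names `:images_tasks` and `:emails_tasks` are hypothetical and not part of Que; the point is only that taking down one queue's task supervisor leaves the other queue's jobs running.

```elixir
defmodule IsolationSketch do
  # Hypothetical layout: one Task.Supervisor per queue, all under a
  # top-level :one_for_one supervisor. Names are illustrative only.
  def start_link do
    children = [
      Supervisor.child_spec({Task.Supervisor, name: :images_tasks}, id: :images_tasks),
      Supervisor.child_spec({Task.Supervisor, name: :emails_tasks}, id: :emails_tasks)
    ]

    Supervisor.start_link(children, strategy: :one_for_one)
  end

  # Runs a job under the task supervisor that belongs to its queue.
  def run(task_supervisor, fun) do
    Task.Supervisor.start_child(task_supervisor, fun)
  end
end
```

With this shape, restart settings such as `:max_restarts` and `:max_seconds` could also be tuned per queue rather than once for everything.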

The root issue is that you do 99% of the stuff with a single GenServer. There also appear to be race conditions. For example, two nodes starting at the same time will both query Mnesia for incomplete jobs. You do that in a transaction, so one will run after another, but since you don’t update the state in that transaction they’re both still gonna get the same list of incomplete jobs and both try to run them.
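One way to close that kind of race is to read and mark the jobs inside the same transaction, so the second caller no longer sees them as incomplete. The following is a minimal sketch under assumptions: the table name `:sketch_jobs`, its attributes, the `:queued` and `:started` statuses, and the `ClaimSketch` module are all made up for illustration and do not reflect Que's actual schema. It uses a RAM-only table on a single node; a multi-node setup would additionally need the table replicated across nodes.

```elixir
defmodule ClaimSketch do
  # Hypothetical table; Que's real schema differs.
  @table :sketch_jobs

  def setup do
    :ok = :mnesia.start()
    {:atomic, :ok} = :mnesia.create_table(@table, attributes: [:id, :status, :owner])
    :ok
  end

  def add(id) do
    {:atomic, :ok} =
      :mnesia.transaction(fn -> :mnesia.write({@table, id, :queued, nil}) end)

    :ok
  end

  # Reads queued jobs under a write lock and marks them as started in the
  # same transaction, so a concurrent caller cannot claim them as well.
  def claim_incomplete(owner) do
    {:atomic, claimed} =
      :mnesia.transaction(fn ->
        jobs = :mnesia.match_object(@table, {@table, :_, :queued, :_}, :write)

        Enum.map(jobs, fn {@table, id, :queued, _owner} ->
          :ok = :mnesia.write({@table, id, :started, owner})
          id
        end)
      end)

    Enum.sort(claimed)
  end
end
```

Calling `ClaimSketch.claim_incomplete/1` twice with different owners hands the queued jobs to the first caller only; the second call gets an empty list.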

I apologize that all of this is so negative. The issue is that while Erlang and Elixir are indeed good fits for the items on that list you have, the reason that they’re good is because Erlang and Elixir offer excellent primitives with which to solve the problem. Those primitives still need to be used correctly, and doing so can actually be pretty hard because the problems themselves are pretty hard.

brightball · May 1, 2017, 1:23pm

Only comment I’d have here is that there is a Ruby library called Que that uses Postgres advisory locks for some speed perks and has gained in popularity a good bit.

Just to avoid naming conflicts it might be a good idea to change the name to something more distinctive? I’d heard that there was potentially going to be an Elixir port of that library and if that happens you’re going to get a lot more confusion. If it was a library from any other language I wouldn’t be that concerned.

sheharyarn · May 1, 2017, 6:27pm

How would I go about implementing it like that?

It is a queue. I’m placing jobs at the end and pulling them from the front. I guess the push and pop method names here are misleading. I should change them to something like in and out.

This is an interesting issue. I don’t know much about OTP to understand how to approach this. Should I create multiple Supervisors under one application, one supervisor supervising multiple supervisors, each for one worker or one supervisor with multiple GenServers each for one Worker? Either way, how can I automatically start the supervisors / genservers for each defined worker? Any tips on doing this the right way?

Not at all. I really appreciate you taking the time out to go through the code and give feedback. My goal is to improve my Elixir and OTP skills.

sheharyarn · May 1, 2017, 6:30pm

I didn’t know about the Ruby Que library. In fact, I “cleverly” came up with the name by replacing K with Q in the Node.js Kue library.

Also, I don’t think there’s a way to change package names once they’ve been published to Hex.

benwilson512 · May 1, 2017, 9:55pm

This is true, my apologies. Unfortunately however its implementation is such that adding N jobs distinctly produces N^2 work. If you do:

for image <- images do
  Que.add(App.Workers.ImageConverter, image)
end

you’re going to do N^2 work, when only linear is necessary if using something like :queue, or following its internal implementation. Doing ++ to the end of a growing list is generally not what you want to do.
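To make the difference concrete, here is a small sketch comparing the two approaches. The module `QueueSketch` and its function names are hypothetical; `:queue.new/0`, `:queue.in/2`, and `:queue.out/1` are from Erlang's standard `:queue` module.

```elixir
defmodule QueueSketch do
  # Appending with ++ copies the whole existing list on every call,
  # so adding N items one by one costs O(N^2) in total.
  def build_with_append(items) do
    Enum.reduce(items, [], fn item, acc -> acc ++ [item] end)
  end

  # :queue.in/2 adds at the rear in amortized constant time,
  # so adding N items is linear overall.
  def build_with_queue(items) do
    Enum.reduce(items, :queue.new(), fn item, q -> :queue.in(item, q) end)
  end

  # Takes the oldest item from the front, or returns :empty.
  def pop(q) do
    case :queue.out(q) do
      {{:value, item}, rest} -> {item, rest}
      {:empty, _} -> :empty
    end
  end
end
```

Both builders keep first-in, first-out order; only the cost of growing the structure differs.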

You provide an API like poolboy for example, where you have a child_spec function and then people place it in their supervision tree in their own application.
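A minimal sketch of what that could look like, assuming a recent Elixir where `use GenServer` defines `child_spec/1` automatically. `SketchJobs.Server`, `MyApp.SupervisorSketch`, the `:image_jobs` name, and the option keys are hypothetical and are not Que's API. Because children start in order, the user can list the job server after their Repo so that persisted jobs never run before the database process is up.

```elixir
defmodule SketchJobs.Server do
  use GenServer

  # `use GenServer` generates child_spec/1, so {SketchJobs.Server, opts}
  # can be listed directly in a supervisor's children.
  def start_link(opts) do
    GenServer.start_link(__MODULE__, opts, name: Keyword.fetch!(opts, :name))
  end

  @impl true
  def init(opts) do
    {:ok, %{worker: Keyword.fetch!(opts, :worker), jobs: :queue.new()}}
  end
end

defmodule MyApp.SupervisorSketch do
  # Stands in for the user's own application supervisor.
  def start_link do
    children = [
      # The user's Repo would be listed here, before the job server.
      {SketchJobs.Server, name: :image_jobs, worker: MyApp.Workers.ImageConverter}
    ]

    Supervisor.start_link(children, strategy: :one_for_one)
  end
end
```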

There’s a lot to say here, I’ll have to reply tomorrow.

OvermindDL1 · May 1, 2017, 10:19pm

This is the use case for a zipper.

sheharyarn · May 3, 2017, 9:04am

Thank you for being patient with me and taking the time out to answer my questions. I went back, did some more research and started refactoring the application trying to implement your suggestions (the OTP-related ones for now):

Created a new ServerSupervisor with :simple_one_for_one strategy that dynamically starts a GenServer child for each worker.
Removed the GenServer from the main application supervision tree and replaced it with ServerSupervisor.
Each spawned Server now handles only one Queue, for that specific worker.

Before I try to do the same for TaskSupervisor, a few questions:

Does this application structure make sense?
For now I’m using {:global, {Que.Server, SomeWorker}} as the GenServer names since it’s simple, but according to this post, using :global introduces extra overhead.
- Should I keep using it?
- Roll my own ServerRegistry GenServer like in this tutorial or go with something like gproc?
- Use Elixir’s own Registry?
- Or take the simplest approach and use the worker module names as the server names?
When a new job is added, the ServerSupervisor checks if a GenServer for that worker already exists. If not, it spawns one. Is this the right approach?
Should I also keep the same structure for TaskSupervisor?
- i.e. TaskSupervisor > TaskSupervisor.SomeWorker > Tasks
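For reference, the "start a server per worker on demand" idea combined with Elixir's own Registry could be sketched roughly as follows. This is an assumption-laden illustration, not the project's code: all `PerWorkerSketch` names are hypothetical, and it uses DynamicSupervisor (available in newer Elixir versions) in place of a :simple_one_for_one Supervisor.

```elixir
defmodule PerWorkerSketch.Server do
  use GenServer

  # Each worker module gets one server, registered under the worker's name.
  def start_link(worker) do
    GenServer.start_link(__MODULE__, worker, name: via(worker))
  end

  def via(worker), do: {:via, Registry, {PerWorkerSketch.Registry, worker}}

  @impl true
  def init(worker), do: {:ok, %{worker: worker, queue: :queue.new()}}
end

defmodule PerWorkerSketch do
  def start_link do
    children = [
      {Registry, keys: :unique, name: PerWorkerSketch.Registry},
      {DynamicSupervisor, strategy: :one_for_one, name: PerWorkerSketch.ServerSupervisor}
    ]

    # :rest_for_one so the servers restart if the Registry itself restarts.
    Supervisor.start_link(children, strategy: :rest_for_one)
  end

  # Returns the pid of the worker's server, starting it if needed.
  # The unique registration resolves races between concurrent callers.
  def ensure_started(worker) do
    spec = {PerWorkerSketch.Server, worker}

    case DynamicSupervisor.start_child(PerWorkerSketch.ServerSupervisor, spec) do
      {:ok, pid} -> pid
      {:error, {:already_started, pid}} -> pid
    end
  end
end
```

Registry is local to a node, so this avoids the cluster-wide coordination of :global, at the cost of names not being visible from other nodes.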

Thanks again.

Qqwy · May 3, 2017, 9:28am

I’d say a zipper is overkill here, as we are only ever appending at the end and reading from the front. You just need one of Okasaki’s purely functional queue algorithms, of which :queue and multiple Elixir wrappers on Hex such as e_queue are implementations.

OvermindDL1 · May 3, 2017, 2:16pm

True, true; I only meant it for the case of doing it manually. And it is not like zippers are difficult: just keep a tuple of two lists, pull off the front of one, and push to the front of the other. When the pull list is empty, reverse the push list and swap them. It is quite efficient and simple.
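The manual two-list approach described above can be sketched in a few lines. The module name `TwoListQueue` is hypothetical; the structure is a pair `{front, back}` where items are popped from `front` and pushed onto `back`.

```elixir
defmodule TwoListQueue do
  # {front, back}: pop from the head of `front`, push onto the head of `back`.
  def new, do: {[], []}

  # Pushing is a constant-time prepend to the back list.
  def push({front, back}, item), do: {front, [item | back]}

  # Both lists empty: nothing to pop.
  def pop({[], []}), do: :empty

  # Front is exhausted: reverse the back list once and use it as the new front.
  def pop({[], back}), do: pop({Enum.reverse(back), []})

  # Normal case: take the oldest item from the front list.
  def pop({[item | front], back}), do: {item, {front, back}}
end
```

Each item is reversed at most once, so a sequence of pushes and pops stays linear overall.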