Background job queues: When to use? When not to use? Which one to use?

PragTob · February 26, 2019, 11:28am

Hello everyone,

I know we had quite some threads (read through lots of them) about background job processing but it remains a hotly debated topic and something especially people migrating over from other languages (especially the ones with a GIL like Ruby or Python) have questions about. A friendly company wanting to try out elixir just asked me whether or not they need a background job processing system and if yes which one, I couldn’t give them a great answer.
I thought I’d summarize my understanding, I’m not an expert by any means, and kick off some discussion. This is where you come in. I’d love all your input, especially if my understanding is correct and on what libraries/setups you might have used and recommend.

When to reach for a background job tool

“You don’t need background job processing in Elixir/Erlang” This is a sentiment I read a lot especially in the early days of this wonderful forum. I think it’s somewhat of a misunderstanding. No I don’t need background job processing systems just to achieve parallelism.

What can I easily do in parallel?

I want to do n things in parallel and aggregate results, example: I want to get data from n different data sources (like recommendation engines) and then aggregate them - Task.async + Task.await =
I just want something to happen but don’t care when it finishes - like for instance image processing, some updated caching data…

In summary, probably tasks that likely don’t fail/can be retried immediately and can be done right now, and you can afford to not turn the system off while they are executing. I know hot code upgrades are a thing, but from my understanding if you don’t absolutely need them they’re discouraged for complexity reasons.

What probably needs some more advanced system?
(please correct me if any of these are easily done in just elixir/erlang)

I want the system to be robust to system restarts/system crashes (which shouldn’t happen right ) - because if these happen then you lose the job that was executing and not done (you can stop the server from shutting down afaik though, which helps for restarts)
I want to have exponential back off retries - this means that a retry might happen 2 hours in the future, which wouldn’t be feasible to delay the application restarting for so long
executing jobs in the future at all - a restart would clear these as well so you’d lose them

As a concrete example for what I think I need a background job queue:
I want to notify a partner system of something:

this needs to be delivered at least once
partner systems are down rather frequently sometimes for hours, so I want to retry ~5 times with exponential back off but also have the possibility to retry manually after that

I like how the exq README puts it:

If you need a durable jobs, retries with exponential backoffs, dynamically scheduled jobs in the future - that are all able to survive application restarts, then an externally backed queueing library such as Exq could be a good fit.

Existing queue systems

rihanna - PostgreSQL storage, uses advisory locks
exq - Redis backed, compatible with Sidekiq format - I like the “do you need exq?” section
verk - Redis based as well, also supports sidekiq format
que - backed by Mnesia which is a database builtin to erlang/otp so no extra infrastructure
toniq - uses redis, hasn’t seen an update in over a year
honeydew - pluggable storage, featuring in memory, Mnesia and ecto queues.
ecto_job - backed by PostgreSQL, focussed on transactional behaviour, uses pg_notify so doesn’t do any database polling afaik (might be true for others here I just know this)
kiq - a rather new library, also redis backed and aiming at sidekiq compatibility, it was under heavy development around the jump of the year
faktory_worker_ex - a worker for Mike Perham’s new more server based system faktory - woud especially be interested in opinions/experiences here.
gen_queue - a generic interface to different queue systems mentioned above and others for flexibility

What I find interesting is that our forum discussions are often very focussed on how we can do it just in the BEAM, which I quite like - but we have comparatively little libraries that implement it BEAM/OTP only. Part because people have problems with mnesia. Something that I found in in the discussions but apparently no library to go along with it is using dets for storage.

other things

I find this exchange between @benwilson512 and @sasajuric very interesting

Also in the discussion of course gen_stage comes up, for processing large amounts of data.

Discussion Points

Are there more things that we should do only in the BEAM/OTP?
What are other scenarios where we should reach for a background job processing system?
What library or setup can you recommend?

lpil · February 26, 2019, 12:20pm

I’m a Rihanna user and recently became a maintainer. For my use Postgres for persistence is the real winner as in my application I care little about performance and a lot about durability. I could attempt to manage the queue state within the cluster but I feel more confident entrusting this job to the fantastic piece of engineering that is Postgres. It also adds no additional operational overhead as I don’t need to form a cluster or add a dedicated external message queue (or Redis).

I’ve previously used Redis backed queues extensively and for me they hit an uncomfortable middle ground. They lack the durability guarantees of Postgres (or similar), they require me to deploy and maintain a Redis cluster, and lack the performance potential of working within the cluster.

keathley · February 26, 2019, 12:22pm

I think your reasoning is correct here. If you need better assurances that you won’t loose jobs if a box is recycled or autoscaled away or otherwise disappears then it makes sense to write those jobs to some sort of storage. You can still run the job processing in your existing BEAMs if that makes sense for your setup.

Another valid use case you mentioned was sending messages between systems. This is a reasonable pattern, especially if you work in a polyglot system. This is a pattern we use at work with both rabbitmq and kafka. If you just need to send messages then rabbit is good. Kafka has really nice properties but its a chore and overkill for most companies IMO.

I haven’t used any of the job processing libs in elixir so I can’t speak to those. Generally though redis is very reasonable for these kinds of operations. I’d personally lean towards that over mnesia/dets. I don’t tend to reach for those tools because 1) My boxes typically have ephemeral file systems and I don’t feel like changing our ops setup and (much much less important) 2) there are some limitations around file sizes with dets and, as a byproduct, mnesia. I think klarna or someone built a leveldb engine for mnesia that looked interesting but I know nothing else about it. I’ve personally seen corruption issues in dets but that was a while ago and presumably that stuff is fixed. But it still made me nervous about those tools and I’ve never taken the time to build that confidence back up. But if you needed to you could presumably do all of this job processing with durable jobs, without ever leaving OTP which is interesting. It just hasn’t been something I’ve needed.

PragTob · February 26, 2019, 1:07pm

Thank you both for your insight @lpil and @keathley!

I love how @lpil put your choice pro rihanna - your app values the robustness higher than the throughput. Naturally what background queue to choose is heavily dependent on your requirement for guarantees of these systems as well as throughput etc.

I didn’t even think about “autoscaling” nodes away, as I’ve never worked in an autoscaling environment File Corruption is a good point, one of the reasons why I tend more to reach for more “robust” solutions such as Postgres or Redis

Thanks!

I think I tried to keep my personal experience a bit out of the initial post, I haven’t used a background queue in elixir (yet) but am otherwise a happy sidekiq (and hence redis) user when it comes to ruby land.

dimitarvp · February 26, 2019, 1:30pm

Could you expand on that, please? I was pondering using Kafka lately and I’m very interested in your reservations to it.

nsweeting · February 26, 2019, 1:42pm

Shameless self-plug… But you may find it useful. I created GenQueue to handle the “background jobs beyond Task” case - allowing one to swap out libraries as needed. Also helps with testing. I personally use the TaskBunny and OPQ adapters alot. Allows the use of the same interface for both libraries which is nice.

PragTob · February 26, 2019, 1:47pm

added it to the list of queue systems, thanks for making us aware of it

keathley · February 26, 2019, 1:50pm

I really like Kafka and have used it heavily for a few years now. But getting it running smoothly can be difficult depending on how familiar you are with ops and JVMs. At the end of the day you’re running a stateful service on a bunch of JVMs. So you’ll need a good understanding of how to get metrics out of the jvm and all of your boxes, you’ll wanna tune your jvms, you’ll want to tune your kafka setup, etc. The data in kafka can’t last forever so typically you window the available data for a limited time; generally no more than a month. So you’ll need a way to rotate the logs and shove them into s3 or some other long term storage. On top of all of that you most likely also need to run a zookeeper as well so you have to do all of that same ops work but this time for zookeeper.

If you’re just getting started I recommend that people vendor their kafka setup. Depending on how much scale you need it’ll probably run you somewhere between $500 and $3000 a month. Thats much cheaper then paying for a dedicated ops team so if its something you really need then it’ll be worth the expense IMO. But if you don’t need kafka’s properties (high write throughput and replicated, durable messages) then it may not be worth it.

dimitarvp · February 26, 2019, 2:23pm

Thanks a lot.

Would you be willing to tell us when you would opt for RabbitMQ and when for Kafka?

yurko · February 26, 2019, 2:40pm

@dimitarvp my 2 cents since I researched the topic some time ago:

Kafka - dumb pipes and smart endpoints
RabbitMQ can have some extra logic like fancier routing

So you move complexity to the MQ or your app, the question is where it belongs, might depend. Also Kafka, being dumb, can handle more traffic, though the limit of them both is more then enough for most of the cases I chose RabbitMQ and was pretty happy but it really depends on the use case.

On Topic: I’d only use MQ to talk to external systems, otherwise GenServer + Tasks + Process.send_after is perfectly fine on the beam side. Postgres for persistance and then Ecto’s streaming to be able to process a lot of entries without exploding the RAM usage.

alvises · February 26, 2019, 3:28pm

This is what we do at work, with rabbitmq (realtime pipeline of microservices processing data from different sources). I think it works really well for realtime data, when an event triggers a task/tasks in one or multiple microservices.

For batch processing… I honestly prefer to use a system just for that, a system that doesn’t use messages but that executes one or multiple tasks, and gives me a log of what happened, something like the old good jenkins, just to say a name (which maybe is overkilling most of the time).
We also use kubernetes cronjobs for recurrent batch processing… but still, I don’t feel as mush in control as having something like sidekiq, jenkins etc…

Anyone of you guys feel the same way on realtime vs batch?

PS: most of the codebase is in python, mainly because machine learning and data science/finance libraries

mikemccall · February 26, 2019, 4:28pm

This was also just announced today

I think if you look at the amount of good available options in the ecosystem as you have listed, it’s clear there is a need for background job queues with a lot of different ways to tackle it. I typically lean towards SQS because it’s cheap, durable, at least once deliverability, widely supported, and it’s managed. (… it can also trigger lambdas… ). My interest in elixir has made me take a deeper look at RabbitMQ. In addition to the obvious purpose as a queue, I like to think of a background job queue as message passing in a language/application agnostic way. i.e.

It makes it easy to introduce new services/technology such as migrating to elixir or use a more fitting technology based on the queue.

90% of our needs for a background job processing system is image processing.

PragTob · February 27, 2019, 7:14pm

I think we’re mixing background job systems and message queues in this thread a bit. If I remember correctly Mike Perham (author of sidekiq, the ruby background job system that is excellent ) defined the difference in (Ithink) this ruby rogues episode as follows: a baclground job system usually is part of the application code (and is hence also written in the same language) while with message queues the consumer is often another application and also potentially in another language.

I know there’s exceptions (as we see with libraries that consume/produce the sidekiq format into redis) or some applications that I know which go through kafka or Message Queues but consume their own messages (for reasons of easy extraction and decoupling).

There definitely is! However, honestly we have so many redis backed queues with a sidekiq format I wouldn’t know which one to pick. I haven’t really investigated them though. I’d personally be happy to have a clear first/main choice (maybe per data store) to concentrate efforts, have more community help etc.

In Ruby you just use Sidekiq and you’re usually fine.

That’s good when the system is yours, my concrete example was meant differently: the system belong to customers or companies we collaborate with (we call both partners usually). A system I maintain usually won’t be done for hours, not the same for some of their old legacy systems

This is very interesting to me - why is that? I assumed you could do image processing just in elixir through processes. Is it that you sometimes get so many of these requests at once so that it’d create too much load on the system? Limited I/O? So that something with a pool of workers where work can pile up more directly is better? Do you need to absolutely make sure none of the jobs gets lost? Or do you have a separate service for image processing (and you use a message queue) so that it’s done in a more efficient language?

edit: haven’t listened to it yet but found Mike talking about Faktory

sorentwo · February 28, 2019, 7:41pm

I’m the author of Kiq and though I could shed some light into why I wrote it when there were already some other redis backed sidekiq formatted job queues out there.

We have been using Sidekiq and Sidekiq Enterprise in an historic Rails app for many years. Once we started most new development in Elixir it became clear that we wanted to run our background jobs in Elixir as well. There was no way that we’d stop development and rewrite nearly a hundred workers in Elixir, so interop with Sidekiq (and features from Pro/Enterprise) was essential.

Some of the older job queues like Exq and Verk would have worked in isolation, but they didn’t fully integrate with Sidekiq. Some particular features that we required:

Full integration with the Sidekiq Enterprise UI (running jobs, running worker nodes, stats)
Expiring jobs
Periodic jobs
Unique jobs
Reliable queues (particularly difficult due to inconsistencies between Ruby and Elixir JSON encoding)
Ad-hoc error reporting
Structured logging

With all of those features in place we’ve been able to run queues in both platforms and slowly migrate workers over one at a time. Typically jobs execute faster and the workers perform more reliably, largely because each queue is managed in isolation. That prevents a bunch of slow jobs in one queue from backing up processing in all of the others.

The internals of Sidekiq are rather scary and the miscellaneous use of data structures makes the system very complex. After Redis Streams were released I started on a fresh approach fully based on streams, which eventually led me to begin another job queue that is stream oriented, but backed by Postgres (it has similarities to Rhianna and EctoJob, but with some key differences). I have some writing to do about that one and I look forward to sharing the details soon.

supernova32 · March 1, 2019, 4:35pm

While I was developing the very first version of AlloyCI I went with exq as it provided something familiar. Coming from the Ruby world, and just starting to learn Elixir, I thought it would be a good idea to use something similar to Sidekiq with similar functionality.

After further learning Elixir, I realized that it was a complete overkill for my use case to have Redis as a dependency for my system, so I ditched exq and replaced it with que.

I was extremely happy with it. It had no external dependencies and provided some persistence for the jobs. It worked perfectly, until Elixir 1.7 was released, then it just crashed every time. It was due to the Amnesia wrapper around mnesia, it used private Elixir functions that had no guarantee of being supported in future releases, so it broke, and stayed broken for months.

At that point I decided to roll out my own background processor just for AlloyCI. It is very simple, and dumb, but does the trick, and is more than enough for my use case. You can check the code here.

The author of Que eventually wrote his own wrapper around mnesia to make it compatible with Elixir 1.7, but at that point I did not require to use it anymore.

Now that I am thinking of adding more functionality to AlloyCI, I might need a background scheduler that can give me better assurances about my jobs. For that I would prefer something with no extra dependencies than the ones I already have, so having a PostgreSQL backed processor would be the ideal scenario for me. From the list provided, I think ecto_job might be the clear winner.

mikemccall · March 1, 2019, 5:25pm

I should have mentioned 90% of our current background jobs processing is image processing. This process is using sidekiq and ruby on internal worker machines. The image processing is cpu intensive and some machines can handle more than others. They come in batches. Sometimes it’s 600 and sometimes it’s 60,000. So going through them all can take some time. I haven’t thought too deeply about how to handle it in elixir. But, I don’t see why we couldn’t remove the external dependencies.

jeremyjh · March 1, 2019, 6:52pm

I thought I would just mention that we’ve been happy users of honeydew for many months in production now. One of the nice things about honeydew’s Ecto queues is that they can be implemented as just a couple of columns added to an existing table. This way you can be sure a given entity has only one job scheduled to run on it at at time, and you can do things like have the default of a new record to be schedule a job related to it. It is also helpful that job schedule is part of the same transaction as other operations you are doing in your application; so a job is only actually scheduled if the transaction it is a part of goes through.

lucaong · March 6, 2019, 3:01pm

I personally very much agree with @PragTob’s analysis of the reasons for/against using a background job queue.

In my company we use RabbitMQ exactly for the aforementioned use-cases: retry with exponential backoff, durable queues in presence of restarts, scheduling jobs into the future (with the delayed-message-exchange plugin). It is true that RabbitMQ is a messaging queue rather than a background job system, but I think it’s fair to consider the two concepts overlapping a bit, at least because a background job system is built upon some sort of message queue.

We find RabbitMQ very versatile: it might not be the fastest kid in the block (although it’s very fast), but its semantics allow for a wide variety of solutions and topologies. I think of it as a great general-purpose tool (we even use it as an MQTT broker for IoT telemetry), as opposed to specialized background job processors like Sidekiq-inspired tools, or more “use-case-optimized” tools like Kafka. The versatility also comes with a slightly steeper learning curve though, as one needs to learn how to use the provided building blocks, rather than a turn-key solution.

In addition to reasons dealing with job completion guarantees, another valid reason to introduce a queue could be as a decoupling device: if there is benefit in the producer of a job not having to know details about the consumer, and the two being developed and/or scaled independently, then a background queue can be a good solution.

bamorim · April 29, 2019, 2:43pm

Really nice discussion here.
Where I work we do care a lot about consistency and not losing any jobs, for that reason we use ecto_job (rihanna would work too).

One thing about consistency is that there is some problems other than losing messages. One of them is the fact that if you are in a database transaction and send a message, your message may be sent but your transaction may still fail. In that scenario, you’ve sent the message too early.
To solve that there is a well-known pattern, the Outbox pattern. In order to achieve that, I needed something that uses the same Ecto.Repo as my application and therefore I can run inside a transaction (when scheduling a new job).

If you need more throughput you may want to have your jobs do not handle database connection. So you may still want to do the heavy lifting using some SQS or RabbitMQ based job, so you can just have your “PSQL backed jobs” just schedule the job in other system.

Up until now, we are fine with PSQL bakcked job queue.

We also use Kafka to publish events (as our pubsub infrastructure), and we use EctoJob as a gateway to publish kafka events, so we avoid publishing events too early or losing events.

koudelka · May 6, 2019, 6:17am

Honeydew’s Ecto Queue might be of interest to you.

It acts as a “follower” and keeps completely out of your insertion transaction. It only comes along later to look for rows that were successfully inserted, but haven’t yet executed a job.