What is the best way to do stateful background jobs?

By stateful I mean persisted to storage, to be run in the future. In the Rails world we have Sidekiq.

I’ve seen some implementations written with Redis (Verk). However, can’t this be done with tools like Mnesia/ETS tables or even GenServer?

For small jobs that you simply want to run asynchronously, or start a couple of milliseconds or seconds from now, plain GenServers are fine. There is a great function called Process.send_after/3 that sends a message to a given process after the supplied number of milliseconds. You can even have a process that keeps calling itself every so often, by having it send messages to itself using send_after!

If you want to make sure that a job is executed periodically (i.e. at the start of every hour, or once per week at midnight on Monday, etc.), there are different solutions that work better. Using send_after for this is not so good, as the process that you sent something to might have exited for one reason or another in the meantime. Luckily, there are nice libraries that do the hard work for you, such as Quantum, which makes sure that a new process is started to execute a given job at the correct time.
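
A minimal sketch of the "process that keeps messaging itself" idea. The module name `Ticker`, the `:tick` message and the run counter are hypothetical, made up for illustration; only `GenServer` and `Process.send_after/3` come from the standard library.

```elixir
defmodule Ticker do
  use GenServer

  # Hypothetical example module: does some work every `interval` milliseconds
  # and counts how many times it has run.
  def start_link(interval), do: GenServer.start_link(__MODULE__, interval)

  def runs(pid), do: GenServer.call(pid, :runs)

  @impl true
  def init(interval) do
    schedule(interval)
    {:ok, %{interval: interval, runs: 0}}
  end

  @impl true
  def handle_info(:tick, state) do
    # The actual job would go here. Afterwards, re-arm the timer.
    schedule(state.interval)
    {:noreply, %{state | runs: state.runs + 1}}
  end

  @impl true
  def handle_call(:runs, _from, state), do: {:reply, state.runs, state}

  # Sends :tick to this same process after `interval` milliseconds.
  defp schedule(interval), do: Process.send_after(self(), :tick, interval)
end
```

Note that all of this lives in process memory: if the process or the node dies, the pending timer and the counter are gone.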

Note that no Sidekiq, Redis, Crontabs or other external dependencies are necessary. Everything happens inside the BEAM itself.

There’s a lot that can be said here, but I’ll try to keep it brief.

To start with, we’re gonna set aside brief stuff that can just be handled with Task or synchronously. There are legitimate scenarios where you want a persisted queue, and limited concurrency.

There are at least 2 major challenges to overcome, and these challenges are why people still choose Redis. I think there are viable alternatives, but none are nicely encapsulated in a library at the moment.

  1. persistence
  2. consistency

With persistence, when a server goes down and comes back up, how are the job queue and job state rebuilt? GenServers lose their state when they die. An :ets table is lost when its owning process dies. Redis (like a database) lets you offload this problem.

Consistency is another big challenge. If you’re running N nodes, are you running N queues with N worker pools? Or are you trying to have a job queued on Node 1 possibly end up with a worker on Node 2? How do you handle the myriad pitfalls associated with distributed data? :mnesia gives you a lot of answers, but has problematic net split behaviour. There are apparently some libraries in the Erlang world for handling net split recovery in an automated way, but both Mnesia and these other tools lack modern, Elixir-oriented documentation.
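
To illustrate the persistence half of the problem only, here is a rough sketch that keeps pending jobs in a :dets table (the disk-backed sibling of :ets in the standard library), so they survive a restart. The module `DiskQueue`, the table name `:job_queue` and the `{id, job}` layout are assumptions made up for this example.

```elixir
defmodule DiskQueue do
  # Hypothetical helper: a job table backed by a :dets file on local disk.

  # Opens (or creates) the file at `path`. Returns {:ok, table}.
  def open(path) do
    :dets.open_file(:job_queue, file: String.to_charlist(path), type: :set)
  end

  # Stores a job under a unique id.
  def enqueue(table, id, job), do: :dets.insert(table, {id, job})

  # Returns every stored {id, job} tuple, sorted by id.
  def pending(table) do
    table
    |> :dets.match_object(:_)
    |> Enum.sort()
  end

  # Closing flushes the table to disk.
  def close(table), do: :dets.close(table)
end
```

Used like this, a job written before a "restart" is still there afterwards:

```elixir
{:ok, table} = DiskQueue.open("/tmp/jobs.dets")
:ok = DiskQueue.enqueue(table, 1, {:send_email, "a@example.com"})
:ok = DiskQueue.close(table)

# Later, e.g. after the node has come back up:
{:ok, table} = DiskQueue.open("/tmp/jobs.dets")
DiskQueue.pending(table)
```

This is local to one machine, so it does nothing for the consistency question.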

Redis “solves” this by just not being distributed. It’s simple, and for many it works.

These are hard challenges, and I don’t think any of the existing job libraries offer a compelling alternative. Most simply pick the Sidekiq route, which is a reasonable choice to make.

I think alternatives are possible, and it’s an active area of interest for me, but for now you’re stuck rolling your own or just biting the bullet and using Redis.

Wow, I didn’t take into account the net split recovery and distribution.
Great insights, thanks!

I guess this can be done with Mnesia disc_copies tables. As fast as Redis, replicated across your nodes and persistent. Regarding net splits, have a look at these discussions (basically, you should restart Mnesia after a net split) and at unsplit.
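
A minimal single-node sketch of what that could look like, assuming a hypothetical `JobStore` module and a made-up `{:jobs, id, payload, run_at}` record layout. Mnesia spells the table option `disc_copies`. To replicate the table, you would list more nodes in the schema and in `disc_copies`; net split handling is not covered here.

```elixir
defmodule JobStore do
  # Hypothetical table layout: {:jobs, id, payload, run_at}

  def setup(dir) do
    # Tell Mnesia where to keep its files. This must happen before it starts.
    Application.put_env(:mnesia, :dir, String.to_charlist(dir))

    # Returns an error tuple if a schema already exists on disk,
    # which is fine for this sketch.
    _ = :mnesia.create_schema([node()])
    :ok = :mnesia.start()

    result =
      :mnesia.create_table(:jobs,
        attributes: [:id, :payload, :run_at],
        disc_copies: [node()]
      )

    case result do
      {:atomic, :ok} -> :ok
      {:aborted, {:already_exists, :jobs}} -> :ok
    end

    :mnesia.wait_for_tables([:jobs], 5_000)
  end

  # Writes one job inside a transaction. Returns {:atomic, :ok} on success.
  def enqueue(id, payload, run_at) do
    :mnesia.transaction(fn -> :mnesia.write({:jobs, id, payload, run_at}) end)
  end

  # Reads a job by id. Returns {:atomic, [record]} or {:atomic, []}.
  def fetch(id) do
    :mnesia.transaction(fn -> :mnesia.read(:jobs, id) end)
  end
end
```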

Basically, I agree with most of what @benwilson512 said.

The question is really which guarantees you need. Say we have the enqueue_job function, and it returns :ok. Do you want to ensure that the job will be invoked no matter what happens, including situations when all the machines in the cluster are restarted or replaced?

If the answer is yes, then I’d say you need some strongly consistent database to store that info. That db could be Redis, or it could be anything else that meets your requirements. Personally I’m more a fan of SQL for storing critical data, but that’s just me :). You could also consider riak_ensemble, which would allow you to run a strongly consistent k-v database directly from your Elixir-based system (so no external component would be required).

Either way, a reliable db will ensure that a stored job can be executed as long (or as soon) as one db instance and one app instance are running.

If you have less strict requirements, for example it’s ok to skip some jobs, or to occasionally run a job more than once, then you might get away with some lightweight solution such as ETS or GenServer.

@benwilson512 I think this is twice now I’ve seen you vaguely refer to “alternatives.” Do you have any blogs or WIP repos with these ideas? In the back of my mind is an idea to figure out how to do sufficient locking in Postgres to use the db as the queue, for those cases when durability is more important than processing efficiency.