Jumbo: New job queueing library

Hello,

I’ve just published 1.0.0 of Jumbo - a new job queueing library: https://github.com/mspanc/jumbo

I was a bit disappointed by the lack of stability of both Exq and Toniq under very heavy workloads, so I’ve written a pure OTP library.

Waiting for comments!

I just read the readme. In one part you call the queues SampleApp.QueueHeavy and SampleApp.QueueLight, and in another SampleApp.Queue.Heavy and SampleApp.Queue.Light (notice the additional dot after Queue). Is it simply different naming without any consequences, or is there more to it?

Awesome!

One thing that bothers me is the requirement to have a module that implements the perform function. Why not allow the user to pass module, function, args - just like in a task - instead? This would make it much more generic, and allow using it with functions that weren’t implemented with this library in mind.
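
To illustrate the suggestion, here is a minimal sketch of a job expressed as a module/function/args tuple and run with `apply/3`. The `MfaJobSketch` module and its functions are hypothetical names for illustration and are not part of Jumbo’s API.

```elixir
defmodule MfaJobSketch do
  # Hypothetical job representation: a plain {module, function, args} tuple.
  def new(module, function, args)
      when is_atom(module) and is_atom(function) and is_list(args) do
    {module, function, args}
  end

  # Runs the job by applying the function to its arguments.
  def perform({module, function, args}), do: apply(module, function, args)
end

# Any existing function can be used as a job, here String.upcase/1.
job = MfaJobSketch.new(String, :upcase, ["thumbnail.png"])
IO.inspect(MfaJobSketch.perform(job))
```

With this shape a job does not need a dedicated module with a perform function.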

I think I just made a typo.

Nice suggestion! I’ll change this in a future release.

@mspanc I was having a look at the docs and something made me curious (I have little knowledge of job queue libs, so don’t mind me if this sounds naive): Without persistence, what happens if the machine goes down? Don’t you lose any enqueued jobs? It seems to me that any queueing library that has to do business critical work, should have a persistence mechanism, no?

That is a big assumption, that all the stuff in the queue is business-critical. In my experience, for a lot of systems, the background queue is where you push stuff like sending e-mail, generating thumbnails etc.

Sending e-mails is probably the most common case, and even in simple systems you want to push it to the background queue. And this can fail for other reasons too, so it is not reliable by definition.

There’s plenty of queue use cases that are more complicated but can do well without persistence.

There are a few issues with the error handling.

The naive assumptions about the :DOWN reasons are incorrect in many situations, the simplest being a failed GenServer.call. A library should not (because it cannot) make assumptions about what happened based on the exit reason of a process. However, we are able to format exit reasons with the Exception module, which will get it right 99% of the time.
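
As a small illustration of the failed GenServer.call case, the sketch below monitors a process that calls a server that does not exist and returns the reason from the :DOWN message. `ExitReasonSketch` and the name `:no_such_server_sketch` are made up for this example.

```elixir
defmodule ExitReasonSketch do
  # Runs `fun` in a monitored process and returns the reason
  # carried by its :DOWN message.
  def down_reason(fun) do
    {pid, ref} = spawn_monitor(fun)

    receive do
      {:DOWN, ^ref, :process, ^pid, reason} -> reason
    end
  end
end

# :no_such_server_sketch is a made-up name with no process registered under it,
# so the call exits instead of raising an exception.
reason =
  ExitReasonSketch.down_reason(fn ->
    GenServer.call(:no_such_server_sketch, :ping)
  end)

IO.inspect(reason)
```

The reason here is a two-element tuple starting with `:noproc`, yet no exception was raised and the second element is not a stacktrace.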

Note that it is possible for a generic user to call exit(:normal), which will mean that there will be a “stopped gracefully” log message but not a “job ok” one. This will be treated as success, whereas a task that calls exit(:normal) is treated as a failure because it did not return a response.

It is also possible for an async task to send a response but to exit abnormally - it is possible that the process receives an exit signal in between sending the response and exiting. Therefore when you receive a response you want to treat that as a successful job and demonitor with flush.
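
A minimal sketch of that rule, assuming the job runs in a monitored process that sends its result back to the caller. `JobAwaitSketch` is a hypothetical module and not how Jumbo is implemented.

```elixir
defmodule JobAwaitSketch do
  # Runs `fun` in a monitored process and waits for either its response
  # or its :DOWN message, whichever arrives first.
  def run(fun) do
    parent = self()
    tag = make_ref()
    {pid, mon} = spawn_monitor(fn -> send(parent, {tag, fun.()}) end)

    receive do
      {^tag, result} ->
        # A response means success, however the process exits afterwards.
        # :flush also removes a :DOWN message that may already be queued.
        Process.demonitor(mon, [:flush])
        {:ok, result}

      {:DOWN, ^mon, :process, ^pid, reason} ->
        # No response was received, so this is a failure even for :normal.
        {:error, reason}
    end
  end
end

IO.inspect(JobAwaitSketch.run(fn -> 1 + 2 end))
IO.inspect(JobAwaitSketch.run(fn -> exit(:normal) end))
```

The second call shows the exit(:normal) case from above: the process stops "gracefully" but never responds, so it is reported as an error.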

Calling Logger macros in a separate function is a poor pattern because information is lost. Metadata, such as the module, function and line, is included in the log event automatically. Moving the log messages to their own function loses the function and line information. It would also be more idiomatic to include custom metadata in the log messages than to prefix them with the metadata…
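
A short sketch of the more idiomatic form: the Logger macro is called directly where the event happens and the job id is passed as metadata instead of being interpolated into the message. `LoggingSketch` and the `:job_id` key are assumptions made for this example.

```elixir
defmodule LoggingSketch do
  require Logger

  # Runs the job and logs at the call site, so the module, function and
  # line metadata point at this function rather than at a logging helper.
  def perform_and_log(job_id, fun) do
    result = fun.()
    # Custom metadata goes in the second argument, not into the message text.
    Logger.info("job ok", job_id: job_id)
    result
  end
end

LoggingSketch.perform_and_log(42, fn -> :done end)
```

Whether `job_id` is actually printed depends on which metadata keys the logger is configured to show.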

The fault tolerance of a queue is not ideal, and it somewhat breaks the guarantees that the supervision tree (and queue) are trying to provide. A supervision tree intends that when a process in its tree exits, everything below it has been terminated or will exit asynchronously when it does (descendants are neighbours and not trapping exits). If a child is trapping exits then it may take “some time” to terminate, because the exit signal is not propagated. This means that the child can still exist when the process is restarted, because it is temporarily orphaned. Therefore, if a child process is trapping exits, the parent should also trap exits and terminate its child in its terminate callback. This guarantees that clean up occurs before the restart, so that a restart is given a clean slate. Note that with the current implementation the Task.Supervisor is trapping exits, and so during a restart the concurrency limit is not enforced.

I think in this situation a Queue should be a one_for_all supervisor, because the Task.Supervisor needs to be started before the queue server can start any jobs, and the Task.Supervisor should be shut down when the queue server exits. Providing a supervisor at the top of the library’s tree also allows more freedom to make changes to supervision in the future. It also follows OTP principles more closely and leaves error handling to a supervisor.
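
A sketch of what such a tree could look like. `QueueServerSketch` and `QueueSupSketch` are hypothetical modules, and the child-spec tuples assume Elixir 1.5 or later, which is newer than the version current when this thread was written.

```elixir
defmodule QueueServerSketch do
  use GenServer

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts)

  @impl true
  def init(opts) do
    # The server only remembers which Task.Supervisor it would start jobs under.
    {:ok, Keyword.fetch!(opts, :task_supervisor)}
  end
end

defmodule QueueSupSketch do
  use Supervisor

  def start_link(opts), do: Supervisor.start_link(__MODULE__, opts)

  @impl true
  def init(opts) do
    task_sup = Keyword.fetch!(opts, :task_supervisor)

    children = [
      # Started first, so it exists before the server can start any job.
      {Task.Supervisor, name: task_sup},
      {QueueServerSketch, task_supervisor: task_sup}
    ]

    # :one_for_all: if either child exits, the other is terminated
    # and both are restarted together.
    Supervisor.init(children, strategy: :one_for_all)
  end
end

{:ok, sup} = QueueSupSketch.start_link(task_supervisor: :sketch_task_sup)
IO.inspect(Supervisor.count_children(sup))
```

Because the supervisor stops all remaining children before restarting, no job started by the old queue server is still running when the new one comes up.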

If there is a single job in the queue and it fails, it will be run X number of times and then be lost forever. I am unsure if the tight loop is desirable and whether there should be a user callback to handle a dropped job.
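
One possible shape for this, sketched under assumptions: a bounded retry with a growing delay between attempts and a user-supplied function that is called when the job is finally dropped. `RetrySketch` and its parameters are hypothetical and do not describe Jumbo’s retry logic.

```elixir
defmodule RetrySketch do
  # Runs `fun` up to `max_attempts` times, sleeping `delay_ms * attempt`
  # between tries. Calls `on_drop` with the last failure when giving up.
  def run(fun, max_attempts, delay_ms, on_drop) when max_attempts > 0 do
    attempt(fun, 1, max_attempts, delay_ms, on_drop)
  end

  defp attempt(fun, n, max, delay_ms, on_drop) do
    try do
      {:ok, fun.()}
    catch
      kind, payload ->
        if n < max do
          # Back off instead of retrying in a tight loop.
          Process.sleep(delay_ms * n)
          attempt(fun, n + 1, max, delay_ms, on_drop)
        else
          on_drop.({kind, payload})
          :dropped
        end
    end
  end
end

result =
  RetrySketch.run(fn -> exit(:boom) end, 3, 10, fn failure ->
    IO.inspect(failure, label: "dropped job")
  end)

IO.inspect(result)
```

The `on_drop` function is where a user could log, requeue or store the job instead of losing it silently.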

fyi, queued jobs are executed in random order, which was not expected by me at least.

Domain.Message
|> Domain.Repo.all
|> Enum.sort(&(&1.id > &2.id)) 
|> Enum.each(fn msg -> Jumbo.Queue.enqueue(Domain.Queue.Light, Domain.SampleSleepJob, [msg]) end)

defmodule Domain.SampleSleepJob do
  def perform(message) do
    :timer.sleep(1000)
    time = DateTime.utc_now
    IO.puts "#{time.hour}:#{time.minute}:#{time.second} - #{message.id}"
  end
end

If you run that with concurrency: 1, execution of the queue is in rather random order…

@abitdodgy At the moment the jobs are gone upon restart. I agree with the later post by @hubertlepicki that the assumption that a good job is a persistent job is invalid. For example, in most of the systems I build, due to their nature, you can rebuild the job queue from other data sources upon boot. Adding a persistence mechanism like the one in Ruby’s Sidekiq or in Exq/Toniq just adds some redundancy. Moreover, as @hubertlepicki stated, you often queue tasks that are non-critical anyway. However, if time allows I will add some persistence mechanisms, but they will be optional.

@outlog indeed, currently there’s no guarantee when it comes to job order. Nice suggestion for a future release, thanks!
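
For reference, a first-in, first-out order can be kept with Erlang’s `:queue` module. The sketch below is only an illustration of that data structure. `FifoSketch` is a hypothetical module and says nothing about how Jumbo stores its jobs.

```elixir
defmodule FifoSketch do
  # Thin wrapper around Erlang's :queue that keeps insertion order.
  def new, do: :queue.new()

  # Adds a job at the rear of the queue.
  def enqueue(queue, job), do: :queue.in(job, queue)

  # Takes the oldest job from the front of the queue.
  def dequeue(queue) do
    case :queue.out(queue) do
      {{:value, job}, rest} -> {:ok, job, rest}
      {:empty, _queue} -> :empty
    end
  end
end

queue = Enum.reduce([3, 2, 1], FifoSketch.new(), &FifoSketch.enqueue(&2, &1))
{:ok, first, rest} = FifoSketch.dequeue(queue)
IO.inspect(first)
```

Jobs come out in the order they were enqueued, so with a concurrency of 1 they would also start in that order.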

@fishcakez thank you for the great and deep review. Wow!

I am aware of these limitations. I had to write this library because I quickly needed some replacement for exq/toniq due to some fu***up in one project where we needed to reprocess tens of thousands of files and none of the alternatives were reliable enough to at least reach half of the queue…

So while I agree that there are some corner cases that might affect queue behaviour, they are, well, the 1% of cases that I wasn’t intending to cover in the very first release, which does not mean they won’t be improved in future releases. In real life, in most apps, jobs are not things that affect the process tree; they are rather simple tasks, such as sending mails, processing some data etc. It is highly unlikely that someone will call exit(:normal) in a job. A Perfect Library should obviously handle such cases, but I was time constrained, so I wanted to make a Good Enough For 99% Cases Library. At least for 1.0.0.

According to my own experience even now Jumbo is way more stable and faster than Exq/Toniq so it fits into Good Enough principle :slight_smile:

Callbacks are on the TODO list.

I think you should handle the first issue even if you ignore the rest. That is the assumption that an exit of the form {_, _} is an exception. This is an invalid assumption and will catch many exit/1 calls that are far from edge cases. Formatting errors incorrectly when we have the Exception module is inexcusable.

@fishcakez let me clarify, so I am sure that I understand you correctly.

When you write about {_,_} you mean improper callback handler in https://github.com/mspanc/jumbo/blob/bfed4e62b2dc432e7c150f886ecf1565bb51991c/lib/jumbo/queue.ex#L277, right?

When you write about formatting, you mean using http://elixir-lang.org/docs/v1.3/elixir/Exception.html#format/3 and similar functions, right?

I meant specifically the pattern on this line: https://github.com/mspanc/jumbo/blob/bfed4e62b2dc432e7c150f886ecf1565bb51991c/lib/jumbo/queue.ex#L317 and the other :DOWN patterns (but they are being handled ok). We have Exception.format_exit(reason) or Exception.format({:EXIT, pid}, reason) that you should use to format the reason from the :DOWN tuple.
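
A minimal sketch of that advice: a function that takes a :DOWN message and formats whatever reason it carries with `Exception.format_exit/1`, without pattern matching on the shape of the reason. `DownFormatSketch` is a hypothetical helper, not code from Jumbo.

```elixir
defmodule DownFormatSketch do
  # Formats the reason of any :DOWN message. The reason can be any term,
  # so it is passed to Exception.format_exit/1 as-is.
  def describe({:DOWN, _ref, :process, pid, reason}) do
    "job process #{inspect(pid)} exited: " <> Exception.format_exit(reason)
  end
end

# Two reasons of very different shape: a raised exception with its
# stacktrace, and an arbitrary term passed to exit/1.
for reason <- [{%RuntimeError{message: "oops"}, []}, {:custom, "term"}] do
  {:DOWN, make_ref(), :process, self(), reason}
  |> DownFormatSketch.describe()
  |> IO.puts()
end
```

Both messages are formatted by the same clause, so no reason is left unhandled.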

Oh, I just realised that other :DOWN reasons are not handled! The reason can be any term!

Ok now I get it, thanks!

I believe you might be able to delegate persistency, as well as restarting of failed jobs, to library users by supporting callbacks. If you invoke a custom callback function when the job is about to be started, and after it finishes or crashes, it would give users the chance to store job state into an arbitrary database, and/or requeue jobs if they crash.

That would keep the job queue logic pretty simple, and yet provide a lot of flexibility.
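
A rough sketch of how such callbacks could look, assuming jobs are module/function/args tuples. The behaviour `JobCallbacksSketch`, the runner and the callback names are all hypothetical and are not an existing Jumbo API.

```elixir
defmodule JobCallbacksSketch do
  # Hypothetical callbacks a library user would implement.
  @callback on_start(job :: term) :: any
  @callback on_finish(job :: term, result :: term) :: any
  @callback on_crash(job :: term, reason :: term) :: any
end

defmodule CallbackRunnerSketch do
  # Runs the job and reports each stage to the user's callback module.
  def run({module, function, args} = job, callbacks) do
    callbacks.on_start(job)

    outcome =
      try do
        {:ok, apply(module, function, args)}
      catch
        kind, payload -> {:error, {kind, payload}}
      end

    case outcome do
      {:ok, result} -> callbacks.on_finish(job, result)
      {:error, reason} -> callbacks.on_crash(job, reason)
    end

    outcome
  end
end

defmodule RecordingCallbacksSketch do
  @behaviour JobCallbacksSketch

  # A real implementation could write to a database or requeue the job.
  # This one only sends messages to the calling process.
  def on_start(job), do: send(self(), {:started, job})
  def on_finish(job, result), do: send(self(), {:finished, job, result})
  def on_crash(job, reason), do: send(self(), {:crashed, job, reason})
end

IO.inspect(CallbackRunnerSketch.run({String, :upcase, ["abc"]}, RecordingCallbacksSketch))
```

The queue itself stays unaware of storage, while `on_start` and `on_crash` give users a place to persist or requeue jobs.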
