Durable Workflow Execution (Temporal alternative for Elixir)

Hello!

Suppose you are building a workflow (order / task / payment) processing system with the following requirements:

  • Each workflow consists of several steps.
  • Each step can fail (throw an exception or return an error) or time out, and needs to be retried according to a retry policy.
  • State machine - the choice of the next step in a workflow depends on the result of the previous step.
  • Durable execution - workflows, the steps taken, and their success or failure are persisted to a DB, so that workflows are never lost (e.g. if the server fails) and can be resumed from the next step.
  • State visibility - Web UI to monitor workflow progress.
  • Other less important requirements, e.g. auto scaling of workers executing steps, cancellation of workflows, signalling to workflows, …
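To make the first three requirements concrete, here is a minimal in-memory sketch. Everything in it is hypothetical (the `WorkflowSketch` module, the step tuple shape, the `charge`/`ship`/`refund` step names), and it deliberately leaves out timeouts and persistence: a durable system would write each `history` entry to the DB instead of keeping it in a list.

```elixir
defmodule WorkflowSketch do
  @moduledoc "Hypothetical in-memory sketch; a durable system would persist `history` to a DB."

  # Runs `steps` starting at `start`. Each step is `{fun, max_attempts, transitions}`,
  # where `transitions` maps the step's outcome (:ok or :error) to the next step name.
  def run(steps, start, ctx \\ %{}), do: loop(steps, start, ctx, [])

  defp loop(_steps, :done, ctx, history), do: {ctx, Enum.reverse(history)}

  defp loop(steps, name, ctx, history) do
    {fun, max_attempts, transitions} = Map.fetch!(steps, name)
    {outcome, ctx, attempts} = attempt(fun, ctx, 1, max_attempts)
    # State machine: the outcome of this step selects the next one.
    next = Map.get(transitions, outcome, :done)
    loop(steps, next, ctx, [{name, outcome, attempts} | history])
  end

  # Retries until the step returns {:ok, ctx} or the attempts run out.
  # Raised exceptions are treated like an error return.
  defp attempt(fun, ctx, n, max) do
    result =
      try do
        fun.(ctx)
      rescue
        e -> {:error, e}
      end

    case result do
      {:ok, new_ctx} -> {:ok, new_ctx, n}
      _error when n < max -> attempt(fun, ctx, n + 1, max)
      _error -> {:error, ctx, n}
    end
  end
end

# Example: charge succeeds, ship fails twice, so the workflow moves to refund.
steps = %{
  charge: {fn ctx -> {:ok, Map.put(ctx, :charged, true)} end, 3, %{ok: :ship, error: :refund}},
  ship: {fn _ctx -> {:error, :out_of_stock} end, 2, %{ok: :done, error: :refund}},
  refund: {fn ctx -> {:ok, Map.put(ctx, :refunded, true)} end, 1, %{}}
}

{ctx, history} = WorkflowSketch.run(steps, :charge)
# history is a list of {step_name, outcome, attempts_used} tuples
```

Keeping the transitions as data next to each step is one possible design; it keeps the graph visible in one place, which is what a monitoring UI would need to render.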

I believe such a system is required by many websites and is a good fit for Elixir. However, I believe there is no ready-to-use solution in Elixir, and developing such a system for each website separately is a waste of time and resources. Do you agree with this statement? How would you go about developing such a system in Elixir? Would you write it from the ground up? Maybe there is some framework I have missed?

Solutions I have considered:

  • Temporal - an almost ideal fit for the system requirements; however, there is no SDK for Elixir (an unofficial SDK is in development and should be ready within 2-3 months), and I believe it is not an ideal fit for Elixir. I think a better solution for Elixir would be to run the service managing state and the workers executing steps together in the BEAM. This would allow us to, e.g., leverage the BEAM’s message passing, use a cache inside the BEAM (e.g. Cachex), and in general have fewer dependencies.
  • Oban - provides durable execution, retries, and timeouts; however, it is not a state machine, as it is made for durable background job (not workflow) processing. While at first it seems that adding a state machine would not be a problem at all, I believe this would require many tricks and hacks, resulting in poor code clarity and poor state visibility - Oban’s Web UI is not made for this use case. E.g., I have considered using Oban Workflows with ignore_discarded; however, this requires scheduling steps for both cases (previous step failed or succeeded) and creates messy code as well as poor workflow visibility.

This is something we plan to fix. A dedicated workflow view is slated for Oban Web.

There’s no need to schedule separate jobs for the result of the previous job. With recorded jobs it’s simple to fetch the result from the previous step, or even to check the status of the previous step if needed. Here’s an example:

@impl Oban.Pro.Worker
def process(job) do
  case Workflow.get_recorded(job, :previous_step) do
    nil ->
      # The previous step failed and nothing was recorded; handle that case here.
      :ok

    value ->
      # The previous step worked; carry on using its recorded value.
      {:ok, value}
  end
end

I agree, another option is to schedule a single job that handles both cases. However, in that case the job at the next step (the 3rd one) would need to handle 4 cases (1st failed, 2nd succeeded; 1st failed, 2nd failed; 1st succeeded, 2nd succeeded; 1st succeeded, 2nd failed), and the step after that would need to handle 8 cases. To sum up, if the state transition graph is even slightly complex, this solution creates messy code. It also decreases visibility, because a single job could now represent completely different operations (e.g. cleanup after the previous job, assuming it failed, or processing of the next step, assuming the previous job succeeded). As a result, looking at a single job in the Web UI (without looking at all of the previous jobs in the same workflow) does not make it clear what is happening.
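The growth in cases can be sketched by enumerating every success/failure history a job at step n might have to distinguish. This is only an illustration under the assumption that each earlier step can independently succeed or fail; the `CaseCount` module is hypothetical and not part of Oban.

```elixir
defmodule CaseCount do
  # Lists all outcome histories of the n - 1 steps that precede step `n`.
  # Each earlier step doubles the number of histories, giving 2^(n - 1).
  def histories(n) when n >= 1 do
    Enum.reduce(1..(n - 1)//1, [[]], fn _step, acc ->
      for history <- acc, outcome <- [:ok, :error], do: history ++ [outcome]
    end)
  end
end

second = length(CaseCount.histories(2)) # 2 cases at the 2nd step
third = length(CaseCount.histories(3))  # 4 cases at the 3rd step
fourth = length(CaseCount.histories(4)) # 8 cases at the 4th step
```

In practice many of these histories may collapse into the same handling, but the job code still has to decide which ones do.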