Durable Workflow Execution (Temporal alternative for Elixir)

stjefim · March 14, 2025, 8:57am

Hello!

Suppose you are building a workflow (order / task / payment) processing system with the following requirements:

Each workflow consists of several steps.
Each step can fail (throw an exception or return an error) or time out, and needs to be retried according to a retry policy.
State machine - the choice of the next step in a workflow depends on the result of the previous step.
Durable execution - workflows, the steps taken, and their success or failure are persisted to a DB, so that workflows are never lost (e.g. if a server fails) and can be resumed from the next step.
State visibility - a Web UI to monitor workflow progress.
Other less important requirements, e.g. auto scaling of workers executing steps, cancellation of workflows, signalling to workflows, …

I believe such a system is required by many websites and is a good fit for Elixir. However, I believe there is no ready-to-use solution in Elixir, and developing such a system separately for each website is a waste of time and resources. Do you agree with this statement? How would you go about developing such a system in Elixir? Writing it from the ground up? Maybe there is some framework I have missed?

Solutions I have considered:

Temporal - an almost ideal fit for the system requirements. However, there is no SDK for Elixir (an unofficial SDK is in development and should be ready within 2-3 months), and I believe it is not an ideal fit for Elixir. I think a better solution for Elixir would be to run the Service managing state and the Workers executing steps together in the BEAM. This would make it possible to, e.g., leverage BEAM’s message passing, use a cache inside the BEAM (e.g. Cachex) and in general have fewer dependencies.
Oban - provides durable execution, retries and timeouts, but not a state machine, as it is made for durable background job (not workflow) processing. While at first it seems that adding a state machine is not a problem at all, I believe this would require many tricks and hacks, resulting in poor code clarity and poor state visibility - Oban’s Web UI is not made for this use case. E.g., I have considered using Oban Workflows with ignore_discarded; however, this requires scheduling steps for both cases (previous step failed or succeeded) and creates messy code as well as poor workflow visibility.
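
To make the requirements above concrete, here is a minimal in-memory sketch of the core loop: steps with a retry policy, the next step chosen from the previous outcome, and a history of attempts. Everything in it is hypothetical (the `WorkflowSketch` module, the step map shape, the `:charge` / `:ship` / `:notify` steps); it does not use Temporal or Oban, it leaves out timeouts, and the `history` list only stands in for what a real system would write to a DB after every attempt.

```elixir
defmodule WorkflowSketch do
  # Hypothetical illustration only. `steps` is a map of
  #   name => %{run: fun, max_attempts: n, next: fun}
  # where `run` takes the state and returns {:ok, state} | {:error, reason},
  # and `next` maps :ok | :error to the next step name or :done.
  def run(steps, first_step, state), do: loop(steps, first_step, state, [])

  defp loop(_steps, :done, state, history), do: {state, Enum.reverse(history)}

  defp loop(steps, name, state, history) do
    step = Map.fetch!(steps, name)
    {outcome, state, history} = attempt(step, name, state, history, 1)
    # State machine: the outcome of this step decides which step runs next.
    loop(steps, step.next.(outcome), state, history)
  end

  defp attempt(step, name, state, history, n) do
    case safe_run(step.run, state) do
      {:ok, new_state} ->
        {:ok, new_state, [{name, n, :ok} | history]}

      {:error, reason} ->
        # Record every failed attempt (a real system would persist this).
        history = [{name, n, {:error, reason}} | history]

        if n < step.max_attempts do
          attempt(step, name, state, history, n + 1)
        else
          {:error, state, history}
        end
    end
  end

  # Treat a raised exception the same way as a returned error.
  defp safe_run(fun, state) do
    fun.(state)
  rescue
    e -> {:error, Exception.message(e)}
  end
end

steps = %{
  charge: %{
    run: fn _state -> {:error, :card_declined} end,
    max_attempts: 2,
    next: fn
      :ok -> :ship
      :error -> :notify
    end
  },
  ship: %{
    run: fn state -> {:ok, Map.put(state, :shipped, true)} end,
    max_attempts: 1,
    next: fn _ -> :done end
  },
  notify: %{
    run: fn state -> {:ok, Map.put(state, :notified, true)} end,
    max_attempts: 1,
    next: fn _ -> :done end
  }
}

{state, history} = WorkflowSketch.run(steps, :charge, %{})
```

In this run `:charge` fails twice, so the workflow goes to `:notify` and never reaches `:ship`; `history` holds one entry per attempt, which is roughly the information a progress UI would need.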

sorentwo · March 14, 2025, 3:11pm

This is something we plan to fix. A dedicated workflow view is slated for Oban Web.

There’s no need to schedule separate jobs for the result of the previous job. With recorded jobs it’s simple to fetch the result from the previous step, or even to check the status of the previous step if needed. Here’s an example:

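# Assumes `alias Oban.Pro.Workflow` in the worker module.
@impl Oban.Pro.Worker
def process(job) do
  case Workflow.get_recorded(job, :previous_step) do
    nil ->
      # It failed and nothing was recorded, handle that case.
      # (Placeholder return so the branch is not empty.)
      {:cancel, :previous_step_failed}

    value ->
      # It worked, carry on.
      # (Placeholder return so the branch is not empty.)
      {:ok, value}
  end
end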

stjefim · March 14, 2025, 3:57pm

I agree, another option is to schedule a single job which handles both cases. However, in that case the job at the next step (the 3rd one) would need to handle 4 cases (1st failed, 2nd succeeded; 1st failed, 2nd failed; 1st succeeded, 2nd succeeded; 1st succeeded, 2nd failed), and the step after that 8 cases. To sum up, if the state transition graph is even a bit complex, this solution creates messy code. It also decreases visibility, since a single job could now represent completely different operations (e.g. cleanup after the previous job assuming it failed, or processing of the next step assuming the previous job succeeded). Thus, looking at a single job in the Web UI (without looking at all previous jobs of this workflow), it is unclear what is happening.
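
The growth described above can be checked with a short enumeration. This is only a sketch with a hypothetical `OutcomeCombos` module: it assumes every preceding step either succeeded or failed, and lists all outcome histories a job would have to distinguish after `n` preceding steps.

```elixir
defmodule OutcomeCombos do
  @outcomes [:succeeded, :failed]

  # All possible outcome histories for `n` preceding steps.
  def histories(0), do: [[]]

  def histories(n) when n > 0 do
    for history <- histories(n - 1), outcome <- @outcomes, do: history ++ [outcome]
  end
end

# {number of preceding steps, number of cases to handle}
counts = for n <- 1..3, do: {n, length(OutcomeCombos.histories(n))}
IO.inspect(counts)
# prints [{1, 2}, {2, 4}, {3, 8}]
```

The count doubles with each additional preceding step, which matches the 4 cases at the 3rd step and 8 at the step after it.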

effinbanjos · May 29, 2025, 3:38pm

I’ve been thinking about this a bit too, having just discovered Temporal recently. I have use cases where I think I need Python workers (not just an embedded script a la Pythonx) due to library support, and it seems it would be pretty nice to use the same orchestration layer - no idea what it actually feels like in practice, of course. I suppose it provides Elixir with another path for integration within heterogeneous environments.

MrDoops · May 29, 2025, 4:00pm

I’m building something like this on top of Runic. No guarantees on when the web UI & durable execution will be in a production-ready state, though. I have a durable runner implemented but am still working on check-pointing for long-running workflows. There’s also quite a bit one would normally want in this sort of thing, like triggers (e.g. webhooks, messaging system integrations), CRON / time-scheduled workflows, and so on.

I’d recommend Oban if you want durable execution with graph-based workflows today.

While it’s not a DAG, Commanded (GitHub - commanded/commanded: Use Commanded to build Elixir CQRS/ES applications) is also this sort of thing, but with DDD/CQRS abstractions.

Paulo recently released handoff, which uses DAGs: GitHub - polvalente/handoff: Distributed graph execution in Elixir

effinbanjos · May 29, 2025, 7:52pm

That’s a good point about CQRS and Commanded - quite a robust native Elixir option. I hadn’t even heard of handoff - very cool looking!