Hello!
Suppose you are building a workflow (order / task / payment) processing system with the following requirements:
- Each workflow consists of several steps.
- Each step can fail (throw an exception or return an error) or time out, and needs to be retried according to a retry policy.
- State machine - the choice of the next step in a workflow depends on the result of the previous step.
- Durable execution - workflows, the steps taken, and their success or failure are persisted to a DB, so that workflows are never lost (e.g. if a server fails) and can be resumed from the next step.
- State visibility - a Web UI to monitor workflow progress.
- Other, less important requirements, e.g. auto-scaling of workers executing steps, cancellation of workflows, signalling to workflows, …
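To make the requirements above more concrete, here is a minimal sketch of the kind of engine I have in mind. Everything in it is made up for illustration (the `WorkflowSketch` module, the step names, the return tuples). The log is just an in-memory list standing in for rows in a DB, and timeouts, backoff between retries, and resuming after a crash are left out.

```elixir
defmodule WorkflowSketch do
  @moduledoc """
  Hypothetical illustration only, not an existing library.

  A workflow is a map of `step_name => {fun, max_attempts}`. Each step function
  receives the workflow state and returns `{:next, step_name, state}`,
  `{:done, state}` or `{:error, reason}` (raising also counts as an error).
  """

  # Runs the workflow starting at `step`. `log` stands in for persisted DB rows.
  def run(steps, step, state, log \\ []) do
    {fun, max_attempts} = Map.fetch!(steps, step)

    case attempt(fun, state, max_attempts, 1) do
      {:ok, {:next, next_step, new_state}, attempts} ->
        # State machine: the step's own result picks the next step.
        run(steps, next_step, new_state, [{step, :ok, attempts} | log])

      {:ok, {:done, final_state}, attempts} ->
        {:ok, final_state, Enum.reverse([{step, :ok, attempts} | log])}

      {:error, reason, attempts} ->
        {:error, reason, Enum.reverse([{step, :failed, attempts} | log])}
    end
  end

  # Retry policy (assumed): immediate retries, up to `max_attempts` in total.
  defp attempt(fun, state, max_attempts, n) do
    result =
      try do
        fun.(state)
      rescue
        e -> {:error, e}
      end

    case result do
      {:error, _reason} when n < max_attempts -> attempt(fun, state, max_attempts, n + 1)
      {:error, reason} -> {:error, reason, n}
      other -> {:ok, other, n}
    end
  end
end

# Example: the outcome of :charge decides whether the order ships or is cancelled.
charge = fn order ->
  if order.amount <= order.balance do
    {:next, :ship, order}
  else
    {:next, :cancel, order}
  end
end

steps = %{
  reserve: {fn order -> {:next, :charge, Map.put(order, :reserved, true)} end, 1},
  charge: {charge, 3},
  ship: {fn order -> {:done, Map.put(order, :status, :shipped)} end, 1},
  cancel: {fn order -> {:done, Map.put(order, :status, :cancelled)} end, 1}
}

{:ok, final, log} = WorkflowSketch.run(steps, :reserve, %{amount: 10, balance: 5})
# final.status is :cancelled
# log is [{:reserve, :ok, 1}, {:charge, :ok, 1}, {:cancel, :ok, 1}]
```

In a real system each log entry would be written to the DB before moving on, so that a restarted node could pick the workflow up at the next step, and the same rows could feed the Web UI.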
I believe such a system is needed by many websites and is a good fit for Elixir. However, I believe there is no ready-to-use solution in Elixir, and developing such a system separately for each website is a waste of time and resources. Do you agree with this statement? How would you go about developing such a system in Elixir? Would you write it from the ground up? Maybe there is some framework I have missed?
Solutions I have considered:
- Temporal - an almost ideal fit for the system requirements. However, there is no SDK for Elixir (an unofficial SDK is in development and should be ready within 2-3 months), and I believe it is not an ideal fit for Elixir. I think a better solution for Elixir would be to run the Service managing state and the Workers executing steps together in the BEAM. This would make it possible to, e.g., leverage the BEAM’s message passing, use a cache inside the BEAM (e.g. Cachex), and in general have fewer dependencies.
- Oban - provides durable execution, retries, and timeouts, but not a state machine, as it is made for durable background job (not workflow) processing. While at first it seems that adding a state machine is not a problem at all, I believe this would require many tricks and hacks, resulting in poor code clarity and poor state visibility, since Oban’s Web UI is not made for this use case. E.g., I have considered using Oban Workflows with ignore_discarded; however, this requires scheduling steps for both cases (previous step failed or succeeded) and creates messy code as well as poor workflow visibility.