Any tips on building a workflow engine similar to n8n?

Hi folks,

I’m working on building a workflow engine for my company project. We would like to build something dynamic that makes it easy to add new integrations with 3rd party services, because we are building a kind of centralized hub for e-commerce.
I don’t have much experience building something this dynamic.
I would like to ask for advice on the system components, how an engine should be structured, and which patterns I can apply. I started this project a couple of weeks ago, and with the help of AI I build, refactor, and repeat. But I’m not sure I’m heading in the right direction.

Here is my work in progress: GitHub - bluzky/prana

Please give me some insight; that would be a great help.

Thank you

8 Likes

Useful project.

Suggestions/inquiries:

Can you create examples/README.md and/or ref it in the main README.md?

If there are external integrations, are they either mocked or disabled by default (with clear tags) when running mix test?

Are you doing static analysis with dialyzer?

Perhaps consider adding a GitHub Actions runner to show the green check, indicating all default tests pass?

Thanks @gtcode, external integrations are implemented in the main application.
I’m working on testing the integration with our main application, going back and forth with refactoring to adapt to the application’s needs. Lots of unknowns here.

Glific is a graph-based chatbot creation platform. The backend is in Elixir and handles the floweditor engine, i.e. executing a graph of nodes. Maybe you can check this out.

1 Like

Thank you, I’m looking into it

I went through the codebase; everything I would expect of a workflow engine is there. Very nice implementation, good work.

What I would expect from an engine is the ability to recover from:

  • kill -9 erlang vm
  • rm -rf /local/data
  • k8s cluster delete

When I start my Elixir cluster again, absolutely nothing should be lost, and the system should eventually return my 10,000 workflows to a running state as if nothing happened.

That is one of the hard parts about these things.

There are a few open-source examples for your examination:

temporal.io · GitHub (more programmer focused)
Oban Pro has the ability to run workflows; however, it’s closed source (possibly wrong here)

^ Those are end-game solutions; it’s not actually too hard to track workflow execution state using Postgres or similar to get durability. The simple solution works just as well at normal scale :slight_smile:
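
To make that concrete, here is a minimal sketch of the Postgres approach, assuming Ecto and a hypothetical workflow_executions table (none of these module or table names come from prana or Oban): checkpoint after every completed step, and on boot re-load anything still marked running.

    defmodule MyApp.WorkflowExecution do
      use Ecto.Schema

      schema "workflow_executions" do
        field :status, :string, default: "running"
        field :current_step, :string
        field :state, :map, default: %{}
        timestamps()
      end
    end

    defmodule MyApp.Durability do
      import Ecto.Query
      alias MyApp.{Repo, WorkflowExecution}

      # Persist progress after each completed step, so a crash loses at
      # most the single step that was in flight.
      def checkpoint(execution, step_name, output) do
        execution
        |> Ecto.Changeset.change(
          current_step: to_string(step_name),
          state: Map.put(execution.state, to_string(step_name), output)
        )
        |> Repo.update()
      end

      # On startup, pick up every execution that never reached a terminal
      # status and resume each one from its last checkpoint.
      def recover_all do
        WorkflowExecution
        |> where([e], e.status == "running")
        |> Repo.all()
      end
    end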

I’d also look at business process engines; same idea, just a different style of implementation.

Awesome work though :+1: Looking forward to seeing how it progresses over time.

2 Likes

While reading your post, postgres came to mind for durability, and then you mentioned it!

I’m working on related problems. Reading these posts prompted me to think through the following integration concerns related to durability and resiliency:

Step Recovery

Pure Deterministic (No Side Effects): Mathematical calculations, data transformations, pure functions. These are truly idempotent - same input always produces same output with no external impact. Retry freely without concern. Examples: JSON parsing, mathematical computations, string manipulations.

Deterministic with Side Effects: Database writes, file operations, REST API calls with predictable behavior. The logic is deterministic but external state matters. Implement conditional idempotency using techniques like: check-then-act patterns, unique transaction IDs, upserts over inserts, and proper state validation before retry. Examples: user registration, inventory updates, email notifications with deduplication keys.

Non-Deterministic without Side Effects: AI inference, random number generation, heuristic algorithms that don’t modify external state. Use semantic idempotency - ensure the intent/goal remains consistent even if exact outputs differ. Checkpoint based on meaningful progress rather than exact state. Examples: content generation, recommendation algorithms, data analysis with ML models.

Non-Deterministic with Side Effects: AI agents making API calls, LLM-powered workflows that interact with external systems, adaptive processes that learn from environment. The most complex category - combine semantic idempotency with careful side effect management. Implement context-aware checkpointing that captures both progress and environmental assumptions, validate external state on recovery, and design for graceful adaptation when conditions change. Examples: AI agents conducting research and filing reports, automated trading systems, dynamic workflow orchestration.
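
Circling back to the second category, here is a minimal Ecto sketch of the “upserts over inserts” technique (schema, table, and key names are illustrative, and a unique index on the deduplication key is assumed):

    defmodule MyApp.Payment do
      use Ecto.Schema

      schema "payments" do
        field :idempotency_key, :string
        field :amount, :integer
      end
    end

    # A retried step cannot double-charge: the second insert hits the
    # unique index on :idempotency_key and becomes a no-op.
    MyApp.Repo.insert(
      %MyApp.Payment{idempotency_key: "order-123-payment", amount: 4999},
      on_conflict: :nothing,
      conflict_target: :idempotency_key
    )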

Workflow Management

Design Strategy: Structure your workflow engine with step classification at the core - each step declares its category (pure deterministic, deterministic+side effects, non-deterministic, non-deterministic+side effects) which drives the recovery logic. Store step definitions, execution state, and checkpoints in Postgres with proper transaction boundaries.
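
For example, a step definition might declare its category up front (a hypothetical struct, purely illustrative):

    defmodule MyApp.StepDef do
      @enforce_keys [:name, :category, :run]
      defstruct [:name, :category, :run, :resume]

      @type category ::
              :pure_deterministic
              | :deterministic_side_effects
              | :non_deterministic
              | :non_deterministic_side_effects
    end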

Implementation Approach: Use Postgres transactions to atomically update step status and checkpoint data. For deterministic steps, store minimal state (input/output hashes, completion flags). For non-deterministic steps, serialize rich checkpoint data including context, assumptions, and partial progress. Implement a recovery coordinator that reads the step type and applies the appropriate strategy: simple retry for pure functions, conditional retry with state validation for side-effect deterministic steps, and checkpoint-based resumption for non-deterministic work. Leverage Postgres’s ACID properties to ensure the workflow state itself remains consistent even when individual steps fail, and use row-level locking to handle concurrent workflow executions safely.
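
A sketch of that recovery coordinator, building on the hypothetical step definition above (the checkpoint here is an illustrative map with :input, :output, and :progress keys; in practice the dispatch would run inside a transaction holding a row-level lock on the execution):

    defmodule MyApp.RecoveryCoordinator do
      # Read the step's declared category and apply the matching strategy.
      def recover(step, checkpoint) do
        case step.category do
          # Pure functions: retry freely.
          :pure_deterministic ->
            step.run.(checkpoint.input)

          # Validate external state, then conditionally retry.
          :deterministic_side_effects ->
            if applied?(checkpoint),
              do: {:ok, checkpoint.output},
              else: step.run.(checkpoint.input)

          # Resume from the last meaningful checkpoint instead of replaying.
          :non_deterministic ->
            step.resume.(checkpoint.progress)

          # As above, but re-validate environmental assumptions first.
          :non_deterministic_side_effects ->
            with :ok <- validate_environment(checkpoint) do
              step.resume.(checkpoint.progress)
            end
        end
      end

      # Placeholders: e.g. look up a dedup key, or re-check external state.
      defp applied?(checkpoint), do: checkpoint.output != nil
      defp validate_environment(_checkpoint), do: :ok
    end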

Please share revisions/corrections/enhancements based on your knowledge and experience.

3 Likes

Thanks @nerdyworm for your references.

Recovering workflow execution is what I’m stuck on. Thanks @gtcode for the points above; they’re very clear, and your comments lit up my mind. I’m currently overloaded by too many problems, so I’ve decided to integrate with our main application first to test a real workflow, then come back to add more improvements later. I know that will break lots of things and take much effort to refactor, but I think it’s better than getting stuck trying to solve complicated problems and making no real progress.

I’ll update progress later when I come back to play with these recovery strategies.

Hi, we have an internal workflow engine that is quite different from n8n (code-only). Two thoughts. The API may be easier to read if you take inspiration from Oban workflows: Oban.Pro.Workflow — Oban Pro v1.6.2. (Btw, we are Oban Pro users and highly recommend it!)

    Workflow.new()
    |> Workflow.add(:a, new_echo(1))
    |> Workflow.add(:b, new_echo(2), deps: :a)
    |> Workflow.add(:c, new_echo(3), deps: :b)
    |> Workflow.add(:d, new_echo(4), deps: :b)
    |> Workflow.add(:e, new_echo(5), deps: [:c, :d])
    |> Oban.insert_all()

If you use a library like GitHub - bitwalker/libgraph (a graph data structure library for Elixir projects), validation/orchestration might be easier.
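
For example (a sketch, not from prana), libgraph gives you cycle detection and a topological execution order almost for free:

    # Build the dependency graph of a workflow's steps.
    g =
      Graph.new()
      |> Graph.add_edge(:a, :b)
      |> Graph.add_edge(:b, :c)
      |> Graph.add_edge(:b, :d)
      |> Graph.add_edge(:c, :e)
      |> Graph.add_edge(:d, :e)

    Graph.is_cyclic?(g) # => false; reject the workflow when this is true
    Graph.topsort(g)    # => [:a, :b, :c, :d, :e] (one valid execution order)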

On another note, if you are interested in durable execution, we took a lot of inspiration from build systems and Nix. A lot of the same problems have been solved by compilers.

For example, imagine you need to compile a project and there is a compile error. What happens? A subset of the artifacts get stored and cached. Then you fix the error, recompile, and it picks up where you left off. Some of the compilation work is skipped over (no-op) and the ones for which you fixed the error are able to keep going.
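
Sketching that idea for workflow steps (all names here are hypothetical): key each step’s artifact by a hash of its inputs, treat cache hits as no-ops, and cache nothing on failure, so the next run resumes exactly at the step that broke.

    defmodule MyApp.StepCache do
      # Run a step, memoizing its result by a content hash of its inputs,
      # the way a build system caches compilation artifacts.
      def run_step(step_name, input, fun) do
        key = :crypto.hash(:sha256, :erlang.term_to_binary({step_name, input}))

        case lookup(key) do
          {:ok, cached} ->
            # Artifact already built: skip over this step (no-op).
            {:ok, cached}

          :miss ->
            with {:ok, result} <- fun.(input) do
              # Cache the artifact so future runs can skip this step.
              store(key, result)
              {:ok, result}
            end
            # On {:error, _} nothing is cached, so the next run picks up
            # where it left off, like recompiling after fixing an error.
        end
      end

      # Hypothetical cache, e.g. an ETS table or a Postgres table.
      defp lookup(_key), do: :miss
      defp store(_key, _result), do: :ok
    end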

Hope that helps, happy to discuss further ideas. Having something like n8n for Elixir would be amazing!

5 Likes

I would look at (or use) Oban and Ash Reactor. Maybe someone in the Ash project would be interested in collaborating.

IMO a GUI for workflow definition is optional. Kestra uses YAML files…

N8N has security/privacy/sustainability problems. An open-source Elixir-based workflow engine would be great.

3 Likes

We built something like this at Bridge Connector, before the company went under (bad mgmt).

Let me know what you’re struggling on at the moment and maybe I can help.

Take a look at OpenFn if you’re curious to see other approaches too

Questions:

  • presumably your user is someone building a workflow from a UI out of individual steps?
  • is there branching logic?
  • dependency concepts? (e.g. branches combining)
  • are the integrations using your API keys or theirs? is noisy neighbor/workflow a concern?
  • is it json schema for the workflow inputs?

Let me know, and I’d be happy to hop on a call sometime too

2 Likes

Thanks @venkatd, I wish I had known about libgraph before, so I wouldn’t have had to implement my own graph builder. The Oban Workflow builder helpers are so clean.

Thank you, OpenFn looks interesting; I didn’t know it was open source before.

Regarding your questions:

  • I want to support multi-input nodes, conditional branching, loops, and triggering sub-workflows (async-resume and synchronous); these patterns have been tested in my application.
  • I want to define reusable steps; a workflow then combines those steps.
  • This library focuses only on executing workflows. The application handles the rest: database persistence, triggering sub-workflows, suspend/resume, and integrations. The application should provide environment variables, secrets, integration API keys, …
  • I haven’t thought about a JSON schema for inputs, just a schema for reusable step config.
  • Users may define a workflow as a JSON map, but it’s easy to make mistakes, so I’m working on a minimal UI for a workflow editor.

As @gtcode mentioned above, my biggest challenge is how to recover from failure: a node crashing, or a restart while an execution is running. Even if I save execution state to the DB, I don’t know whether the workflow is actually running or whether it crashed while the DB status still says “running”. How can I detect whether the node running that workflow is dead or alive?
Another challenge is how to handle workflow retries: should I retry the whole workflow or a single node? Some steps are not idempotent.

It would be great to have an open-source workflow system.

Have you checked out Reactor from the Ash team? Maybe it could supply some of the required functions, while the new system handles issues like data/step persistence and pausing the flow for manual actions (“approve/decline”), etc.
Also, I think you could look at a BPM engine like GitHub - synrc/bpe: 💠 BPE: BPMN Process Engine ISO 19510

1 Like

Thank you for your references

1 Like

@bluzky

I found this one too; it could be interesting

1 Like

Thank you, we’re in the integration phase, so it’s hard to change the current implementation

1 Like

Oooh this is so cool!

For some of my projects, I built what is, essentially, a “durable reactive graph” library (journey). It is not exactly what you are building (I think, since this is very different from n8n in many respects), but it has some similarities that you might find useful, at least at the level of concepts, in that it lets you define a graph with inputs and computations, with direct or conditional dependencies:

import Journey.Node
graph = Journey.new_graph(
  "demo graph",
  "v1",
  [
    input(:x),
    input(:y),
    # :sum is unblocked when :x and :y are provided.
    compute(:sum, [:x, :y], fn %{x: x, y: y} -> {:ok, x + y} end),
    # :large_value_alert is unblocked when :sum is provided and is greater than 40.
    compute(
        :large_value_alert,
        [sum: fn sum_node -> sum_node.node_value > 40 end],
        fn %{sum: sum} -> {:ok, "🚨, at #{sum}"} end,
        f_on_save: fn _execution_id, _result ->
           # (e.g. send a pubsub notification to the LiveView process, to update the UI)
           :ok
        end
    )
  ]
)

and then to create and run executions of that graph (set their input values, and read their computed values):

execution = Journey.start_execution(graph)
execution = Journey.set_value(execution, :x, 12)
execution = Journey.set_value(execution, :y, 2)
Journey.get_value(execution, :sum, wait_any: true)
{:ok, 14}
Journey.get_value(execution, :large_value_alert)
{:error, :not_set}

I think this might be different from your use case in some ways (?), and so Journey takes care of a number of things behind the scenes that I needed for my use cases (e.g. Phoenix applications that guide users through a multi-step, conditional flow).

(For my use cases, I needed persistence (so the value of every node is persisted as soon as it is set or computed), reliability of computations (the functions in those compute nodes are subject to retry policies), and horizontal scalability of computations (those functions will be picked up and called on any of the replicas of my application). I also needed careful mutability and one-time and recurring scheduling (so Journey also has mutate(), schedule_once(), and schedule_recurring() types of nodes, in addition to the input() and compute() you see in this example), and durability (so that I can always Journey.load() an execution by its id, even after all kinds of outages / redeployments / page reloads, at any time in the future, and it will continue as if nothing happened).)

I also needed some niceties, which may or may not be relevant for your case – things like visualizing the graph:

Journey.Tools.generate_mermaid_graph(graph)

here is the mermaid that gets generated:

graph TD
    %% Graph
    subgraph Graph["🧩 'demo graph', version v1"]
        execution_id[execution_id]
        last_updated_at[last_updated_at]
        x[x]
        y[y]
        sum["sum<br/>(anonymous fn)"]
        large_value_alert["large_value_alert<br/>(anonymous fn)"]

        x -->  sum
        y -->  sum
        sum -->  large_value_alert
    end

    %% Styling
    classDef inputNode fill:#e1f5fe,stroke:#01579b,stroke-width:2px,color:#000000
    classDef computeNode fill:#f3e5f5,stroke:#4a148c,stroke-width:2px,color:#000000
    classDef scheduleNode fill:#fff3e0,stroke:#e65100,stroke-width:2px,color:#000000
    classDef mutateNode fill:#e8f5e8,stroke:#2e7d32,stroke-width:2px,color:#000000

    %% Apply styles to actual nodes
    class y,x,last_updated_at,execution_id inputNode
    class large_value_alert,sum computeNode

To help me understand the state of things (e.g. “hey did Mario get their reminder email?”), Journey has some introspection tooling that tells me the state of an execution. In this example you can see that :sum didn’t get computed because :y is yet to be provided:

iex(8)> Journey.Tools.summarize_as_text(execution.id) |> IO.puts
Execution summary:
- ID: 'EXECAT36JBH8VTM82150REVX'
- Graph: 'demo graph' | 'v1'
- Archived at: not archived
- Created at: 2025-09-04 05:06:04Z UTC | 54 seconds ago
- Last updated at: 2025-09-04 05:06:39Z UTC | 19 seconds ago
- Duration: 35 seconds
- Revision: 1
- # of Values: 3 (set) / 6 (total)
- # of Computations: 2

Values:
- Set:
  - last_updated_at: '1756962399' | :input
    set at 2025-09-04 05:06:39Z | rev: 1

  - x: '12' | :input
    set at 2025-09-04 05:06:39Z | rev: 1

  - execution_id: 'EXECAT36JBH8VTM82150REVX' | :input
    set at 2025-09-04 05:06:04Z | rev: 0


- Not set:
  - large_value_alert: <unk> | :compute
  - sum: <unk> | :compute
  - y: <unk> | :input

Computations:
- Completed:


- Outstanding:
  - sum: ⬜ :not_set (not yet attempted) | :compute
       :and
        ├─ ✅ :x | &provided?/1 | rev 1
        └─ 🛑 :y | &provided?/1
  - large_value_alert: ⬜ :not_set (not yet attempted) | :compute
       🛑 :sum | &-normalize_gated_by/1-fun-0-/1

I have been using Journey in Phoenix applications, and since every customer flow is a Journey execution, it is easy for Journey to generate a bit of “flow analytics” / “flow funnel” data. This example, with just 3 executions in the system, is not particularly rich ; ) and it’s too new to have any retrospective “flow ends here” data, but think of a web analytics funnel and you get the idea:

iex(18)> Journey.Insights.FlowAnalytics.flow_analytics(graph.name, graph.version)  |> Journey.Insights.FlowAnalytics.to_text()|> IO.puts
Graph: 'demo graph'
Version: 'v1'
Analyzed at: 2025-09-04T05:10:16.228250Z

EXECUTION STATS:
----------
Total executions: 3
Average duration: 23 seconds
Median duration: 20 seconds

NODE STATS (4 nodes):
----------
Node Name: 'x'
Type: input
Reached by: 3 executions (100.0%)
Average time to reach: 19 seconds
Flow ends here: 0 executions (0.0% of all, 0.0% of reached)

Node Name: 'sum'
Type: compute
Reached by: 2 executions (66.7%)
Average time to reach: 17 seconds
Flow ends here: 0 executions (0.0% of all, 0.0% of reached)

Node Name: 'y'
Type: input
Reached by: 2 executions (66.7%)
Average time to reach: 17 seconds
Flow ends here: 0 executions (0.0% of all, 0.0% of reached)

Node Name: 'large_value_alert'
Type: compute
Reached by: 1 executions (33.3%)
Average time to reach: 13 seconds
Flow ends here: 0 executions (0.0% of all, 0.0% of reached)

…and so on.

I don’t know how this aligns with your particular use case, but this has been quite useful for building Phoenix applications that take a user through a flow that needs to be durable and reliable. I don’t need to keep reimplementing the same plumbing, and my applications have become thin, lightweight, and well-structured.

I am excited to see where your project takes you! ; )

4 Likes

It is fine. Actually, I found out about it while checking an implementation plan with Claude Code :grin: For the workflow need it recommended Temporal, so hopefully a full-fledged Elixir solution can soon be the answer.

It’s been a couple of months since we started integrating the engine into our project.
I’ve realized the best approach is to build the base, then use it in a real project and improve the engine gradually. As there are many unknowns, it took a lot of effort to figure things out and much more effort to refactor, but it’s worth it. :smiley:

I also built an admin UI to manage workflows and run them with test input.

1 Like