Squid Mesh - workflow automation runtime for Elixir applications

Squid Mesh is an open source workflow automation runtime for Elixir applications.

It is aimed at Phoenix and OTP apps that want to define and run durable workflows in code without rebuilding the runtime layer around retries, replay, inspection, cancellation, and scheduling.

Core Idea

  • define workflows declaratively with triggers, payload contracts, steps, transitions, and retries
  • run them durably on top of an existing Repo and Oban
  • inspect run history with step and attempt details
  • replay and cancel runs through a small public API
  • activate recurring workflows through cron-backed triggers

Current Features

  • manual triggers
  • cron triggers
  • step retries with exponential backoff
  • built-in :wait and :log steps
  • run inspection with history
  • replay and cancellation
  • HTTP tool adapter support

Example Workflow

defmodule Content.Workflows.PostDailyDigest do
  use SquidMesh.Workflow

  workflow do
    trigger :daily_digest do
      cron("0 9 * * 1-5", timezone: "Etc/UTC")

      payload do
        field(:feed_url, :string, default: "https://example.com/feed.xml")
        field(:discord_webhook_url, :string)
        field(:posted_on, :string, default: {:today, :iso8601})
      end
    end

    step(:fetch_feed, Content.Steps.FetchFeed)
    step(:build_digest, Content.Steps.BuildDigest)

    step(:post_to_discord, Content.Steps.PostToDiscord,
      retry: [
        max_attempts: 5,
        backoff: [type: :exponential, min: 1_000, max: 30_000]
      ]
    )

    transition(:fetch_feed, on: :ok, to: :build_digest)
    transition(:build_digest, on: :ok, to: :post_to_discord)
    transition(:post_to_discord, on: :ok, to: :complete)
  end
end
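
The retry options above only name a backoff type and bounds. As a rough illustration of what exponential backoff between `min` and `max` usually means, here is a minimal sketch. `BackoffSketch` and `delay_ms/2` are hypothetical names, not part of Squid Mesh, and the doubling formula is an assumption; the library may compute delays differently (for example with jitter).

```elixir
defmodule BackoffSketch do
  @moduledoc "Hypothetical helper for illustration only; not a Squid Mesh API."

  # Delay in milliseconds before the retry that follows the given failed
  # attempt (1-based). Starts at `:min`, doubles per attempt, capped at `:max`.
  def delay_ms(attempt, opts) when is_integer(attempt) and attempt >= 1 do
    floor_ms = Keyword.fetch!(opts, :min)
    cap_ms = Keyword.fetch!(opts, :max)

    min(floor_ms * Integer.pow(2, attempt - 1), cap_ms)
  end
end

opts = [min: 1_000, max: 30_000]

# With max_attempts: 5 there are at most four waits between attempts.
delays = Enum.map(1..4, &BackoffSketch.delay_ms(&1, opts))
IO.inspect(delays)
```

Under these assumptions the cap of 30_000 ms is only reached from the sixth attempt on, so it never applies with `max_attempts: 5`.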

Running from Host App

SquidMesh.start_run(Content.Workflows.PostDailyDigest, %{
  discord_webhook_url: System.fetch_env!("DISCORD_WEBHOOK_URL")
})

Under the Hood

  • Oban for durable execution and scheduling

  • Jido for custom step actions

  • Postgres for persisted run state

Status

This is still an early alpha release.

Current focus:

  • API shape

  • workflow contract

  • runtime model

Links

A bit more context on why Squid Mesh uses both Oban and Jido.

Oban is the durable execution backbone. Squid Mesh owns workflow state, retries, replay, cancellation, inspection, and step progression, but it still needs a strong execution layer underneath for queueing, scheduling, redelivery, and background execution across restarts and deploys. I didn’t want Squid Mesh to re-implement a job runtime when Oban is already very good at that layer.

Jido is currently used mostly at the step/action boundary. Right now that means custom workflow steps can be expressed as Jido actions instead of every app inventing its own step contract. That part is useful already, but I also think Jido could play a bigger role over time if Squid Mesh grows more agent-oriented execution patterns, richer action libraries, or more AI-heavy workflow steps. So today Oban is the more foundational dependency, while Jido is more about the execution contract and future direction.

On “why not just use Oban workflows?”: I think that’s a fair question. If a team already has Oban Pro and is happy building directly at that layer, Squid Mesh may be unnecessary. The reason this project exists is to provide a higher-level workflow runtime for application code: workflows as a first-class app concept, with a DSL, run/step/attempt history, replay, cancellation, cron-backed triggers, and a small public API around them. So the goal isn’t to compete with Oban as a job system, but to sit one layer above it.

If helpful, a simple way to describe the split is:

  • Oban handles durable execution
  • Jido handles custom step actions
  • Squid Mesh handles workflow semantics

Next on the roadmap is making the runtime less linear and more workflow-shaped: dependency-based steps, parallel execution, branching/conditions, clearer step input/output mapping, and better inspection for non-linear runs. Jido will stay at the custom step boundary for now, while the main focus is making Squid Mesh itself a stronger workflow runtime.

Congratulations on the release!

This is a super interesting area for me, and I did a sizable amount of research over the weekend, so most things are still fresh in my head. Here is what it would take for me to use SquidMesh, i.e. what it does not have right now:

  • Human-in-the-loop state. You have a state for waiting that’s basically delegating to Oban’s scheduled_at. For my work I need a halt and manual unblocking facilities.
  • No compensation / undo / rollback per each step. To me the Saga pattern is an absolute requirement.
  • Related to the previous: the distinction between compensation and undo (Ash’s Reactor taught me this). With “compensation” you might have a partial success followed by an error, and the compensation step might “finish the job” (this happens f.ex. with big streaming file reads). “Undo” is what it says: “this has failed, we don’t want to retry, let’s just undo whatever it did”.
  • Some steps should have :irreversible undo mechanics. F.ex. posting to a 3rd party ledger in the cloud. (Though that’s not the best example because accounting operations can be voided.)
  • Access to currently accumulated state a la Ecto.Multi. I have plenty of production workflows where f.ex. a customer got a tentative KYC result from a 3rd party service and that means a downstream step in the workflow must not execute in this case and be demoted to an Oban job that checks every 24h until KYC state resolves properly and only then finish a certain step – but the workflow should still run to completion otherwise (just that this one particular step becomes a future background job).
  • No parallel workflows or dynamic adding of steps a la Oban Pro. Not crucial for me at the moment but I know I’ll need it.
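
To make the "accumulated state a la Ecto.Multi" point concrete, here is a minimal sketch in plain Elixir. Every name in it (`AccumulatedStateSketch`, the step names, the `{:defer, reason}` return value) is hypothetical and not a Squid Mesh API. Each step receives the results of all earlier steps and can decide to defer itself while the rest of the workflow still runs to completion.

```elixir
defmodule AccumulatedStateSketch do
  # `steps` is a keyword list of {name, fun}. Each fun receives a map with the
  # results of every earlier step and returns {:ok, value}, {:defer, reason},
  # or {:error, reason}.
  def run(steps) do
    Enum.reduce_while(steps, {:ok, %{}}, fn {name, fun}, {:ok, acc} ->
      case fun.(acc) do
        {:ok, value} -> {:cont, {:ok, Map.put(acc, name, value)}}
        # A deferred step is recorded, and later steps keep running.
        {:defer, reason} -> {:cont, {:ok, Map.put(acc, name, {:deferred, reason})}}
        {:error, reason} -> {:halt, {:error, name, reason, acc}}
      end
    end)
  end
end

steps = [
  kyc_check: fn _acc -> {:ok, :tentative} end,
  open_account: fn
    %{kyc_check: :approved} -> {:ok, :opened}
    %{kyc_check: :tentative} -> {:defer, :recheck_kyc_in_24h}
  end,
  send_welcome: fn _acc -> {:ok, :sent} end
]

IO.inspect(AccumulatedStateSketch.run(steps))
```

In a durable engine the deferred branch would presumably enqueue a background job rather than just record a marker; that part is left out here.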

IMO you should compare against Sage, Reactor, FlowStone and Runic by @MrDoops. I liked Runic the most but ultimately decided not to go for it because I can’t force a single Repo.transaction (or Repo.transact) across sub-steps i.e. while doing fan-out – Jido has the same limitation AFAIR. Runic did come super close for me though and in the end I decided to roll my own durable workflow execution engine because I’d not use most of Runic and will have to use some of its hooks to emulate what I need (which is not a bad thing; I just did not want to do it).


One interesting thing that my research kind of surfaced: I was wondering whether I can’t just use somebody’s super developed agentic framework but it turned out that:

  • Agentic frameworks more or less optimize for long-running stateful agents hooked to them, which means: routing of signals (messages) and emitting directives. That very often means that the workflow must “freeze” on idling and be “thawed” when the work resumes. And the 2-3 frameworks I checked (Jido included) maintain in-memory state which is acceptable for them because most sessions don’t last super long and because they are all readable on-resume. This does not work well for f.ex. financial saga workflows. Every single step must be auditable.
  • Workflow engines mostly optimize for short-lived multi-step transactions with non-negotiable transaction durability, i.e. every step is persisted before and after execution, every termination state must be known, and compensation / undo must follow auditable cleanup paths (so compensation and undo must also log/store stuff). The main unit of work is an operation that must be mostly reversible, mostly atomic, and fully observable / auditable.

I know your thing is new and it can’t do everything. Don’t take it as a cold shower type of criticism, please. I am making use of the interesting coincidence that this was posted literally 3 days after I finished fairly detailed research on the topic. Use it or don’t. We all have our own requirements. To me durability, full compensation/undo support, explicit marking of steps as irreversible, the ability to force a single Repo.transaction across a step + its sub-steps and, well, support for fan-out are all non-negotiable for what I do. To you it might be other factors.

Thanks, this is extremely useful feedback!

A lot of what you listed maps pretty closely to the gaps I already felt were still there, but you made the boundary much clearer for me.

The biggest things I’m taking from your comment are:

  • human-in-the-loop as a first-class runtime concern, not just a delayed wait
  • compensation / undo / rollback semantics as a core workflow primitive
  • explicit irreversibility for certain steps
  • proper fan-out / parallel execution
  • richer access to accumulated workflow state and more flexible data flow
  • stronger transactional thinking around sub-steps and durable auditing

I also think your distinction between agent frameworks and workflow engines is right. The more I work on Squid Mesh, the more it feels like the real target is not “agent runtime with workflows attached”, but “durable workflow runtime that can happen to execute agentic steps when needed”.

The comparison set you suggested also makes sense to me:

  • Sage feels strongest on compensation and saga semantics
  • Reactor seems ahead on undo / compensation modeling and transaction-oriented orchestration
  • FlowStone feels stronger on lineage, dependencies, and approval-style orchestration, though in a more asset/data-oriented direction
  • Runic seems especially interesting for DAG/dataflow and parallel graph execution

So from where I’m sitting right now, Squid Mesh is still underpowered compared to that group in some important areas, especially:

  • saga/compensation semantics
  • HITL / unblockable waiting states
  • fan-out / joins
  • richer graph/dataflow behavior

That’s not discouraging to me, it actually helps. It gives me a better sense of which problems are real enough to be worth targeting instead of just adding features because they sound workflow-ish.

I’m very willing to implement a lot of what you suggested if I can find the right shape for it. What I’d especially appreciate, if you’re open to it, is help on the maturation process itself:

  • which of these concerns should become first-class primitives first
  • which are essential for a serious durable workflow runtime vs just nice-to-haves
  • where you think Squid Mesh should intentionally stop instead of trying to absorb every idea from Sage / Reactor / Runic / FlowStone
  • whether the “application workflow runtime” positioning is actually the right lane for it

So really, thank you. This is exactly the kind of feedback I was hoping for.

Here’s how I’m framing Squid Mesh right now:

An embeddable workflow runtime with a clean DSL, where each step is easy to read and reason about.

That’s the core constraint I’m optimizing for:

  • workflows that feel native inside Phoenix / OTP apps
  • durable execution on top of the existing Repo + Oban
  • a simple, host-app-facing runtime API
  • definitions that stay readable, not buried in orchestration plumbing

So the approach is pretty deliberate:

Start with “embeddable runtime + readable DSL”, then let real workflow use cases pull new primitives into existence.

Not trying to front-load every orchestration feature upfront.

If that direction doesn’t hold up in practice, there’s no point reinventing what’s already out there. But so far, it feels like a gap worth exploring.

Thinking about this a bit and comparing against my own current needs, this is what I’d want first:

  • Compensation and generally Saga semantics;
  • Compensate vs. undo distinction;
  • Have steps that are :irreversible – better example would be a DB DELETE operation. Whether that means that during a chain undo the engine shuts down and refuses to proceed by saying “I have stumbled upon an irreversible step and refuse to finish the full undo chain” or just returns a partial success result saying “These are the steps that I have undone but these 2 steps are irreversible so nothing could be done about them”… that’s an open question. I’d probably go for the latter;
  • The accumulated state;
  • HITL mechanics, at least “pause when you get this signal / message” and “resume on receiving this payload or via an explicit command” (the latter is to just un-pause something that does not actually rely on a human so it might be unwelcome and a scope creep, now that I think of it).
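
The "partial success" option for irreversible steps can be sketched in a few lines of plain Elixir. This is only an illustration of the idea discussed above, not how Squid Mesh (or Sage, or Reactor) implements it; `UndoChainSketch` and the step names are hypothetical.

```elixir
defmodule UndoChainSketch do
  # `completed` lists finished steps oldest-first as {name, undo}, where undo is
  # either :irreversible or a zero-arity function that returns :ok.
  # Steps are undone newest-first; irreversible ones are reported, not skipped silently.
  def undo_all(completed) do
    report =
      completed
      |> Enum.reverse()
      |> Enum.reduce(%{undone: [], irreversible: []}, fn
        {name, :irreversible}, report ->
          %{report | irreversible: [name | report.irreversible]}

        {name, undo}, report when is_function(undo, 0) ->
          :ok = undo.()
          %{report | undone: [name | report.undone]}
      end)

    # Lists were built by prepending, so reverse them back into undo order.
    report = Map.new(report, fn {key, names} -> {key, Enum.reverse(names)} end)

    if report.irreversible == [], do: {:ok, report}, else: {:partial, report}
  end
end

completed = [
  reserve_funds: fn -> :ok end,
  delete_record: :irreversible,
  send_email: fn -> :ok end
]

IO.inspect(UndoChainSketch.undo_all(completed))
```

The alternative policy (stop at the first irreversible step and refuse to continue) would replace the reduce with `Enum.reduce_while/3` and halt in the `:irreversible` clause.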

What can wait:

  • Fan-out / parallel. Most business processes don’t need this.
  • Any DAG or conditional branching primitives: I am only mentioning this because my research showed me some people need this and others encode it in workflow engines. I admit I have zero interest in it but I can see how it can be useful (f.ex. what I mentioned earlier about KYC outcomes and where do we go from there).
  • Dynamic step injection a la ObanPro’s grafting (I suppose). Very low-prio for me. I don’t think too many people generally need this but I have had cases where it was needed and I recognize it’s useful.

To me your idea to position SquidMesh as a workflow engine that can happen to also execute some agentic flows is the right thing to do – there are literal thousands of agentic frameworks out there and IMO you don’t want to compete with them. I’d normally go for Sage but it does not have persistence, and I cannot get Oban Pro for every project I work on (stakeholders don’t approve the expense). So your engine can be somewhere in the middle of Sage/Runic and Oban Pro, it seems to me.

Alright. I’ve put some of your ideas on my roadmap: Milestones · ccarvalho-eng/squid_mesh · GitHub

Milestones order:

  1. Graph Workflows
  2. Operational Workflows
  3. Durable Recovery
  4. Saga Semantics
  5. Maturity And Positioning

Released squid_mesh 0.1.0-alpha.2.

This update adds a few important workflow/runtime pieces:

  • dependency-based workflows with after: […]
  • explicit error routing with transition(…, on: :error, …)
  • explicit step input / output mapping
  • improved inspection for non-linear runs with graph-aware step history
  • a round of runtime hardening around dependency scheduling and concurrency
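
For readers unfamiliar with dependency-based scheduling, the core question is "which steps can start now?". The sketch below answers it for a plain map of dependencies. `DependencySketch`, `runnable/2`, and the step names are hypothetical; this is not the Squid Mesh scheduler, only the general idea behind `after:`.

```elixir
defmodule DependencySketch do
  # `deps` maps each step to the list of steps it must run after.
  # Returns the steps that are not done yet and whose dependencies are all done.
  def runnable(deps, done) do
    done = MapSet.new(done)

    deps
    |> Enum.filter(fn {step, after_steps} ->
      not MapSet.member?(done, step) and
        Enum.all?(after_steps, &MapSet.member?(done, &1))
    end)
    |> Enum.map(fn {step, _after_steps} -> step end)
    |> Enum.sort()
  end
end

deps = %{
  fetch_feed: [],
  fetch_prefs: [],
  build_digest: [:fetch_feed, :fetch_prefs],
  send_digest: [:build_digest]
}

IO.inspect(DependencySketch.runnable(deps, []))
IO.inspect(DependencySketch.runnable(deps, [:fetch_feed, :fetch_prefs]))
```

Two steps with no dependencies become runnable together, which is where the concurrency hardening mentioned above starts to matter.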

Also expanded the example host app and smoke coverage to exercise these paths end to end.

Hex: squid_mesh | Hex
Docs: squid_mesh v0.1.0-alpha.2 — Documentation
Release notes: Release v0.1.0-alpha.2 · ccarvalho-eng/squid_mesh · GitHub

Next milestone is focused on human-in-the-loop workflow primitives.

I published squid_mesh v0.1.0-alpha.3.

This release adds human-in-the-loop workflow support:

  • :pause steps can move a run into a durable paused state and resume through
    SquidMesh.unblock_run/2.
  • approval_step/2 adds an explicit approve/reject contract for manual review gates, with
    SquidMesh.approve_run/3 and SquidMesh.reject_run/3.
  • inspect_run(..., include_history: true) now includes audit events for pause, resume,
    approval, and rejection.
  • Paused and approval runs persist their resume targets and output mapping, so already-paused
    runs keep the same behavior across restarts and deploys.

The release also hardens duplicate delivery, cancellation, retry, dispatch-failure, and stale-running-step behavior. Stale running step reclaim is now opt-in; by default, duplicate/redelivered jobs skip already-running steps instead of starting another attempt after a timeout.

Install:

{:squid_mesh, "~> 0.1.0-alpha.3"}

Release notes: Release v0.1.0-alpha.3 · ccarvalho-eng/squid_mesh · GitHub

The production-readiness warning still applies. External side-effect steps should use application-owned idempotency keys or another duplicate-safety strategy.
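
One common way to get an application-owned idempotency key is to derive it from values that stay the same across retries and redeliveries. The sketch below assumes the host app has a stable run ID and step name available inside the step; `IdempotencySketch` and the example run ID are hypothetical, and how a step actually obtains those values in Squid Mesh is not shown here.

```elixir
defmodule IdempotencySketch do
  # Same run and step always produce the same key, so a redelivered job
  # sends the same key to the external API and can be deduplicated there.
  def key(run_id, step_name) when is_binary(run_id) and is_atom(step_name) do
    :sha256
    |> :crypto.hash([run_id, ":", Atom.to_string(step_name)])
    |> Base.encode16(case: :lower)
  end
end

first = IdempotencySketch.key("run-123", :post_to_discord)
again = IdempotencySketch.key("run-123", :post_to_discord)
IO.inspect(first == again)
```

The key would typically be sent as a request header or stored in a unique-indexed column, depending on what the external system supports.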

Awesome, love the progress.

Forgot to ask you earlier: do you plan to add an “explanation” function of sorts like Journey does? Basically it can tell you “this has failed because X” or “this did not resume because Y”.

EDIT: talking about Journey.Tools.introspect/1.

Yes, that is very much in scope.

Right now Squid Mesh exposes the raw ingredients through inspect_run(..., include_history: true): run status, current step, step runs, attempts, errors, retry state, pause/approval audit events, etc. But that is still more of a structured runtime view than an explanation layer.

I do want something closer to Journey.Tools.introspect/1: a public function that can answer
questions like:

  • why is this run stopped?
  • why did this step fail?
  • why did this run not resume?
  • is it waiting on approval, retry delay, dependencies, cancellation, or a terminal state?
  • what operator action is possible next?

I’d probably keep it separate from inspect_run/2, maybe something like
SquidMesh.explain_run/2 or SquidMesh.introspect_run/2, so inspect_run remains the factual
read model and the explanation API becomes the higher-level diagnostic layer.

This release moved a lot of the needed data into durable state, especially pause/approval resume metadata and manual audit history. So the next step is mostly shaping that data into a stable explanation contract rather than scraping logs or recomputing from current workflow code.
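
As a rough idea of what "shaping that data into an explanation contract" could look like, here is a sketch that maps a run snapshot to a reason plus possible operator actions. The snapshot shape, status atoms, and action names are all assumptions made for illustration; they are not the actual `inspect_run/2` result or a committed explanation API.

```elixir
defmodule ExplainSketch do
  # Every status and key below is hypothetical, chosen only for illustration.
  def explain(%{status: :paused, current_step: step}) do
    %{reason: {:paused_at, step}, next_actions: [:unblock, :cancel]}
  end

  def explain(%{status: :awaiting_approval, current_step: step}) do
    %{reason: {:awaiting_approval_at, step}, next_actions: [:approve, :reject, :cancel]}
  end

  def explain(%{status: :retrying, current_step: step, retry_in_ms: ms}) do
    %{reason: {:retry_scheduled, step, ms}, next_actions: [:cancel]}
  end

  def explain(%{status: :failed, current_step: step, last_error: error}) do
    %{reason: {:step_failed, step, error}, next_actions: [:replay]}
  end

  def explain(%{status: status}) when status in [:completed, :cancelled] do
    %{reason: {:terminal, status}, next_actions: []}
  end
end

snapshot = %{status: :failed, current_step: :post_to_discord, last_error: :timeout}
IO.inspect(ExplainSketch.explain(snapshot))
```

Returning tagged tuples rather than strings keeps the result machine-readable; a human-facing message can be rendered from it separately.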

I’d appreciate any architectural feedback too. Originally, I was mostly thinking of building an “n8n” using Elixir macros/functions (it’s easier to QA and load test compared to drag-and-drop diagrams), but I think I am sliding into a more serious pit. I love the idea that people can run a reusable workflow framework for free, but at the same time I want this to be as reliable as possible, even for mission-critical things like payments.

Oh, absolutely. Machine-readable introspection state would be VERY valuable to have so let’s not lose that.

Pretty good instinct, bravo. You can still have it later but don’t get distracted while you are working on building blocks and baseline functionality that you want to have.

Trust me, I get it. I have thousands of lines of code to review in my SQLite3 Ecto driver. Keeping both my and Claude’s eyes on target – “let’s have full parity with Ecto, no excuses, if something is difficult we are still doing it” – and not doing some other stuff like “but how do we limit writers so SQLite does not get overwhelmed, as that’s a documented limitation of the database itself?” takes actual willpower and energy.

Thank you for the vote of confidence. My life phase is still “barely finding time to make my wife feel like a woman in a relationship” but every now and then (1-2 times a month) I can scrounge some extra time. No promises.

That’s kind of why I started engaging with you in the first place. A lot of libraries start and end as hobby projects. But when you have stuff to orchestrate f.ex. you have created a user in an external Auth0 system, or posted entries in a SaaS financial ledger, or have in fact called Stripe and have now paid for something, or even stood up a remote infra (micro VM) as part of your saga workflow etc. then this is where the interesting technical work lives as well – deeply thinking about each step’s undo / compensation semantics, aggressively reordering steps so you don’t touch 3rd party systems before you make super sure that all your validations pass and all necessary DB and/or cache records are created/updated/deleted, and making sure all your return values from functions are good and machine-parseable.

I’m thinking about dogfooding Squid Mesh in a separate Phoenix side project: a Telegram bot that delivers Hacker News RSS digests.

The idea is intentionally small but realistic. A bot like this would exercise the host-app workflow contract without needing a huge application:

  • scheduled workflow starts
  • fetching an external RSS feed
  • deduplicating already-sent items
  • formatting Telegram messages
  • retrying failed Telegram API calls
  • user subscription/preferences commands
  • disabling or cancelling deliveries
  • operator inspection with inspect_run/2 and explain_run/2

A possible workflow could look like:

workflow :deliver_hn_digest do
  step :fetch_feed, FetchHackerNewsRss
  step :dedupe_items, DedupeItems, after: :fetch_feed
  step :rank_items, RankItems, after: :dedupe_items
  step :format_message, FormatTelegramDigest, after: :rank_items
  step :send_digest, SendTelegramMessage, after: :format_message
end

This feels like a good proving ground because it has real side effects and failure modes, but the domain is easy to understand. It should quickly show whether Squid Mesh feels good inside a normal Phoenix app with its own Repo, Oban, migrations, and operational surface.

One thing I’ve been thinking through is reliability. I was starting to consider adding a heartbeat/lease system for running steps, but I’m now leaning toward that being premature.

My current thinking is that Squid Mesh can still be reliable without custom heartbeats if the
contract is clear:

  • Oban owns job durability and redelivery
  • Squid Mesh persists workflow runs, step runs, and attempts
  • duplicate deliveries are guarded at the workflow layer
  • external side effects need idempotency keys or duplicate-safe behavior
  • long waits should be modeled as scheduled continuation, not sleeping workers
  • long-running in-process steps should be avoided or treated carefully

So heartbeat/leases may be an advanced feature for long-running worker-held steps, not something the library needs before it is useful. But I may be wrong here, and I’d be interested if others think custom leases/heartbeats are necessary earlier for this kind of workflow runtime.
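
The "duplicate deliveries are guarded at the workflow layer" point can be illustrated with a claim check: a delivered job only starts a step if that step is still pending. The sketch below uses an in-memory map and hypothetical status atoms; in a real runtime this would presumably be an atomic conditional update in Postgres, and it is not a description of Squid Mesh internals.

```elixir
defmodule ClaimSketch do
  # `step_runs` maps step names to a status atom (hypothetical statuses).
  # Only a :pending step can be claimed; anything else means this delivery
  # is a duplicate or arrived too late, so the job should skip the step.
  def claim(step_runs, step) do
    case Map.fetch(step_runs, step) do
      {:ok, :pending} -> {:ok, Map.put(step_runs, step, :running)}
      {:ok, status} -> {:skip, status}
      :error -> {:error, :unknown_step}
    end
  end
end

step_runs = %{send_digest: :pending}

{:ok, claimed} = ClaimSketch.claim(step_runs, :send_digest)
# A redelivered job sees the step already running and backs off.
IO.inspect(ClaimSketch.claim(claimed, :send_digest))
```

This matches the default described for alpha.3, where redelivered jobs skip already-running steps; reclaiming a stale `:running` step is the part a lease or heartbeat would address.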

Longer term, this bot could also become a place to try Jido-powered agent_steps:

  • summarize stories
  • classify topics
  • personalize rankings
  • explain why a story was included

The goal would be to learn from actual usage before adding heavier runtime features like
leases, heartbeats, saga semantics, or deeper agent integration.

I started an empty repo with a PLAN.md

Dogfood app is done. Documented my findings here

Squid Mesh 0.1.0-alpha.4 is out.

This release adds:

  • SquidMesh.explain_run/2 for operator-facing run diagnostics, including the current reason, valid next actions, and supporting evidence
  • multiple triggers per workflow, so a workflow can expose any mix of manual and cron entrypoints
  • a minimal host app example showing one workflow started manually or by cron
  • a fresh current-schema install migration from mix squid_mesh.install
  • structured :invalid_run_id errors for malformed public run IDs

Package:

Docs:

GitHub release:

One note: this is still alpha. The migration installer now emits one current-schema Squid Mesh migration and does not include a compatibility path for older split Squid Mesh migrations.

One design note for the roadmap: I still expect Squid Mesh to keep using Jido over time, but likely more as an internal execution/action substrate than as the primary user-facing abstraction.

The direction I’m leaning toward is:

  • Squid Mesh owns the durable workflow contract: triggers, persistence, retries, replay, audit, inspection, and host-app integration.
  • A native SquidMesh.Step contract becomes the preferred custom-step API.
  • Existing Jido.Action modules remain supported as an interop path.
  • Internally, Squid Mesh can keep using more of Jido where it helps without making every Jido concept part of the Squid Mesh public API.

That should keep the authoring model focused on Squid Mesh while preserving the option to lean further into Jido for agentic/runtime pieces.

cc @mikehostetler

Longer term, this is something I could imagine building after the runtime matures further: a visual companion for inspecting Squid Mesh workflows and run state. For now I’m staying focused on the runtime, durability, and host-app integration, but I like the idea of eventually making workflow shape, transitions, retries, and manual gates easier to see at a glance.

Released squid_mesh v0.1.0-alpha.5.

This adds step recovery markers for irreversible/non-compensatable side effects, persists recovery policy into step history, surfaces it in inspection/explanations, and blocks replay by default unless explicitly overridden with allow_irreversible: true.
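
The replay rule described above can be sketched as a small guard. Only the `allow_irreversible: true` option comes from the release note; the module name, the step-history shape, and the error tuple are assumptions for illustration and may not match the real API.

```elixir
defmodule ReplayGuardSketch do
  # `step_history` is a list of maps with hypothetical :name and :recovery keys.
  # Replay is refused if any recorded step is irreversible, unless the caller
  # explicitly opts in with allow_irreversible: true.
  def check_replay(step_history, opts \\ []) do
    irreversible = for %{name: name, recovery: :irreversible} <- step_history, do: name

    cond do
      irreversible == [] -> :ok
      Keyword.get(opts, :allow_irreversible, false) -> :ok
      true -> {:error, {:irreversible_steps, irreversible}}
    end
  end
end

history = [
  %{name: :fetch_feed, recovery: :retryable},
  %{name: :charge_card, recovery: :irreversible}
]

IO.inspect(ReplayGuardSketch.check_replay(history))
IO.inspect(ReplayGuardSketch.check_replay(history, allow_irreversible: true))
```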

Hex: squid_mesh | Hex
Docs: squid_mesh v0.1.0-alpha.5 — Documentation