Suggestions for workflow & automation sequencing

I would recommend reading up on data flow graph models, behaviour trees, and state machine specifications. The idea of a data flow graph is that each step in a process is a node, and the edges connecting nodes represent data flow dependencies. I.e. given a piece of data fed into this “workflow” model, the output of one step is fed in as the input to the next step.
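As a minimal sketch (these particular steps are made up purely for illustration), a linear data flow can be represented as an ordered list of step functions and executed by threading the output of each step into the next:

```elixir
# A linear "workflow" as data: an ordered list of steps, each a one-argument function.
steps = [
  &String.trim/1,
  &String.downcase/1,
  &String.split(&1, ",")
]

# Execute the model by feeding the output of one step in as the input to the next.
run = fn input, steps ->
  Enum.reduce(steps, input, fn step, acc -> step.(acc) end)
end

run.("  Foo,Bar,Baz  ", steps)
#=> ["foo", "bar", "baz"]
```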

Regarding workflows like you describe, you also need control flow constructs, with rules (if this pattern, then do this) being the primitive. There are a lot of research rabbit holes here worth investigating, like RETE rules engines, which allow for performant evaluation of user-defined rules. RETE is pretty complicated, but the basic idea is worth mentioning:

A rule in a data flow model is really two steps:

  • The Pattern Matching Function/Step/Conditional Expression: Tell me when a piece of data matches some pattern (if this).
  • And the actual “work” function (then do this), e.g. a Step/Job.

The basic idea behind most of these rules engines is that the pattern matching functions are attached to the top of the data flow graph and the “work” functions are attached as nodes dependent on the pattern functions.
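A rough sketch of that shape in Elixir (the struct and function names here are mine, not from any particular rules engine): each rule pairs a condition function with a work function, and the work only fires for data that matched the condition:

```elixir
defmodule Rule do
  # A rule is two functions: a pattern match (if this) and the work (then do this).
  defstruct [:name, :condition, :work]
end

defmodule Rules do
  # Conditions sit at the top of the graph; a work node only fires
  # when the condition it depends on matched the incoming data.
  def run(rules, fact) do
    for %Rule{condition: condition, work: work} <- rules, condition.(fact) do
      work.(fact)
    end
  end
end

rules = [
  %Rule{name: :big_order, condition: &(&1.total > 100), work: &{:notify_sales, &1.id}},
  %Rule{name: :any_order, condition: fn _ -> true end, work: &{:log, &1.id}}
]

Rules.run(rules, %{id: 1, total: 250})
#=> [notify_sales: 1, log: 1]
```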

Further logical constructs such as AND / OR can be managed as dependent patterns. So if you have a logical expression such as IF X AND Y THEN DO Z, we could model that as something like:

+-------+    +-------+      +-------+
|       |    |       |      |       |
|   X   +---->   Y   +----->+   Z   |
|       |    |       |      |       |
+-------+    +-------+      +-------+

Where, for step Z to be run, both X and Y have to be true.
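A tiny sketch of that chaining (the conditions and the work function are invented for the example): Y is only evaluated on data that already passed X, and Z only runs once both have matched:

```elixir
# IF X AND Y THEN DO Z, modelled as dependent nodes: Y depends on X matching,
# and the work node Z depends on Y matching.
x = fn order -> order.total > 100 end
y = fn order -> order.country == "NZ" end
z = fn order -> {:free_shipping, order.id} end

run = fn order ->
  with true <- x.(order),
       true <- y.(order) do
    z.(order)
  else
    false -> :no_match
  end
end

run.(%{id: 42, total: 250, country: "NZ"})
#=> {:free_shipping, 42}
```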

This doesn’t really get interesting until we add other rules, such as IF X AND F THEN DO B, so we now have a shared dependency on condition X and some new dependencies. So were we to add this rule to our graph, we’d get something like:

             +-------+      +-------+
             |       |      |       |
    +-------->   F   +------>   B   |
    |        |       |      |       |
    |        +-------+      +-------+
+---+---+    +-------+      +-------+
|       |    |       |      |       |
|   X   +---->   Y   +----->+   Z   |
|       |    |       |      |       |
+-------+    +-------+      +-------+

What’s even more interesting about this is that our data model of data flow dependencies also reveals potential parallelism between steps. F and Y are both dependent on the result of X but not dependent on each other, so assuming step X produces a result, F and Y can then be run in parallel. The concurrency opportunities of modeling computations in a DAG are why these structures are used extensively in a lot of domains.
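To make that concrete, here's a hedged sketch (the node names and functions are invented): the graph is just a map from each node to its dependents, and once a node produces a result its children are run concurrently with Task.async_stream:

```elixir
# The graph as data: each node points at the nodes that consume its output.
graph = %{x: [:f, :y], f: [:b], y: [:z], b: [], z: []}

# Placeholder functions for each node, keyed by name.
nodes = %{
  x: fn input -> input * 2 end,
  f: fn val -> val + 1 end,
  y: fn val -> val + 10 end,
  b: fn val -> val * 100 end,
  z: fn val -> val - 1 end
}

defmodule Flow do
  # Run a node on its input, then run each of its dependents concurrently
  # on the result. F and Y don't depend on each other, so once X has run
  # they execute in parallel tasks.
  def run(graph, nodes, node, input) do
    result = nodes[node].(input)

    children =
      graph[node]
      |> Task.async_stream(fn child -> run(graph, nodes, child, result) end)
      |> Enum.map(fn {:ok, value} -> value end)

    {node, result, children}
  end
end

Flow.run(graph, nodes, :x, 21)
#=> {:x, 42, [{:f, 43, [{:b, 4300, []}]}, {:y, 52, [{:z, 51, []}]}]}
```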

You can find Directed Acyclic Graphs (DAGs) just about everywhere once you start looking. Git, TensorFlow, Apache Beam (not our Beam, but still a cool Beam anyhow), and Apache Airflow all use DAGs in some capacity. There are a lot of fun papers and a lot of diverse domains they’re used in, like expert systems, dynamic workflow modeling (more of what you’re looking for), game agent AI (usually called behaviour trees, but they’re essentially the same thing if you squint a bit), and signal processing.

Like an AST, a data flow graph is a data structure that represents some computation that can be run given some input. The difference is that something like the Elixir AST is manipulated at compile time, whereas a graph or nested map data structure of some kind is something you can manage at runtime. This also exposes a fairly difficult problem: how do we verify the correctness of a program we’re throwing together at runtime? Compilers and type systems are non-trivial. Is step X flowing into step Y producing an incompatible input? Compilers are hard, and doing these sorts of checks at compile time, let alone at runtime, is not easy. My own research has run into very large state spaces, and at this point I’ll need to learn a lot about property-based/generative testing built on layers of properties, plus an uncomfortable amount of category theory jargon.
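For what it's worth, a very naive version of that runtime check (purely illustrative, and it sidesteps all of the genuinely hard parts) is to tag each step with the shape it consumes and produces, then validate the edges before running anything:

```elixir
# Each step declares the shape it consumes and produces (a crude stand-in
# for a real type system). Step names and shapes here are hypothetical.
steps = %{
  parse:   %{in: :string, out: :map},
  enrich:  %{in: :map, out: :map},
  publish: %{in: :map, out: :ok}
}

edges = [parse: :enrich, enrich: :publish]

# An edge is invalid when the producer's output shape doesn't match the
# consumer's expected input shape.
bad_edges = fn steps, edges ->
  Enum.filter(edges, fn {from, to} -> steps[from].out != steps[to].in end)
end

bad_edges.(steps, edges)
#=> [] (every edge lines up, so this graph is, crudely, well-formed)
```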

Anyway, it’s a super fun research area, and one where I think the rich capabilities of the Elixir/Erlang runtime can really shine and make new strides.

I’d recommend watching these lectures by the late, great Patrick Winston.

This talk is a great overview of dataflow models, and the paper the talk references is a great source of more papers.

Here’s a nice doc describing RETE: https://cis.temple.edu/~giorgio/cis587/readings/rete.html

Here are some Elixir libraries worth checking out in no particular order:

That said, you can do all of this with vanilla Elixir, structs, and a bit of cleverness passing functions around or using protocols/behaviours. The idea is that you build a model of the computation as a data structure, separate from its execution at runtime; how you implement this can be optimized depending on your domain.
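For example, a hedged sketch of the behaviour-based flavour (nothing here is from a specific library): the workflow model is just an ordered list of step modules, and execution is a reduce over them:

```elixir
defmodule Step do
  # Every step implements one callback: take the accumulated state and
  # return the updated state.
  @callback run(map()) :: map()
end

defmodule Validate do
  @behaviour Step
  def run(state), do: Map.put(state, :valid?, map_size(state) > 0)
end

defmodule Notify do
  @behaviour Step
  def run(state), do: Map.put(state, :notified?, state.valid?)
end

defmodule Workflow do
  # The model (the list of step modules) is plain data you can build,
  # inspect, or reorder at runtime; running it is a simple reduce.
  def run(steps, input) do
    Enum.reduce(steps, input, fn step, state -> step.run(state) end)
  end
end

Workflow.run([Validate, Notify], %{order_id: 7})
#=> %{order_id: 7, valid?: true, notified?: true}
```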

Hope this helps!
