Building a ComfyUI-like system in Elixir

The origin of this idea goes beyond just building “Phoenix’s ComfyUI.” I want to combine BEAM’s parallel performance and fault tolerance with elegant syntax (even creating a friendly DSL), which are precisely the reasons I started learning Elixir.

Before posting, I explored existing FBP (Flow-Based Programming) related works. However, most packages either don’t fully align with my vision or are domain-specific (e.g., web-focused applications like Ash/Reactor).

While I welcome discussions about flow programming/job orchestration feasibility in Elixir, I want to ground this conversation in practical context to avoid purely theoretical debates.

My motivation stems from an ongoing project: building a singing synthesizer interface/app that integrates multiple models (initially aiming to implement an Elixir wrapper for DiffSinger with a simple WebUI).

For those familiar with singing synthesizers like Vocaloid: generating audio from lyrics involves multiple steps and requires parameter adjustments at various abstraction levels (note duration, syllable timing, pitch curves, etc.). This complexity necessitates choosing different models/phoneme dictionaries, which inspired me to build a tool that lowers the usage barrier.

A ComfyUI-like system seems ideal for this scenario. A “track” could be defined as a workflow of interdependent tasks, where individual tasks might require operations from others.

I’ve created a repo under SynapticStrings/QyEditor: “Lightweight synthesizer interface.” Currently, the project is primarily in Chinese due to my limited English proficiency and an intentional delay in internationalization (docs, comments, etc.) until the core is stable.

The architecture splits into two applications:

  1. :qy_core handles parameter manipulation and chaining (similar to Plug’s philosophy).
  2. :qy_flow manages parallelism, task scheduling, and process orchestration using libraries like GenStage/Flow.
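
To make the Plug comparison concrete, here is a minimal sketch of parameter chaining in plain Elixir. It is not QyEditor's actual API: the module name `QySketch.Chain`, the `:midi_note` key, and both example steps are hypothetical. Each step receives a params map and returns either `{:ok, params}` or `{:error, reason}`, and the chain halts at the first error.

```elixir
defmodule QySketch.Chain do
  # Hypothetical names throughout; not QyEditor's actual API.
  # A step is any function that takes a params map and returns
  # {:ok, params} or {:error, reason}, loosely like a Plug transforming a conn.

  @type params :: map()
  @type step :: (params() -> {:ok, params()} | {:error, term()})

  @spec run(params(), [step()]) :: {:ok, params()} | {:error, term()}
  def run(params, steps) do
    # Thread the params through every step, stopping at the first error.
    Enum.reduce_while(steps, {:ok, params}, fn step, {:ok, acc} ->
      case step.(acc) do
        {:ok, next} -> {:cont, {:ok, next}}
        {:error, _reason} = error -> {:halt, error}
      end
    end)
  end
end

# Example step: raise the note by one octave.
transpose = fn params -> {:ok, Map.update!(params, :midi_note, &(&1 + 12))} end

# Example step: reject notes outside the MIDI range.
check_range = fn
  %{midi_note: note} = params when note in 0..127 -> {:ok, params}
  %{midi_note: note} -> {:error, {:out_of_range, note}}
end

result = QySketch.Chain.run(%{midi_note: 60}, [transpose, check_range])
```

Because every step has the same shape, steps can be reordered or reused freely, which is the property a scheduler like :qy_flow would rely on.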

Key questions:

  1. Is building a ComfyUI-like system in Elixir feasible?
  2. Am I on the right track with this architecture? What should be my next steps?

P.S. I know the optimal path would be to implement DiffSinger’s pipeline in Elixir first. However, calling its ONNX model via Ortex throws errors, and my limited Rust/ML debugging skills prevent me from resolving this.


It is very likely possible, but would require a lot of community work, and a few contributors.

Now even more possible with PythonX


Hi everyone,

A while ago, I asked for advice on the feasibility of building a node-based, ComfyUI-like workflow system in Elixir. My long-term goal is to leverage Elixir’s concurrency and OTP to build a better, more maintainable AI workflow engine, hopefully avoiding the spaghetti graphs and poor abstraction often seen in current Python-based node tools.

Here is a quick update on my progress!

1. What’s Done: The Core Engine & ML Integration

I haven’t built the visual node editor yet, but I have successfully completed the core pipeline engine, published as Orchid and OrchidSymbiont.

To prove it works in the real world, I’ve implemented an end-to-end inference pipeline for DiffSinger (a Singing Voice Synthesis model) entirely in Elixir. (My first attempt at a real-world test was refactoring my blog engine, but I’m not good at front-end.)

You can check out the working proof-of-concept here in this Livebook:
simple_run.livemd (Note: running it requires downloading the model to your device, about 500 MB).

Here is the result generated purely through Elixir (Mel-spectrogram, F0 pitch curve, and Phoneme boundaries rendered using VegaLite):

2. What’s Next: Visual Interface

Currently, the data pipeline is constructed programmatically via a manual DSL. It looks something like this:

Whole steps in pipeline
[
  {DurationPredictEncoder, :words,
   [:duration_lang, :duration_phoneme, :word_division, :word_duration, :duration_ph_midi], []},
  {PredictDuration,
   [:duration_lang, :duration_phoneme, :word_division, :word_duration, :duration_ph_midi],
   :phoneme_duration_predict, [extra_hooks_stack: [Orchid.Symbiont.Hooks.Injector]]},
  {PitchPredictEncoder, :words,
   [:pitch_lang, :pitch_phoneme, :word_duration_from_pitch, :pitch_ph_midi], []},
  {PredictPitch, [:pitch_lang, :pitch_phoneme, :phoneme_duration_predict, :pitch_ph_midi],
   :pitch_pred_midi, [extra_hooks_stack: [Orchid.Symbiont.Hooks.Injector]]},
  {fn %Orchid.Param{payload: midi_pred}, _opts ->
    {:ok, midi_pred
      |> Nx.backend_transfer(Nx.BinaryBackend)
      |> Nx.add(-69.0)
      |> Nx.divide(12.0)
      |> then(&Nx.pow(2.0, &1))
      |> Nx.multiply(440)
      |> then(fn converted_f0 ->
           Nx.select(Nx.less(Nx.backend_transfer(midi_pred, Nx.BinaryBackend), 0.0), Nx.tensor(0.0), converted_f0)
         end)
      |> then(&Orchid.Param.new(:f0, :tensor, &1))} end, :pitch_pred_midi, :pitch_pred, []},
  {VarianceEncoder, :words,
   [:variance_lang, :variance_phoneme, :word_duration_from_variance, :variance_ph_midi], []},
  {VarianceModel, [:variance_lang, :variance_phoneme, :phoneme_duration_predict, :pitch_pred_midi],
   [:breathiness_pred, :voice_pred], [extra_hooks_stack: [Orchid.Symbiont.Hooks.Injector]]},
  {Acoustic,
   [:variance_lang, :variance_phoneme, :phoneme_duration_predict, :pitch_pred, :breathiness_pred,
    :voice_pred], :mel, [extra_hooks_stack: [Orchid.Symbiont.Hooks.Injector]]},
  {NSFHifiGAN_Vocoder, [:mel, :pitch_pred], :wave_tensor,
   [extra_hooks_stack: [Orchid.Symbiont.Hooks.Injector]]},
  {TensorToWave, :wave_tensor, :audio, [sample_rate: 44100]}
]
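
As a rough illustration of how a list like this could be turned into an execution plan, the sketch below groups steps into layers by matching input keys against the keys already produced. It is not Orchid's scheduler: the module `PipelineSketch`, the simplified `{name, inputs, outputs}` tuples, and the shortened key names are all assumptions made for the example.

```elixir
defmodule PipelineSketch do
  # Illustrative only: not Orchid's scheduler. Steps are simplified to
  # {name, inputs, outputs}, where inputs/outputs are an atom or a list of atoms.

  # Groups steps into layers. Every step in a layer has all of its inputs
  # available, so steps within the same layer do not depend on each other.
  def layers(steps, initial_keys) do
    do_layers(steps, MapSet.new(initial_keys), [])
  end

  defp do_layers([], _available, acc), do: {:ok, Enum.reverse(acc)}

  defp do_layers(pending, available, acc) do
    {ready, blocked} =
      Enum.split_with(pending, fn {_name, inputs, _outputs} ->
        inputs |> List.wrap() |> Enum.all?(&MapSet.member?(available, &1))
      end)

    case ready do
      [] ->
        # Nothing can run: an input is missing or the steps form a cycle.
        {:error, {:unsatisfied, Enum.map(blocked, &elem(&1, 0))}}

      _ ->
        produced = Enum.flat_map(ready, fn {_n, _i, outputs} -> List.wrap(outputs) end)
        names = Enum.map(ready, &elem(&1, 0))
        do_layers(blocked, MapSet.union(available, MapSet.new(produced)), [names | acc])
    end
  end
end

# A shortened, hypothetical version of the pipeline above.
steps = [
  {:duration_encoder, :words, [:duration_features]},
  {:predict_duration, [:duration_features], :phoneme_duration},
  {:pitch_encoder, :words, [:pitch_features]},
  {:predict_pitch, [:pitch_features, :phoneme_duration], :pitch_midi},
  {:midi_to_f0, :pitch_midi, :f0}
]

{:ok, plan} = PipelineSketch.layers(steps, [:words])
```

In this shortened example only the two encoders share a layer; everything after them runs one step at a time, since each prediction needs the previous one's output.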

The immediate next step is mapping this backend execution engine to a Visual Node Interface.
I am planning to use Phoenix LiveView integrated with a frontend node library (like SvelteFlow or LiteGraph.js), or perhaps try packaging it as a Kino Smart Cell for Livebook first.

3. Challenges & Looking for Advice

As I design the frontend-backend communication, I’m facing a few architectural challenges and would love to hear your thoughts:

  • State Synchronization & Heavy Payloads: DiffSinger’s inference process generates a massive amount of 1D data sequences (pitch curves, breathiness, voicing, gender, etc.). If these dense arrays are sent directly to the front-end JS editor over WebSockets to draw the curves, it will cause catastrophic latency and UI lag.
    • My current idea: Convert these heavy 1D sequences into parameters of a Bézier curve on the backend, and only sync those lightweight control points to the frontend as the ground truth. Does anyone have better strategies for syncing heavy chart data via LiveView?
  • Bypass Caching & Incremental Generation: I plan to implement a step-level result-caching hook so that if only one downstream node changes, the entire pipeline doesn’t need to be re-run. I’ve temporarily shelved this due to the complexity, but it’s high on the roadmap.
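
A minimal sketch of what step-level caching could look like, assuming each step is a pure function of its inputs. None of this is Orchid's hook API: `CacheSketch`, `run_step/4`, and the map-based cache are hypothetical. The cache key is a digest of the step name and its input values, so a step only re-runs when something upstream of it actually changed.

```elixir
defmodule CacheSketch do
  # Hypothetical helper, not part of Orchid. The cache is a plain map keyed by
  # an MD5 digest of the step name and its (serialized) input values.

  # Returns {output, updated_cache, :hit | :miss}.
  def run_step(cache, name, inputs, fun) do
    key = :erlang.md5(:erlang.term_to_binary({name, inputs}))

    case cache do
      %{^key => output} ->
        # Same step, same inputs: reuse the stored result.
        {output, cache, :hit}

      _ ->
        output = fun.(inputs)
        {output, Map.put(cache, key, output), :miss}
    end
  end
end

# Stand-in for an expensive inference step.
double = fn [x] -> x * 2 end

{out1, cache, status1} = CacheSketch.run_step(%{}, :double, [21], double)
{out2, cache, status2} = CacheSketch.run_step(cache, :double, [21], double)
{out3, _cache, status3} = CacheSketch.run_step(cache, :double, [22], double)
```

Serializing large tensors just to compute a key may itself be costly, so a real implementation would probably hash something cheaper, such as the keys of the upstream results.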

I’d appreciate any feedback, PRs, or just general discussion on the architectural design!

For better precision at encoding the sequences as 2D curves you might want to take a look at non-uniform rational B-splines (NURBS) instead of Bézier curves. I’m not sure NURBS curves can be accurately described by SVG though for correct visual representation.

For the flow architecture, the best I’ve encountered is LabVIEW, which has been at it for decades. Other inspirations for architecture and UI could be Grasshopper or the flow scripting of Houdini.

(Most likely someone has already made graphs in Grasshopper to go from 2D points/ frequencies to precise curves. However while you can insert C or Python nodes I don’t know of any way to take an entire Grasshopper graph and get C or Python out. That would actually be quite interesting, but well, that is a subject for some other forum).

Thank you for your reply, it’s very professional and grounded in real-world industry experience.

First, regarding the design of the curve tool: it originates from Cadencii, which chose Bézier curves for its curve tool. That is much better than drawing by hand with the mouse, so I simply adopted it (the tutorials and demos are too old to find, but the source code does contain the Bézier tool).

At the time of my first post, I had just moved from higher-order Bézier curves to a chain of multiple cubic Bézier segments.

I originally intended to continue implementing this feature, but after receiving your reply, I discussed it with Gemini and found that B-spline curves are a better choice. On the one hand, this curve tool is meant not for 2D CAD drawings but for 1D time series; on the other hand, B-splines are better than Bézier curves when it comes to curve fitting (reducing a dense sequence of discrete parameters to a sparse set of control points).
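
To show why a sparse set of control points can stand in for a dense sequence, here is a small sketch that evaluates a uniform cubic B-spline over 1D control values. It covers only the evaluation side (rebuilding the dense curve), not the fitting. The module name `BSplineSketch`, the uniform knot spacing, and the sample pitch values are assumptions for illustration.

```elixir
defmodule BSplineSketch do
  # Evaluates a uniform cubic B-spline over 1D control values.
  # Illustrative only; fitting control points to data is not shown.

  # One segment, t in [0, 1], using four consecutive control values.
  # The four basis weights always sum to 6, hence the final division.
  def segment(p0, p1, p2, p3, t) do
    t2 = t * t
    t3 = t2 * t

    ((1 - t) * (1 - t) * (1 - t) * p0 +
       (3 * t3 - 6 * t2 + 4) * p1 +
       (-3 * t3 + 3 * t2 + 3 * t + 1) * p2 +
       t3 * p3) / 6
  end

  # Samples `per_segment` points on every segment of the curve.
  def sample(control, per_segment) when length(control) >= 4 and per_segment > 0 do
    control
    |> Enum.chunk_every(4, 1, :discard)
    |> Enum.flat_map(fn [p0, p1, p2, p3] ->
      for i <- 0..(per_segment - 1), do: segment(p0, p1, p2, p3, i / per_segment)
    end)
  end
end

# Hypothetical pitch control values in Hz.
control = [220.0, 220.0, 233.1, 246.9, 246.9, 261.6]
dense = BSplineSketch.sample(control, 50)
```

Here six control values expand to 150 samples, so only the six values would need to be synced; the dense curve can be rebuilt on whichever side has to draw it.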

I’ve done some studies on the projects you mentioned, but unfortunately, I lack both professional and work experience with this topic. So I’ll just say a few words.

Once a project based on visual programming reaches a certain level of complexity, it does indeed become difficult to do much more with it (including editing it and extracting runtime-agnostic code from it). In the case of the scheduling engine I wrote, I added several layers of wrapping and unwrapping to the data in order to implement data validation.

Regarding your point, I was actually quite apprehensive about it during the design process. Therefore, I anticipate that there should be a certain limit to the number of nodes that users can interact with (currently there is no such constraint, but I hope it will become a convention). If this limit is exceeded, a refactoring will be necessary (e.g., merging nodes at the code level or taking similar measures).

However, if this scheduling engine continues to develop, it may become possible to merge some nodes into an AST/code using macros after a large-scale API refactoring. Maybe in the future someone with a blue, pink, and white flag background and an anime-style avatar will be able to do this.

Let me add some updates on the current progress. The post from the 5th was generated by AI after the PoC pipeline was done, and I didn’t review it carefully because I was a bit overwhelmed.

The original idea for this post is described in Designing and Scaffolding of an Online Editor - OwnSpace, but the content is not written in English. If anyone wants to read it, you’ll need a translation plugin or an LLM.

Specifically, the idea comes from this passage:

To search and organize models in the voicebank library, then determine the dependencies between models, thereby achieving stronger compatibility.

However, in the case of DiffSinger, it’s a typical serial task rather than a DAG schedule (appended code from current livemd script).

Its dependencies are not complicated, but the internal architecture of the models is quite complex.

It’s easy to understand why some core members of OpenVPI (the main maintainers of DiffSinger) are pessimistic about it.

I used to naively believe that as long as I persevered, I could always solve any problem. But now I realize that some problems can’t even be called problems because they are meaningless; their only purpose is to prove that “they are meaningless”.

The editor will probably continue to be developed, as it’s an idea I’ve had since high school, but I won’t be as obsessed with visual programming anymore; it will likely be developed more as a feature.

Maybe, perhaps, that’s all.

It’s fine to use LLMs while coding but I don’t want to read posts here generated by AI. If you were overwhelmed, you could have taken a few days to think what you, as a presumed human, wanted to say. If the AI wants to post, there are forums for that.


I’m not familiar with diffsinger, or the problem space, but I have used visual programming a fair bit over the years for various tasks. So I think I can say something about that part at least. Indeed I have earlier mused that many Elixir programs, with their modules and connections visualized with a Mermaid graph, could look very familiar to visual programming.

I would not choose visual programming for the UI alone. It can be very elegant with small programs, but for big programs it can easily turn into real-life visual spaghetti code. At that point, making components, color coding, and/or framing entire areas of code with what they actually do is a good idea. Then you have bigger components, which again hide detailed complexity and allow an easy overview.

My experience is that visual programming has some strong points:

  1. Isolation, reuse and making components. This is just trivial, and the reuse is real.
  2. Refactoring. Again, often just trivial cut and paste and move around.
  3. Parallelism. You can make independent parallel paths of visual code and you will get programs that run that code in parallel. (Depending on actual hardware, and not all visual programming languages do this equally well). Back in the day I did FPGA programming using LabVIEW, and it was like painting code onto actual hardware.
  4. UI. You see your program flow very easily. Depending on the language programming concepts like loops can be less elegant or just not possible at all.
  5. Debugging. Issues are often highlighted at the exact spot or module. Data comes in, but data doesn’t come out. Making a type breaking connection is usually impossible and thus found before even trying to compile. (Again, depending on language).
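
The parallelism point maps fairly directly onto Elixir. As a hedged sketch, the independent branches of a graph can be handed to `Task.async_stream/3`; the three branch functions below are hypothetical stand-ins, not real encoders, and simply share the same input.

```elixir
# Three hypothetical, independent branches that all read the same input.
branches = %{
  duration: fn words -> Enum.map(words, &String.length/1) end,
  pitch: fn words -> Enum.map(words, &String.upcase/1) end,
  variance: fn words -> length(words) end
}

words = ["la", "li", "lu"]

# Each branch runs in its own process; results are collected into a map
# keyed by branch name, so completion order does not matter.
results =
  branches
  |> Task.async_stream(fn {name, fun} -> {name, fun.(words)} end, ordered: false)
  |> Map.new(fn {:ok, {name, value}} -> {name, value} end)
```

How much this helps depends on the hardware and on whether the branches are really independent; a branch that waits on another one's output gains nothing.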

Like Elixir, visual programming is also about transforming data through visible steps, with connections between independent modules. As in functional programming, the variables don’t change, so if you want to reuse the value of a variable from earlier in the graph, just make a connection to it at that stage.

The bypass caching & incremental generation you mention is a necessity during development of big graphs, or any minor change will end up triggering entire graph recalculations.

I don’t see why the internal architecture of components would make a visual UI for composing them any harder. The components should be independent, with explicit inputs and outputs only, so what goes on inside is just their business. Indeed, many visual languages make it easy to use different languages inside components to better suit each component’s task. For composition, the inputs and outputs, with types and data shapes, should be enough.


Okay. I hope real people’s grammatical errors and typos will be met with a high tolerance. I’ll try to speak for myself rather than use an LLM.

I should call you a senior in this field, so my reply is simply to share my ideas.

Actually, for this scenario (running inference on DiffSinger’s ONNX models through a visual interface), the problem is not that it’s difficult; it’s that it’s meaningless. The inference task doesn’t fit the approach (separating procedures into isolated steps, then integrating those building blocks as visual nodes).

There are two aspects to this.

  1. Although there are different models, they are not completely decoupled the way Stable Diffusion models are, because the models in a single voice bank are all trained on the same dataset (usually one person recorded singing some songs, with annotations attached afterwards). Because of this diversity and uncontrollability, something like applying A’s variance model to B’s acoustic model may produce an unpleasant voice. In a nutshell, the models’ dependencies are relatively tightly bound.
  2. See the first graph in the 6th reply. The inference is actually a serial task: Pitch Predict requires the phoneme durations from Duration Predict (I chose the module names arbitrarily, so I’m not using them here), and so on. So it’s like executing the serial tasks in the following Gantt chart.

Next, the community, i.e. the expected users of this hypothetical product.

SVS (Singing Voice Synthesis) is quite niche, and it is one of the few areas where creators and artists are not very hostile towards AIGC.

There are two reasons. First, the training sets are traceable, so there is less copyright controversy than with AI image generation and the like.

Second, current editors give creators enough room to make the work their own. My PoC uses an end-to-end pipeline, but many parameters can be modified manually in editors (Vocaloid, OpenUTAU, Synthesizer V, etc.). Being able to control the result gives creators stability.

My approach to visual programming builds on this second reason, which has also guided me a lot: it is for people who can’t code well enough but have ideas for fully customizing a flow or procedure.

Coming back to SVS: I don’t know what the community is like in other countries and regions, but here it falls under the broader Vocaloid community. Technology, song tuning, and artwork are all mixed together, and many immature minors are involved. The boundary between producers and audience is extremely blurred.

For an editor, the most important way to attract users is a simple UX, but once you apply visual programming, it’s not simple at all. Frankly, I don’t know how to make that trade-off.

From a “rational man” perspective, this is actually a marketing issue: the target audience has weak purchasing power. Moreover, the customer base is VERY EMOTIONAL, so if something happens and the PR is not handled well, it can easily provoke a public backlash.

But if the goal is just to wrap an audio-processing toolkit in a WebUI and build a long-running ETL process that executes automatically, it’s perfectly suitable. That’s what the BEAM does well.

There are too many topics here for me to sort out:

  1. Implementing data-flow-based visual programming for Elixir and making the UI, I assumed with LiveView. That is what your first post started with, and as I have some experience with visual programming, that is what I’ve replied to. And no, it is not simple to make a good UI or backend for that. LabVIEW, Houdini, Maya, Grasshopper and many more have been at it for decades with corporate resources behind them. And they still have work to do.
  2. Your practical context, which you said was meant to avoid purely theoretical discussion. The chosen example is not my field, but the Bézier curves you suggested cannot represent every curve shape that might be needed, so I suggested curves that can. Apart from that, I have no idea of the practical example’s problem space.
  3. Your Mermaid graph is indeed a graph, and to the extent an end user needs to change that graph some visual approach makes sense to me.
  4. ONNX, AI and training. OK, but hardly centered on Elixir visual programming or its UI? It seems to me that implementing the practical example is the actual topic of your thread.
  5. Communities, countries, technology, tuning, artwork, IP rights, marketing, emotional customers, public opinions and backlash. I have nothing more to contribute here I think.