An Elixir library for building multi-agent LLM applications. ExAgent abstracts calls to various LLM providers (OpenAI, Gemini, DeepSeek) via an extensible Protocol and orchestrates them using OTP primitives with four multi-agent design patterns: Subagents, Skills, Handoffs, Router.
Features
Protocol-based LLM abstraction — Swap providers without changing application code
Built on OTP — Agents backed by GenServers, supervised processes, async Tasks
Automatic tool execution — Define tools once, the agent loops LLM calls until complete
Please tag the github branch so that </> links from hexdocs work.
[edit]
AFAICT, chat with thinking models would not return until the full result comes. That’s not exactly what one expects from chat_async/3.
I find the library very useful in general and covering some my needs in particular, but the approach to asynchronous integration of assistants is clumsy requires a lot of improvement, IMHO.
If I understood correctly, it sounds like the concern is more about streaming vs non-streaming behavior, rather than thinking models themselves.
If that’s the case, I agree that chat_async/3 can feel limiting for streaming scenarios, since you’d need to keep the connection open to receive partial results. The current approach, however, only returns once the full response is ready which is the most common scenario, especially when you need structured outputs like JSON (since partial JSON doesn’t work well).
On the other hand, for non-streaming use cases, the distinction between thinking and non-thinking models doesn’t really affect the async behavior. The main difference is just additional tokens (like <thinking> in models such as DeepSeek R1), which can be filtered out programmatically if needed.
Happy to discuss improvements or explore alternative approaches here.
Right, but non-thinking model’s responses cannot be streamed that’s why I found both statements are describing the same issue.
Why do you think it’s the most common scenario? Why do you think partial JSON doesn’t work well? If the latter was the case we would never have had streaming JSON parsers and yet nearly all of those are streaming.
Yes, the sole purpose of this token is to make the streaming possible.
The alternative approach is evident: chat/3 function uses a deferred call to stay blocking (handle_call/3 returns {:noreply, state} and spawns a Task.) The caller of chat/3 should provide a listener, which receives portions of the stream. After the response is fully received, the spawned task calls reply/2 to accomplish a message loop in the Agent and dies.
The next iteration would be use a pool of ExAgent.Agent because in my scenario this library might handle requests from many clients and the sole Agent might quickly become a bottleneck.
Actually, non-thinking models stream perfectly fine as well.
The “issue” isn’t that they can’t stream it’s that when they stream JSON, the data is technically “broken” until the very last character (the closing brace }) is received.
While streaming JSON parsers exist, they don’t magically make the data usable in the way most applications need. Here is why it’s considered the “most common scenario” for failures:
• Syntax Invalidity: At any point during the stream, the string is likely invalid. For example, if the model is halfway through a key: {"user_na. A standard JSON.parse() will throw an error immediately.
• State Management: To use partial JSON, you need a specialized “Recursive Reducer” or a “Partial Parser” These tools try to “close” the JSON manually (e.g., adding " and } to whatever is there) so you can read it.
• Incomplete Values: If you are streaming a UI and the model sends “description”: “The weather is rea”, your parser might give you that string, but your application logic might not know what to do with a half-finished sentence.
Feel free to push a PR for this async feature as long as it handles both cases stream / non-stream it will be quite welcome.