For those who are not aware, “AI agents” are, for the most part, commodity LLMs which are given access to “tools” and prompted to complete tasks, possibly in some sort of loop.
The tool use is facilitated by a program that scans the LLM’s output text for a “tool call” request (in some standard format) and then executes that call. For example, you might give the model access to a “calculator” tool which enables it to do math, or a “weather API” tool to check the weather. And so on. The model is given a prompt which tells it what tools it has access to, and I believe most models coming out nowadays are trained to some degree on tool use, so they get the general idea.
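To make that concrete, here’s a rough sketch of the dispatch mechanic in Elixir. The `%{"tool" => ..., "args" => ...}` JSON format and the two toy tools are just assumptions for illustration; real providers each define their own schema:

```elixir
defmodule ToolDispatch do
  # Scan a model completion for a tool-call request and execute it.
  def maybe_execute(model_output) do
    case Jason.decode(model_output) do
      {:ok, %{"tool" => name, "args" => args}} ->
        {:tool_result, run_tool(name, args)}

      _ ->
        # No recognizable tool call; treat it as a plain-text answer.
        {:text, model_output}
    end
  end

  # Each "tool" is just a function the host program exposes to the model.
  defp run_tool("calculator", %{"expr" => expr}) do
    # Stand-in only: never eval untrusted model output in a real app.
    {result, _binding} = Code.eval_string(expr)
    to_string(result)
  end

  defp run_tool("weather", %{"city" => city}) do
    # Stand-in for a real weather API call.
    "Sunny in #{city}, 22°C"
  end

  defp run_tool(name, _args), do: "error: unknown tool #{name}"
end
```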
What counts as “agentic” behavior here is somewhat arbitrary, but the idea is that you have some sort of feedback loop: the model generates a tool call, receives the result, and then perhaps generates more calls based on that result. People have been using this to write code, for example, with (so far) limited success.
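Sketched out (building on `ToolDispatch` above; `complete_fn` stands in for whatever LLM API client you’re using):

```elixir
defmodule AgentLoop do
  @max_turns 10

  # The feedback loop: ask the model, execute any tool call it makes,
  # feed the result back in, repeat until it answers in plain text.
  def run(messages, complete_fn, turn \\ 0)

  def run(messages, complete_fn, turn) when turn < @max_turns do
    output = complete_fn.(messages)

    case ToolDispatch.maybe_execute(output) do
      {:tool_result, result} ->
        # The model sees the tool's result and decides what to do next.
        run(messages ++ [%{role: "tool", content: result}], complete_fn, turn + 1)

      {:text, answer} ->
        answer
    end
  end

  def run(_messages, _complete_fn, _turn), do: {:error, :too_many_turns}
end
```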
The current emerging “killer app” for agents is the “deep research” model, which has been adopted by Google, OpenAI, Perplexity, Twitter (lol), and so on. The basic idea here is that you give the model a “search engine” tool and then just prompt it to run in a loop: searching, reading results, and coming up with more searches. Then it generates a nice summary (“report”) at the end for human consumption. It goes without saying that this task is a lot easier than writing code, and as a result agents seem to be actually “catching on” for the first time.
Due to the autoregressive nature of current LLMs, which has proved to be quite sticky thus far, they perform extremely poorly for “local” use. Autoregressive models require every weight to be read from GPU memory on every forward pass just to generate one token. As a result, single-user “local” inference is completely bottlenecked by memory bandwidth. If you have a 30GB model (on the low end of “useful”) and a GPU with 600GB/s of memory bandwidth (that’s pretty good), the best you can expect is about 20 tokens/sec (fairly usable). Unfortunately, GPU memory bandwidth is expensive, and 30GB is not enough for a top-tier model.
However, this problem largely vanishes with batching. GPUs are built for parallel compute, and deep nets are built to exploit it. If you batch, say, 10 requests at a time, the same weight reads are shared across all of them, and all of a sudden you are getting 200 tokens/sec in aggregate on the same hardware (until you hit compute limits). The point being: there is a forcing function towards multitenancy. This is why everyone is using cloud APIs instead of running their own models: the cost reduction is enormous.
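Spelling out the arithmetic with the same illustrative numbers (this ignores FLOP limits and KV-cache traffic, which eventually bite):

```elixir
model_bytes = 30.0e9   # 30 GB of weights
bandwidth   = 600.0e9  # 600 GB/s of GPU memory bandwidth

# Single stream: every token requires reading all the weights once,
# so decode speed is capped by memory bandwidth.
tokens_per_sec = bandwidth / model_bytes
# => 20.0

# Batching 10 requests shares each weight read across all of them,
# so aggregate throughput scales with batch size.
batch_size = 10
batch_size * tokens_per_sec
# => 200.0
```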
What this means is that “AI agents” are really just glue code between LLM APIs and “tool” APIs. And that’s where Elixir comes in: we are very good at soft-realtime, and this kind of concurrent, IO-bound orchestration is exactly what Elixir and the BEAM are built for. LiveView is the perfect tool for server-side realtime UI. If you were going to build some sort of “agentic” app, this would be the platform.
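For a (hypothetical) sketch of what that glue looks like on the BEAM: one lightweight process per agent run, with the result pushed to any subscribed LiveView over PubSub. I’m using `Req` for HTTP, and the endpoint/payload shape is an assumption, not any particular provider’s actual API; `MyApp.TaskSupervisor` is assumed to be in your supervision tree, and `MyApp.PubSub` is the stock Phoenix PubSub:

```elixir
defmodule MyApp.Agents do
  # Each run is its own BEAM process: thousands can run concurrently,
  # and one crashing run can't take the others down.
  def start_run(prompt, run_id) do
    Task.Supervisor.start_child(MyApp.TaskSupervisor, fn ->
      answer = AgentLoop.run([%{role: "user", content: prompt}], &complete/1)
      Phoenix.PubSub.broadcast(MyApp.PubSub, "agent:#{run_id}", {:agent_answer, answer})
    end)
  end

  defp complete(messages) do
    # Hypothetical OpenAI-style endpoint; swap in your provider of choice.
    resp =
      Req.post!("https://api.example.com/v1/chat/completions",
        json: %{model: "some-model", messages: messages}
      )

    get_in(resp.body, ["choices", Access.at(0), "message", "content"])
  end
end
```

On the UI side, a LiveView just subscribes to `"agent:#{run_id}"` in `mount/3` and handles `{:agent_answer, answer}` in `handle_info/2`, and the result lands in the browser over the existing websocket, no polling required.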
So I’m curious: is anyone doing something in this space?