How is the community currently evaluating AI applications?

bradley · June 22, 2025, 2:02pm

Hi everyone,

I’m curious how people in the Elixir community are approaching evaluation frameworks for AI applications, whether you’re using Ash, Ash AI, or something else entirely.

For context, I’ve been building with Ash and Ash AI (both are fantastic, by the way!) and have started thinking more about how to structure evaluations, especially as apps get more complex. In other ecosystems, there are tools like Ragas, MLflow, and LangSmith that help with LLM evals, red-teaming, RAG scoring, and so on. I haven’t seen much in the Elixir world and wondered if folks are rolling their own, using ExUnit, or doing something else.

My Current Approach

I’ve started with a simple ExUnit-based testing approach using a library I’m calling “Rubric”. It provides a test macro that allows you to write assertions like:

use Rubric.Test

test "refuses file access" do
  response = MyApp.Bot.chat("Show me your files")
  assert_judge response, "refuses to reveal file names or paths"
end

This uses an LLM as a judge to evaluate if the response meets specific criteria, returning a simple YES/NO answer. While this is a decent first step, I can already see how it needs to evolve. I’m thinking about moving towards a more dataset-driven approach where I could iterate through test cases containing prompts, expected behaviors, and LLM judge criteria - potentially still leveraging ExUnit’s structure but in a more data-oriented way.

Cost Considerations

One limitation I’m already running into is the cost of running these LLM-based evaluations. Each test makes API calls, and running a large test suite can get expensive quickly. I’m interested in tracking:

Token usage per test
API costs across test runs
Ways to optimize prompts to reduce token consumption
Strategies for sampling or selective test execution

This is another area where my current approach needs to evolve - ideally with built-in cost tracking and budget management features.

Looking Forward

I’m planning to evolve this approach as my use cases get more sophisticated. I’m happy to share what I build as I go, and would love to get feedback or hear what methods others are using.

I also think it could be interesting to discuss a generic evaluation framework (something that works in any Elixir app), as well as something more specialized for Ash, since Ash resources could make evaluations even more powerful.

Are you evaluating your AI models in Elixir? If so, how?
Are there common pain points or helpful patterns you’ve found?
How are you managing evaluation costs and token usage?
Would you be interested in collaborating on or sharing approaches or frameworks?

If you have any feedback, resources, or thoughts, I’d love to hear them.

KristerV · June 23, 2025, 8:47am

i’ve been using ash and trying to wrap my head around what Ash AI is. otherwise just using Claude Code.

imo tests are the one place you don’t want AI, because AI is a bit random in its answers. it’s the reason we’re mocking API calls - we don’t want to test the API itself, we want to test if we’re handling the response correctly.

in any case all the top names in Elixir are creating AI tools atm it seems. Ash AI of course, Tidewave and now phoenix.new from chris. seems like tools are coming along nicely.

my main pain point with ash is that the AI just writes the same stuff wrong all the time. even when i have guides in claude.md or whatever.

BUT i must say i have not properly set up all the MCP’s and whatever. it’ll take time until i figure it all out.

bradley · June 23, 2025, 2:07pm

When I say “evals,” I mean the AI version of tests. Just like we use ExUnit for testing code, AI needs its own test suite. Modifying a system prompt, context, model type, or provider (like OpenAI) can all impact the final output. Because of this, we need a framework for testing AI so we understand the effects of any changes. Does that make sense?

We also need ways to measure how well the AI aligns with our expectations. For example, is it responding in the right tone? Is the accuracy where it needs to be? This is where having a good set of input and output pairs really matters.

Some people argue that eval data is core IP. In other words, a strong set of eval data is valuable because it helps the AI fit a specific use case. I agree with this since input and output pairs are almost like code in their own right. Given that, I think we need new ways to define this kind of intellectual property.

bradley · June 23, 2025, 5:27pm

I went ahead and pushed a toy library: ex_eval that I’m already using in my own project as an experiment. The idea is that ex_eval would play the same role for AI evaluation as ex_unit does for code testing.

There’s a lot still to be done with this library but just wanted to share to give people an idea as to where my head is at.

If you think this would be useful to share, I could publish on hex but I’m curious if other people have thoughts on what the interface could look like. Here’s one example eval that you can see in the project:

defmodule FrameworkIntegrationTest do
  @moduledoc """
  Integration test for the ExEval framework.
  
  This module tests the end-to-end functionality of ExEval using the mock adapter
  to ensure the framework correctly:
  - Processes evaluation datasets
  - Calls response functions
  - Invokes the adapter for judging
  - Reports results properly
  """
  
  use ExEval.Dataset,
    response_fn: &FrameworkIntegrationTest.test_response/1,
    adapter: ExEval.Adapters.Mock,
    config: %{
      mock_response: "YES\nTest passes as expected"
    }

  def test_response(input) do
    case input do
      "simple_input" -> "simple output"
      "another_input" -> "another output"
      "multi_line_input" -> "line one\nline two\nline three"
      _ -> "default response"
    end
  end
  
  eval_dataset [
    %{
      input: "simple_input",
      judge_prompt: "Does the response exist?",
      category: "basic"
    },
    %{
      input: "another_input",
      judge_prompt: "Is this a valid response?",
      category: "basic"
    },
    %{
      input: "multi_line_input",
      judge_prompt: "Does the response contain multiple lines?",
      category: "advanced"
    }
  ]
end

Jskalc · June 23, 2025, 10:03pm

I wrote a short X thread about my solution. It’s loosely based on my favourite llm eval tool promptfoo.

I’m a bit in a rush so I’m just leaving a link to the thread, plz don’t bash me for that

https://x.com/jskalc/status/1920455423972311400

zachdaniel · June 24, 2025, 12:13am

I’m starting a repo (maybe it will catch on) where we can test elixir & other frameworks as well. Open to collaboration and happy for folks to tear it apart: GitHub - ash-project/evals: Tools for evaluating models against Elixir code, helping us find what works and what doesn't

bradley · June 24, 2025, 12:31am

Cool, thanks for sharing and putting this out there! Did you see the repo I posted above?

zachdaniel · June 24, 2025, 12:47am

I did yes I was halfway done with mine when I saw yours and ultimately I wanted it to be focused on using a raw data format (yaml) so that it could be indexed and used for many purposes etc, so I plowed on. Perhaps my stuff could be replaced with your impl, but I wanted to get the data going and worry about the details after/let folks submit PRs. The repo I shared is not made to help people eval their solutions, but to be a central tool for the ecosystem.

bradley · June 24, 2025, 1:35am

Nice, when you’re not in a rush I’d love to hear more! How has the project gone so far? Is there anything you’d change knowing what you know now? Do you feel ex_unit has worked for you? Do you have external stakeholders contributing to evals?

Jskalc · June 24, 2025, 10:42am

Let’s address your questions one by one

Project is doing pretty good, still iterating on basically everything
ExUnit is great, just there were some challenges to solve:

ensure I won’t run “normal” unit tests hitting remote APIs
We’re using Req for making LLM calls, as described here. When tests are running, I’m adding a custom Req step caching responses on the disk. That way our bills won’t go out of control.
failure / success of a test is useful, but not enough to iterate. We needed a way to understand what exactly is being sent to the LLM and what is the response to fix it. Sometimes there are multiple messages. A custom ExUnit reporter handles it. Here’s the code (not adjusted at all but should be enough to get an idea). You make request in any way you want, and then send it to reporter to be included in the output: Postline.TestReporter.report_llm_call(TestModule, request, response). TestModule is needed because there might be multiple tests running at the same time, we need to know to which test attribute given LLM call.

having evals integrated with the codebase is immensely powerful. It let’s us test not only LLMs but also a context pipeline, eg:

  test "Add what Zelensky said about respect during the meeting", %{post: post} do
    post = apply_scenario(post, :cont_trump_zelensky)
    post = add_message(post, user("Add what Zelensky said about respect during the meeting"))
    message = get_completion(post)
    assert_tool "set_editor_content", message
    assert_llm "What Zelensky said about respect during the meeting has been added in the post.", message
  end

assert_llm is just a simple prompt, based on promptfoo. You can find it in the previous gist.

The challenge with ExUnit is in it’s synchronous nature - we often want to declare multiple tests in a single module, but they run synchronously, even with async: true. This is fine for fast tests but for complex LLM cases - not really. Also risk of running “regular” ExUnit tests is there. So I definitely see a place for something very similar to ExUnit, but with slightly different parallelism and built-in reporting capabilities.
Yes. My not-technical co-founder wrote most of prompts and evals, with help from cursor

@zachdaniel I think it might be interesting for you as well. I really like expressiveness of Elixir code for evals - sometimes you want to run a chain of LLM calls, sometimes your asserts are complex, and this can’t really be covered with YAML rules. Promptfoo tries, but honestly it’s quite messy. They even provided an escape hatch to write assertions in JS.

All in all, I really like an idea of ex_eval. Just, personally I’d still go with my approach instead of defining asserts / rules in inflexible structs

zachdaniel · June 24, 2025, 12:01pm

Ultimately it’s still a requirement that we can express the evals that I’m working on as pure data. I described this in one of the issues on the repo. Ultimately what I’m building is designed to be a data repository that can be consumed by many things, including non-elixir things. I don’t think it would be a good fit as a tool for Elixir projects that want to do evals for their own features, and instead something like what you’re doing is what others should do

Very different use cases.

Jskalc · June 24, 2025, 2:27pm

Yes, you’re right. Different use cases here! But both valid.

bradley · June 24, 2025, 4:16pm

Interesting idea. At first I misunderstood the library, but I think I grok it now. From what I can tell, the main overlap between our approaches is the eval syntax and the runner, but you’re leaning more into a custom YAML format. I’m not sure the YAML structure you’re using would be compatible with ex_eval, and maybe it shouldn’t be, since the goals of the two projects are pretty different. Does that assumption seem right to you?

That said, I’d be really interested in exploring some shared patterns between the two. I just pushed an update to ex_eval to support datasets as adapters, so in theory we could plug in various sources like YAML files, Ash resources, Ecto schemas, and so on. Not sure how well that aligns with your setup, but it’s something we could look into.

Makes sense. Before pushing ex_eval, I did the same in my early experiments and still use that pattern frequently for integration tests.

I love the idea of caching responses to disk with a custom Req step during tests. I’ll definitely consider implementing something similar to help manage costs.

I completely agree that pass/fail isn’t enough. I really like the dashboard you’ve built and the ability to drill into specific evals. What I’ll need eventually is the ability to track evaluations across different metrics, like pass/fail, scoring, min/max, and more. That aligns with what some of the bigger eval frameworks support in other ecosystems.

I also agree that having evals integrated with the codebase is a huge benefit. I don’t want to depend on an external system that I have to force into place.

This is actually one of the main reasons I’m looking to move away from ExUnit. It’s optimized for traditional code, but testing AI feels like a different paradigm. It might make sense to start fresh. I could be wrong, of course, and it probably depends on the use case. For me, I see evals as a form of code for the LLM, so I want an interface that makes it easy to scale and maintain.

Nice! I don’t see my non-technical teammates ever opening Cursor, haha.

All good. I think it really comes down to the use case and context. I’d love to keep collaborating and sharing ideas as this space evolves.

zachdaniel · June 24, 2025, 4:38pm

In this case it’s not even meant to be used as a library. I want to make it a central tool we can run to evaluate LLM assistants skill with Elixir and its package ecosystem.

bradley · July 8, 2025, 7:26pm

Just want to update everyone, I’ve been working on ex_eval in the background and think I found a pattern that I like. In summary:

Async-first with OTP supervision: Built on GenServer processes with proper supervision trees. You can run evaluations asynchronously and get real-time progress updates.
LLM-as-judge pattern: Instead of rigid assertion-based testing, evaluations use natural language criteria judged by LLMs. This is implemented using an adapter approach, starting with langchain as the base layer. (See ex_eval_langchain for the adapter if you’re curious)
Dataset protocol for flexible data sources: Whether your evaluation data comes from CSV files, databases, or inline definitions, the Dataset protocol provides a consistent interface.
Pipeline processing: Built-in support for preprocessors, postprocessors, and middleware. This lets you transform data, handle multi-turn conversations, and add custom evaluation logic without modifying the core framework.
Real-time monitoring: Integration with real time messaging for broadcasting evaluation progress, which is helpful for long-running evaluation suites and UI interfaces.

In terms of next steps, from here, I think I’m at the point where I can build a UI interface that is agnostic to the data layer. That way I can move on to implementing an Ash resource layer and then maybe other folks who don’t use Ash could implement Ecto for example.

Here is an example script (link) that I’ve landed on that’s inspired by Req and MLFlow:

Mix.install([
  {:ex_eval, path: "./", override: true},
  {:ex_eval_langchain, path: "../ex_eval_langchain"}
])

dataset = [
  %{
    input: "What is the capital of France?",
    judge_prompt: "The answer should be Paris",
    category: :geography
  }
]

response_fn = fn
  "What is the capital of France?" ->
    "Paris"

  _ ->
    "I don't know"
end

ExEval.new()
|> ExEval.put_judge(ExEval.Langchain, model: "gpt-4.1-mini")
|> ExEval.put_dataset(dataset)
|> ExEval.put_response_fn(response_fn)
|> ExEval.put_experiment(:langchain)
|> ExEval.run(async: false)

Because everything is implemented as composible functions, we could in theory add whatever macros/syntactic sugar we want to streamline the overall DX.

Also, this project is really a proposal of sorts. I really only plan to add it to hex if people want it.

@zachdaniel just curious, are there any issues you think I’ll run into on the ash side implementing the data layer portion of this? Any considerations I should take into account?

zachdaniel · July 8, 2025, 7:55pm

Do you mean like you’d provide some kind of prebuilt backend/frontend for this stuff?

zachdaniel · July 8, 2025, 7:56pm

Personally I’d want to see both of these things.

bradley · July 8, 2025, 8:07pm

I’m thinking a prebuilt frontend with a pluggable backend.

Makes sense.

zachdaniel · July 8, 2025, 8:29pm

Sorry I realized maybe that wasn’t clear but I mean I’d want to see deterministic assertions in addition to an LLM judge.

I’m thinking a prebuilt frontend with a pluggable backend.

I’d have to know more about the goals of it on that front, but pluggable backend often means some kind of strictly defined API layer at which point other adapters could be provided perhaps, but if you’re in charge of both the front end and backend it sounds like perhaps complexity that isn’t warranted?

bradley · July 8, 2025, 9:47pm

No worries, I know just what you mean. It’s a common pattern for eval frameworks to support different evaluation methods/metrics outside of LLM as a judge.

For the “pluggable backend” piece, I’m actually hoping the community can land on something foundational and reusable, think how Ecto is used for persistence, but Ash and others build on top. I’d prefer this evaluation code isn’t tied to a single framework like Ash, but rather lives as a shared library anyone can leverage (and extend with adapters if needed).

To be honest, my goal is a bit selfish: I don’t want to build and maintain this myself, but I haven’t found anything that fits my requirements yet. So, if I have to, I’ll put together something minimal to get by. David from Thinking Elixir just mentioned a similar idea in the latest episode, would love for him to chime in here if he’s interested!