Aludel - LLM prompt evaluation workbench

Aludel is an embeddable Phoenix LiveView dashboard for evaluating and comparing LLM prompts across multiple providers (OpenAI, Anthropic, Ollama) simultaneously. It helps developers test prompt quality, track costs, and catch regressions with automated evaluation suites.

What it does

Run the same prompt across different LLM providers side-by-side and compare:

  • Output quality — See responses from GPT-4, Claude, and local Ollama models together
  • Performance metrics — Latency, token usage, and cost per request tracked in real-time
  • Evolution tracking — Visualize how prompt versions perform over time with pass rates, cost, and latency trends
  • Regression testing — Automated evaluation suites with assertions (contains, regex, exact_match, json_field)
  • Prompt versioning — Immutable prompt versions with {{variable}} interpolation
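
For illustration, the `{{variable}}` interpolation mentioned above can be sketched in a few lines of Elixir. This is a hypothetical sketch (module and function names are made up), not Aludel's actual implementation:

```elixir
defmodule TemplateSketch do
  # Replace each {{name}} placeholder with the matching value from a map.
  # Hypothetical helper, not Aludel's actual interpolation code.
  def render(template, vars) do
    Regex.replace(~r/\{\{(\w+)\}\}/, template, fn _whole, name ->
      Map.get(vars, name, "")
    end)
  end
end

TemplateSketch.render("Explain {{topic}} in exactly 3 sentences.", %{"topic" => "OTP"})
# => "Explain OTP in exactly 3 sentences."
```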

Key features

  • Multi-provider execution — Send one prompt to OpenAI, Anthropic, and Ollama concurrently. Results stream in real-time.
  • Cost tracking — Automatic cost calculation based on token usage and provider pricing.
  • Evaluation suites — Visual test case editor with document attachments (PDF, images, CSV, JSON, TXT). Run automated assertions against LLM responses.
  • Dashboard — Live metrics as runs execute: cost trends, latency, and per-provider performance.
  • Local-first option — Works with Ollama out of the box (no API keys required). Add cloud providers optionally.
  • Embeddable — Add to any existing Phoenix LiveView app as a self-contained dashboard, or run standalone.
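
As a rough sketch of the concurrent fan-out idea (illustrative only; `call_provider` below is a caller-supplied stand-in for the real OpenAI/Anthropic/Ollama clients, not Aludel's API):

```elixir
defmodule FanOutSketch do
  # Send one prompt to every provider concurrently and collect results as
  # they finish, timing each call. Names here are hypothetical.
  def run(prompt, providers, call_provider) do
    providers
    |> Task.async_stream(
      fn provider ->
        {micros, output} = :timer.tc(fn -> call_provider.(provider, prompt) end)
        %{provider: provider, latency_ms: div(micros, 1000), output: output}
      end,
      ordered: false,
      timeout: 30_000
    )
    |> Enum.map(fn {:ok, row} -> row end)
  end
end
```

With `ordered: false`, rows arrive as each provider finishes, which is what lets results stream into a LiveView UI as they complete.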

Example workflow

# 1. Create a versioned prompt template
"Explain {{topic}} in exactly 3 sentences."

# 2. Run across 3 providers simultaneously
#    - Ollama (llama3, local)
#    - OpenAI (gpt-4o)
#    - Anthropic (claude-sonnet-4)

# 3. View side-by-side comparison in real-time:
# Provider       | Latency | Tokens  | Cost     | Output
# Ollama Llama3  | 1,234ms | 45/123  | $0.0000 | ...
# OpenAI GPT-4o  | 856ms   | 52/145  | $0.0019 | ...
# Claude Sonnet  | 1,102ms | 48/138  | $0.0018 | ...

# 4. Create evaluation suite with assertions
#    - Assert output contains "three sentences"
#    - Assert output matches regex pattern
#    - Run regression tests on prompt changes
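
The four assertion kinds could look something like this in Elixir. A hedged sketch only; the map shapes and the use of the built-in `JSON` module (Elixir 1.18+) are my assumptions, not Aludel's schema:

```elixir
defmodule AssertionSketch do
  # One clause per assertion kind. Map shapes are illustrative.
  def check(%{type: :contains, value: v}, output), do: String.contains?(output, v)
  def check(%{type: :exact_match, value: v}, output), do: output == v
  def check(%{type: :regex, value: v}, output), do: Regex.match?(Regex.compile!(v), output)

  def check(%{type: :json_field, path: path, value: v}, output) do
    # Decode the response and compare a nested field.
    case JSON.decode(output) do
      {:ok, decoded} -> get_in(decoded, path) == v
      _ -> false
    end
  end
end
```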

Use cases

  • Prompt engineering — Test variations across providers to find the best prompt/model combination
  • Cost optimization — Compare pricing and quality trade-offs between providers
  • Quality assurance — Automated regression testing when updating prompts or switching providers
  • Provider evaluation — Benchmark performance, cost, and quality across OpenAI, Anthropic, and local models
  • Offline development — Use Ollama for local development without API costs
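
To make the cost-optimization point concrete: per-request cost is just token counts times a per-million-token rate. A back-of-the-envelope sketch (the prices below are placeholders, not real provider pricing; always check the provider's pricing page):

```elixir
defmodule CostSketch do
  # USD per 1M tokens; placeholder numbers only.
  @prices %{
    "gpt-4o" => %{input: 2.50, output: 10.00},
    "claude-sonnet-4" => %{input: 3.00, output: 15.00},
    "llama3" => %{input: 0.0, output: 0.0} # local Ollama model: free
  }

  def cost(model, input_tokens, output_tokens) do
    p = Map.fetch!(@prices, model)
    (input_tokens * p.input + output_tokens * p.output) / 1_000_000
  end
end
```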

Installation

Aludel can be embedded into any Phoenix LiveView application or run standalone.

As a dependency (embedded mode)

# mix.exs
def deps do
  [
    {:aludel, "~> 0.1"}
  ]
end

# config/config.exs
config :aludel, repo: YourApp.Repo

# lib/your_app_web/router.ex
import Aludel.Web.Router

scope "/dev" do
  pipe_through :browser
  aludel_dashboard "/aludel"
end

# shell
mix aludel.install  # Copy migrations
mix ecto.migrate
mix aludel.seed     # Optional demo data

Standalone mode

git clone https://github.com/ccarvalho-eng/aludel.git
cd aludel/standalone
mix deps.get
mix ecto.setup
mix aludel.seed  # Optional demo data
mix phx.server
# Visit http://localhost:4000

Requirements: Elixir 1.19.5+, Erlang/OTP 28.4+, PostgreSQL 17+

Optional: ImageMagick v7+ (for PDF support with Ollama vision models)

Current status

Active development. Core features complete. Available on Hex.pm, with CI/CD and security scanning.

:white_check_mark: Multi-provider execution (OpenAI, Anthropic, Ollama)
:white_check_mark: Real-time result streaming with LiveView
:white_check_mark: Cost and latency tracking
:white_check_mark: Prompt versioning and evolution tracking
:white_check_mark: Evaluation suites with document attachments
:white_check_mark: Side-by-side comparison UI


Minor updates:

  1. Dashboard Revamp - Glass morphism styling
  2. Improved Dark Mode - Updated to One Dark color palette for better contrast
  3. Prompt Evolution - New evolution tab for tracking prompt performance over time (kudos to @mikehostetler for the idea; see feat: integrate GEPA for prompt evolution and optimization · Issue #12 · ccarvalho-eng/vial · GitHub)
  4. Branding Updates - Added beaker icon to navigation and favicon
  5. UI Polish - Modernized button design, improved modals, consistent styling across pages


Update: added some minor charts to the prompt evolution page


This looks pretty amazing, I will make sure to try it out

Any plans to add other providers?

Also, do you plan on making this a library in the future? For now it seems like it is its own Phoenix project, right? Being able to add it to an existing project would be great.

Any plans to add other providers?

We can! Maybe we’d need to modularize the LLM client interface a bit, but totally doable!

Also, do you plan on making this a library in the future? For now it seems like it is its own Phoenix project, right? Being able to add it to an existing project would be great.

If this becomes something super useful and more people are interested, why not? You’re suggesting something like Oban Web / Live Dashboard, locked into a dev route, right? Right now, the only way to use it is to run it on your own machine or deploy it as a private service.

Feel free to open an issue and start a conversation there about your ideas

Nice.

I looked a little at the code and noticed that you are using Req for the LLM requests, right? Any reason for not using ReqLLM instead? Technically it would give you support for a bunch of other providers through a common interface.

Yep exactly, I can see myself adding it to my projects as a dev-only route that I can use to prototype and test prompts.

Fair points.

I noticed ReqLLM requires an api_key param even for Ollama, so no strong preference. Mostly I wanted to have an MVP up and running to see whether people would like it. I can eventually push a PR to ReqLLM to patch this.

Not sure how much effort it would take to make this prompt lab “bootable”, but I can dig into the specifics.


Hey @sezaru, good news! I’ve got this working in a branch: feat: convert Vial to embeddable Phoenix LiveView library by ccarvalho-eng · Pull Request #19 · ccarvalho-eng/vial · GitHub

Have been banging my head and could only make it work after spending some time learning more about Oban Web’s internal architecture.

It’s a massive branch, but it’s working. I’ll try to polish it and reduce its size, but it’s unlikely that it will be a thinner PR.

Let me know what you think, and perhaps we can see if there’s a way to simplify some aspects.

Standalone and embedded mode are working seamlessly! I’ll keep the branch open for a few days until I’m fully confident about code quality and QA.

Updates

  • Standalone + Embeddable is merged to main. Please report any bugs.
  • I am thinking of rebranding, as Vial is too generic and it’s already taken on hex.pm

Potential logo:

  1. Flask (aludel) → controlled experiment → “workbench”
  2. Smoke → partial wireframe robot → model under evaluation, not finished output
  3. Wireframe + fade → abstraction, inspection, iteration

Update: Library was just published to hex.pm!


Final logo (for better or for worse :slight_smile:)


Prompts now have projects (folders) so we can organize them

Will eventually add more providers when I switch back to ReqLLM

Have been playing with document evaluation …

Update: version 0.1.7 is up! It adds improved test suite editing, document uploads, and JSON field asserts.


For now, I am storing docs in PostgreSQL (10MB max per document)
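
A 10MB cap like that can be enforced with a simple byte-size guard before the row ever reaches PostgreSQL. Illustrative only; the module and function names are hypothetical, not Aludel's actual upload path:

```elixir
defmodule DocStoreSketch do
  @max_bytes 10 * 1024 * 1024 # 10 MB cap for documents stored in PostgreSQL

  # Accept the binary only if it fits under the cap.
  def validate_size(binary) when byte_size(binary) <= @max_bytes, do: :ok
  def validate_size(_binary), do: {:error, :too_large}
end
```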


Aludel v0.1.8 Released - LLM Eval Workbench Updates

What’s New:

  • Visual/JSON toggle for assertion editors with dynamic field switching
  • Improved run configuration layout with side-by-side template preview
  • Fixed PubSub supervisor initialization

Aludel v0.1.9 Released - Enhanced Dashboard with Activity Charts & Metrics

This release focuses on dashboard improvements to help you track prompt performance and
costs more effectively (hopefully the stats are a bit more meaningful now, but I am no data analyst :sweat_smile:):

New Dashboard Features:

  • :bar_chart: Activity chart - Interactive 30-day visualization with hover tooltips
  • :chart_increasing: Trend indicators - 7-day comparison arrows showing run volume changes
  • :money_bag: Cost breakdowns - Toggle between provider and prompt views
  • :high_voltage: Latency percentiles - P50/P95 metrics alongside averages
  • Suite run failure highlighting
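
For anyone curious, P50/P95 over a window of latency samples can be computed with a naive nearest-rank method like this (a sketch, not necessarily how the dashboard computes it):

```elixir
defmodule PercentileSketch do
  # Nearest-rank percentile: sort the samples, then pick the value at rank
  # ceil(p/100 * n). Fine for dashboard-scale sample counts.
  def percentile([], _p), do: nil

  def percentile(samples, p) do
    sorted = Enum.sort(samples)
    index = max(ceil(p / 100 * length(sorted)) - 1, 0)
    Enum.at(sorted, index)
  end
end

PercentileSketch.percentile([120, 80, 200, 95, 150], 95)
# => 200
```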

Thanks. I will begin testing this out in the next couple of weeks.
