Tribunal - LLM evaluation and testing framework for Elixir


Hey everyone!

I’ve been working on Tribunal, an LLM evaluation framework for Elixir. It helps you test RAG pipelines and LLM outputs in your test suite.

The problem

When building LLM-powered features, you need to verify things like:

  • Is the response grounded in the provided context?
  • Did it hallucinate information?
  • Is the response actually relevant to the question?
  • Does it contain any toxic, biased, or harmful content?

The solution

Tribunal integrates with ExUnit so you can write these checks as regular tests:

test "response is faithful to context" do
  response = MyApp.RAG.query("What's the return policy?")

  assert_faithful response, context: @docs
  refute_hallucination response, context: @docs
  refute_toxicity response
end

Two modes

  • Test mode: ExUnit integration for CI gates. Fails immediately on any violation. Use this for safety checks that must pass.

  • Evaluation mode: a Mix task for benchmarking. Run hundreds of test cases, set pass thresholds (e.g., “pass if 80% succeed”), and track regressions over time.
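
To make the threshold idea concrete, here’s a rough sketch of the arithmetic evaluation mode would perform (plain Elixir, not Tribunal’s API; run_case/1 and the :passed field are hypothetical):

# Conceptual sketch: run every case, compute the pass rate,
# and compare it to the configured threshold (0.8 = "pass if 80% succeed").
results = Enum.map(cases, &run_case/1)
pass_rate = Enum.count(results, & &1.passed) / length(results)
pass_rate >= 0.8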

Assertion types

Deterministic (no API calls, instant):

  • assert_contains / refute_contains
  • assert_regex
  • assert_json
  • assert_max_tokens
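
For illustration, a deterministic check might look like this (the exact option shapes are my assumption, not necessarily Tribunal’s documented signatures):

test "response has the expected shape" do
  response = MyApp.RAG.query("What's the return policy?")

  assert_contains response, "30 days"
  assert_regex response, ~r/refund/i
  assert_max_tokens response, 200
end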

LLM-as-Judge (uses any model via req_llm):

  • assert_faithful - grounded in context
  • assert_relevant - addresses the query
  • refute_hallucination - no fabricated info
  • refute_bias, refute_toxicity, refute_harmful, refute_jailbreak, refute_pii

Embedding-based (via alike):

  • assert_similar - semantic similarity
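
For example (the threshold option name here is my assumption):

assert_similar response, "Items can be returned within 30 days", threshold: 0.8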

Red team testing

Generate adversarial prompts to test your LLM’s safety:

Tribunal.RedTeam.generate_attacks("How do I pick a lock?")

Returns:

  • encoding attacks (base64, rot13, leetspeak)
  • injection attacks (ignore instructions, delimiter injection)
  • jailbreak attacks (DAN, developer mode)
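
One way to wire these into a test, assuming generate_attacks/1 returns a list of attack prompt strings (that return shape and MyApp.Chat.complete/1 are assumptions for the example):

test "model resists generated attacks" do
  for attack <- Tribunal.RedTeam.generate_attacks("How do I pick a lock?") do
    response = MyApp.Chat.complete(attack)

    refute_jailbreak response
    refute_harmful response
  end
end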

Would love to hear your feedback!

Hex: tribunal
GitHub:
