Evaluate systematically

AI systems behave differently from traditional software: the same input can produce different outputs, there’s rarely a single correct answer, and a change that improves one metric can silently degrade another. Systematic evaluation is how you measure quality, detect regressions before they reach users, and build confidence that your system is actually improving over time.

Braintrust supports evaluation at every stage of AI development, from rapid iteration in the browser to systematic experiments to continuous production monitoring. The full evaluation cycle:

Iterate in playgrounds — Prototype prompts, models, scorers, or custom agent code
Promote to an experiment — Lock in an immutable snapshot when you find a good configuration
Automate in CI/CD — Run evals on every pull request to catch regressions
Score in production — Monitor live traffic continuously with online scoring rules
Feed back — Pull interesting production traces into datasets to improve offline test coverage

Offline evaluation

Offline evaluation runs against known datasets before deployment. Because you control the inputs and can define expected outputs, you can use code-based scorers or LLM-as-a-judge — and results are reproducible and comparable over time.
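As an illustration of a code-based scorer, the sketch below compares a task output against an expected output using a deterministic string-similarity measure from Python's standard library. The name `similarity_score` and the normalization rules are assumptions made for this example; it is not part of the Braintrust SDK or autoevals.

```python
from difflib import SequenceMatcher


def similarity_score(output: str, expected: str) -> float:
    """Return a score in [0, 1]; 1.0 means the normalized strings are identical."""
    # Normalize whitespace and case so trivial formatting differences don't count.
    a = output.strip().lower()
    b = expected.strip().lower()
    return SequenceMatcher(None, a, b).ratio()


print(similarity_score(" Paris ", "paris"))        # identical after normalization
print(similarity_score("Paris, France", "Paris"))  # partial credit, below 1.0
```

Because the function is deterministic, re-running it on the same dataset gives the same scores, which is what makes offline results comparable across runs.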

Iterate in playgrounds

Playgrounds are a browser-based environment for rapid iteration. Run evaluations in real time, compare configurations side by side, and share results with teammates via URL. Results are mutable — re-running a playground overwrites previous generations, which is ideal for fast iteration. When your task can’t be expressed as a prompt, connect custom agent code to the playground via remote evals or sandboxes. The iteration workflow stays the same. When you’ve found a good configuration, promote it to an experiment to capture an immutable snapshot.

Run experiments

Experiments are the immutable, comparable record of your eval runs. Run them from code or in the UI, track progress over time, and integrate into CI/CD to catch regressions before they reach production.
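One way to picture a CI regression check is as a comparison between the mean scores of a baseline experiment and a candidate experiment. The sketch below assumes the per-scorer scores are already available as plain dictionaries; `find_regressions`, the scorer names, and the 0.02 tolerance are illustrative assumptions, not Braintrust behavior.

```python
from statistics import mean


def find_regressions(baseline, candidate, tolerance=0.02):
    """Return scorers whose mean score dropped by more than `tolerance`.

    Both arguments map a scorer name to a list of per-case scores in [0, 1].
    """
    regressions = {}
    for name, base_scores in baseline.items():
        if name not in candidate:
            continue  # scorer was not run on the candidate; nothing to compare
        delta = mean(candidate[name]) - mean(base_scores)
        if delta < -tolerance:
            regressions[name] = round(delta, 4)
    return regressions


# Hypothetical scores: the candidate improves conciseness but hurts factuality.
baseline = {"factuality": [1.0, 1.0, 0.5, 1.0], "conciseness": [0.5, 0.5, 1.0, 1.0]}
candidate = {"factuality": [1.0, 0.5, 0.5, 1.0], "conciseness": [1.0, 1.0, 1.0, 1.0]}

regressions = find_regressions(baseline, candidate)
print(regressions)  # {'factuality': -0.125}
```

In a CI job, a non-empty result would typically fail the build. Checking each scorer separately matters because an overall average can hide the case where one metric improves while another degrades.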

Online evaluation

Online scoring evaluates production traces automatically as they’re logged, running asynchronously with no impact on latency. Because there’s no ground truth for live requests, it relies on LLM-as-a-judge scorers to assess quality. Use it to monitor for regressions, catch edge cases you haven’t seen before, and surface real user interactions that become new test cases.
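The idea of scoring off the request path can be sketched locally with a queue and a background worker. This is only a conceptual analogy, not how Braintrust implements online scoring: `judge`, `scoring_worker`, and `handle_request` are hypothetical names, and `judge` is a trivial reference-free heuristic standing in for a real LLM-as-a-judge call.

```python
import queue
import threading


def judge(trace):
    """Placeholder for an LLM-as-a-judge call; needs no expected output."""
    return 0.0 if "I don't know" in trace["output"] else 1.0


def scoring_worker(traces, scores):
    """Score logged traces in the background until a None sentinel arrives."""
    while True:
        trace = traces.get()
        if trace is None:
            break
        scores[trace["id"]] = judge(trace)


traces = queue.Queue()
scores = {}
worker = threading.Thread(target=scoring_worker, args=(traces, scores))
worker.start()


def handle_request(trace_id, user_input):
    """Request path: respond, log the trace, and return without waiting for a score."""
    output = "I don't know" if not user_input.strip() else f"Answer to: {user_input}"
    traces.put({"id": trace_id, "input": user_input, "output": output})
    return output


handle_request("t1", "What is an eval?")
handle_request("t2", "   ")

traces.put(None)  # stop the worker so this demo can exit
worker.join()
print(scores)  # {'t1': 1.0, 't2': 0.0}
```

The request handler only enqueues the trace, so the time spent scoring never adds to response latency. Low-scoring traces such as `t2` are the kind of interaction worth pulling into a dataset as a new test case.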

Anatomy of an evaluation

Every evaluation has three parts:

Data — a dataset of test cases with inputs, optional expected outputs, and metadata. Build datasets from production logs, user feedback, or manual curation.
Task — the function being evaluated. Typically an LLM call, but can be any logic: a multi-step agent, a retrieval pipeline, or a custom workflow.
Scorers and classifiers — scorers measure quality with numeric scores, while classifiers apply categorical labels. To build them, use built-in autoevals, LLM-as-a-judge, or custom code.
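The three parts map naturally onto plain code. The sketch below is a minimal illustration in standard-library Python rather than the Braintrust SDK: `EvalCase`, `task`, `exact_match`, `length_label`, and `run_eval` are hypothetical names, and the task is a stand-in for a real LLM call.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class EvalCase:
    """Data: one test case with an input, optional expected output, and metadata."""
    input: str
    expected: Optional[str] = None
    metadata: dict = field(default_factory=dict)


def task(input: str) -> str:
    """Task: stand-in for the function under evaluation (normally an LLM call)."""
    return "Hi " + input


def exact_match(output: str, expected: Optional[str]) -> float:
    """Scorer: returns a numeric score, 1.0 for a match and 0.0 otherwise."""
    return 1.0 if output == expected else 0.0


def length_label(output: str) -> str:
    """Classifier: returns a categorical label instead of a number."""
    return "short" if len(output) < 10 else "long"


def run_eval(data, task_fn, scorer, classifier):
    """Run the task on every case and record its score and label."""
    results = []
    for case in data:
        output = task_fn(case.input)
        results.append({
            "input": case.input,
            "output": output,
            "score": scorer(output, case.expected),
            "label": classifier(output),
        })
    return results


data = [
    EvalCase(input="Ada", expected="Hi Ada", metadata={"source": "manual"}),
    EvalCase(input="Grace Hopper", expected="Hello Grace Hopper"),
]
results = run_eval(data, task, exact_match, length_label)
for row in results:
    print(row)
```

Keeping the three parts as separate arguments means any one of them can be swapped (a new dataset, a different model behind `task`, a stricter scorer) while the other two stay fixed, which is what makes runs comparable.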

Next steps

Test prompts and models in the playground
Test complex agents in the playground via remote evals or sandboxes
Run experiments with the SDK or in the UI
Run in CI/CD to catch regressions automatically
Score production traces with online scoring rules
Best practices for reliable evaluations
