Evaluation
Evaluation measures whether an AI system behaves correctly on realistic examples.
Add evaluation before you optimize, scale, or trust an AI feature with important user outcomes.
Use when
- Measuring RAG answer quality
- Checking agent reliability
- Regression-testing prompts
- Comparing model or retrieval changes
Avoid when
- Vague goals with no success criteria
- One-off demos
- Purely subjective taste without examples
- Teams unwilling to inspect failures
Why evaluation comes early
Evaluation is how you stop guessing. It gives you a repeatable way to compare prompts, models, retrieval settings, tools, and guardrails.
For MVPs, evaluation can be lightweight: a spreadsheet of realistic examples, expected properties, and failure notes. The important part is that changes can be compared against the same cases.
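A minimal sketch of what that spreadsheet can become once it is code. The generate function, case questions, and expected strings below are hypothetical placeholders, not a real API; the point is that every variant runs against the same cases and failure notes are kept.

```python
# Minimal lightweight eval harness, assuming a hypothetical generate(question) -> answer
# callable that wraps your model or RAG pipeline. Case contents are illustrative.
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Case:
    question: str                       # realistic input
    must_include: list[str]             # expected properties of a good answer
    note: str = ""                      # failure notes recorded per case

CASES = [
    Case("How do I reset my password?", ["reset link", "email"]),
    Case("Which plans include SSO?", ["Enterprise"]),
]

def run_eval(generate: Callable[[str], str], cases: list[Case]) -> float:
    """Run every case through the same generate function and return the pass rate."""
    passed = 0
    for case in cases:
        answer = generate(case.question)
        missing = [s for s in case.must_include if s.lower() not in answer.lower()]
        if missing:
            case.note = f"missing: {missing}"   # keep a note explaining the failure
        else:
            passed += 1
    return passed / len(cases)

# Compare two prompt or model variants on the same cases:
# baseline_rate = run_eval(generate_v1, CASES)
# candidate_rate = run_eval(generate_v2, CASES)
```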
What to evaluate
Evaluate the product behavior, not just the model output. In RAG, inspect retrieved sources and final answers. In agents, inspect tool calls, stopping behavior, and recovery from failed steps.
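A sketch of what behavior-level checks can look like, assuming a hypothetical trace dictionary produced by your pipeline with sources, answer, and tool_calls fields; the key names are illustrative, not a real library's schema.

```python
# Behavior-level checks: flag retrieval problems and agent-loop problems separately,
# so a failure points at the layer that caused it.
def check_rag_trace(trace: dict, expected_doc_id: str) -> list[str]:
    """Inspect retrieved sources and the final answer, not just fluency."""
    problems = []
    retrieved_ids = [s["doc_id"] for s in trace["sources"]]
    if expected_doc_id not in retrieved_ids:
        problems.append("retrieval miss: expected source not retrieved")
    if not trace["answer"].strip():
        problems.append("empty answer")
    return problems

def check_agent_trace(trace: dict, max_steps: int = 10) -> list[str]:
    """Inspect tool calls, stopping behavior, and recovery from failed steps."""
    problems = []
    if len(trace["tool_calls"]) > max_steps:
        problems.append("did not stop: exceeded step budget")
    if any(call.get("error") and not call.get("retried") for call in trace["tool_calls"]):
        problems.append("failed tool call with no recovery attempt")
    return problems
```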
Common mistakes
- Waiting until production to evaluate.
- Measuring only fluency.
- Ignoring negative examples and edge cases.
- Changing prompts without regression checks (see the sketch after this list).
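One way to make the regression check concrete, assuming per-case pass/fail results are saved as a JSON baseline keyed by case id; the file name and case ids are hypothetical.

```python
# Regression check for prompt changes: compare the current run's per-case results
# against a saved baseline and flag any case that used to pass but now fails.
import json

def regressions(baseline_path: str, current: dict[str, bool]) -> list[str]:
    """Return the ids of cases that passed in the baseline but fail in the current run."""
    with open(baseline_path) as f:
        baseline: dict[str, bool] = json.load(f)
    return [case_id for case_id, passed in baseline.items()
            if passed and not current.get(case_id, False)]

# Example: block the prompt change if anything regressed.
# failed = regressions("baseline.json", {"reset-password": True, "sso-plans": False})
# assert not failed, f"regressions: {failed}"
```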
Next decision
Define what “good” means before choosing optimizations. Evaluation should guide RAG tuning, agent design, and fine-tuning decisions.