Overview
Evaluations
Evaluations provide a comprehensive framework for testing and comparing LLM models, stacks, and scenarios.
Evaluation Types
| Type | Description |
|---|---|
| benchmark | Standard benchmarks (MMLU, HumanEval, etc.) |
| task_specific | Custom task-specific evaluations |
| operational | Latency, throughput, and cost metrics |
| safety | Safety and alignment evaluations |
| comparison | A/B comparison between models/configs |
Evaluation Targets
Evaluations can target:
- Model: Direct model evaluation with provider and model ID
- Stack: Full stack configuration evaluation
- Scenario: Scenario with associated stack
- Comparison: Compare multiple targets head-to-head
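The sketch below illustrates how these four target kinds might be expressed; all field names (target_type, provider, model_id, stack_id, scenario_id, targets) are assumed for illustration and may not match the actual schema:

```json
[
  { "target_type": "model", "provider": "anthropic", "model_id": "claude-3-5-sonnet-20241022" },
  { "target_type": "stack", "stack_id": "stack-prod-rag" },
  { "target_type": "scenario", "scenario_id": "scn-support-triage", "stack_id": "stack-prod-rag" },
  {
    "target_type": "comparison",
    "targets": [
      { "target_type": "stack", "stack_id": "stack-prod-rag" },
      { "target_type": "stack", "stack_id": "stack-candidate-rag" }
    ]
  }
]
```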
LLM-as-Judge
Task-specific evaluations use LLM judges for automated assessment:
```json
{
  "custom_eval": {
    "prompt_template": "Evaluate this response for accuracy...",
    "criteria": [
      { "name": "accuracy", "weight": 0.4 },
      { "name": "completeness", "weight": 0.3 },
      { "name": "clarity", "weight": 0.3 }
    ],
    "judge_config": {
      "provider": "anthropic",
      "model": "claude-3-5-sonnet-20241022"
    }
  }
}
```
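If the judge scores each criterion on a 0 to 1 scale and the weights (which sum to 1.0 above) form a weighted average, the overall score for the illustrative values below would be 0.4 × 0.90 + 0.3 × 0.80 + 0.3 × 0.70 = 0.81. Both the aggregation rule and the result shape here are assumptions:

```json
{
  "scores": { "accuracy": 0.90, "completeness": 0.80, "clarity": 0.70 },
  "overall": 0.81
}
```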
Workflow
- Create: Define evaluation with targets and methodology
- Configure: Add test inputs and scoring criteria
- Run: Execute evaluation against targets
- Compare: View results with delta visualization
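Putting the workflow together, a complete evaluation definition might look roughly like the sketch below; every field name is illustrative rather than a documented schema:

```json
{
  "name": "support-bot-accuracy",
  "type": "task_specific",
  "targets": [
    { "target_type": "stack", "stack_id": "stack-prod-rag" },
    { "target_type": "stack", "stack_id": "stack-candidate-rag" }
  ],
  "test_inputs": [
    { "id": "case-001", "prompt": "How do I reset my password?" },
    { "id": "case-002", "prompt": "What is your refund policy?" }
  ],
  "custom_eval": {
    "prompt_template": "Evaluate this response for accuracy...",
    "criteria": [
      { "name": "accuracy", "weight": 0.4 },
      { "name": "completeness", "weight": 0.3 },
      { "name": "clarity", "weight": 0.3 }
    ],
    "judge_config": { "provider": "anthropic", "model": "claude-3-5-sonnet-20241022" }
  }
}
```

Running a definition like this against both stacks would then give the Compare step per-criterion scores for each target, which the delta visualization can set side by side.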