Overview
Evaluations
Evaluations provide a comprehensive framework for testing and comparing LLM models, stacks, and scenarios.
Evaluation Types
| Type | Description |
|---|---|
| benchmark | Standard benchmarks (MMLU, HumanEval, etc.) |
| task_specific | Custom task-specific evaluations |
| operational | Latency, throughput, and cost metrics |
| safety | Safety and alignment evaluations |
| comparison | A/B comparison between models/configs |
Evaluation Targets
Evaluations can target:
- Model: Direct model evaluation with provider and model ID
- Stack: Full stack configuration evaluation
- Scenario: Scenario with associated stack
- Comparison: Compare multiple targets head-to-head
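The sketch below illustrates how these four target kinds might be expressed; all field names (target_type, provider, model_id, stack_id, scenario_id, targets) are assumed for illustration and may not match the actual schema:

```json
[
  { "target_type": "model", "provider": "anthropic", "model_id": "claude-3-5-sonnet-20241022" },
  { "target_type": "stack", "stack_id": "stack-prod-rag" },
  { "target_type": "scenario", "scenario_id": "scn-support-triage", "stack_id": "stack-prod-rag" },
  {
    "target_type": "comparison",
    "targets": [
      { "target_type": "stack", "stack_id": "stack-prod-rag" },
      { "target_type": "stack", "stack_id": "stack-candidate-rag" }
    ]
  }
]
```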
LLM-as-Judge
Task-specific evaluations use LLM judges for automated assessment:
```json
{
  "custom_eval": {
    "prompt_template": "Evaluate this response for accuracy...",
    "criteria": [
      { "name": "accuracy", "weight": 0.4 },
      { "name": "completeness", "weight": 0.3 },
      { "name": "clarity", "weight": 0.3 }
    ],
    "judge_config": {
      "provider": "anthropic",
      "model": "claude-3-5-sonnet-20241022"
    }
  }
}
```
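If the judge scores each criterion on a 0 to 1 scale and the weights (which sum to 1.0 above) form a weighted average, the overall score for the illustrative values below would be 0.4 × 0.90 + 0.3 × 0.80 + 0.3 × 0.70 = 0.81. Both the aggregation rule and the result shape here are assumptions:

```json
{
  "scores": { "accuracy": 0.90, "completeness": 0.80, "clarity": 0.70 },
  "overall": 0.81
}
```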
Workflow
- Create: Define evaluation with targets and methodology
- Configure: Add test inputs and scoring criteria
- Run: Execute evaluation against targets
- Compare: View results with delta visualization
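Putting the workflow together, a complete evaluation definition might look roughly like the sketch below; every field name is illustrative rather than a documented schema:

```json
{
  "name": "support-bot-accuracy",
  "type": "task_specific",
  "targets": [
    { "target_type": "stack", "stack_id": "stack-prod-rag" },
    { "target_type": "stack", "stack_id": "stack-candidate-rag" }
  ],
  "test_inputs": [
    { "id": "case-001", "prompt": "How do I reset my password?" },
    { "id": "case-002", "prompt": "What is your refund policy?" }
  ],
  "custom_eval": {
    "prompt_template": "Evaluate this response for accuracy...",
    "criteria": [
      { "name": "accuracy", "weight": 0.4 },
      { "name": "completeness", "weight": 0.3 },
      { "name": "clarity", "weight": 0.3 }
    ],
    "judge_config": { "provider": "anthropic", "model": "claude-3-5-sonnet-20241022" }
  }
}
```

Running a definition like this against both stacks would then give the Compare step per-criterion scores for each target, which the delta visualization can set side by side.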