Overview

Evaluations provide a framework for testing and comparing LLMs, stacks, and scenarios.

Evaluation Types

Type            Description
benchmark       Standard benchmarks (MMLU, HumanEval, etc.)
task_specific   Custom task-specific evaluations
operational     Latency, throughput, cost metrics
safety          Safety and alignment evaluations
comparison      A/B comparison between models/configs
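
The sketch below shows how an evaluation definition might declare its type; the field names ("eval_type", "benchmarks", "metrics") are illustrative assumptions, not a documented schema.

# Illustrative sketch only: field names are assumed, not taken from a documented schema.
benchmark_eval = {
    "name": "mmlu-humaneval-baseline",
    "eval_type": "benchmark",        # standard benchmarks
    "benchmarks": ["mmlu", "humaneval"],
}

operational_eval = {
    "name": "latency-cost-check",
    "eval_type": "operational",      # latency, throughput, cost metrics
    "metrics": ["p95_latency_ms", "cost_per_request"],
}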

Evaluation Targets

Evaluations can target any of the following; a sketch of possible target payloads appears after the list:

  • Model: Direct model evaluation with provider and model ID
  • Stack: Full stack configuration evaluation
  • Scenario: Scenario with associated stack
  • Comparison: Compare multiple targets head-to-head
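
The payloads below are a hedged sketch of what each target kind could look like; every key ("type", "provider", "model_id", "stack_id", "scenario_id") is an assumption for illustration, not confirmed schema.

# Hypothetical target payloads; the keys are assumed for illustration only.
model_target = {"type": "model", "provider": "anthropic", "model_id": "claude-3-5-sonnet-20241022"}
stack_target = {"type": "stack", "stack_id": "stack_prod_rag_v2"}             # full stack configuration
scenario_target = {"type": "scenario", "scenario_id": "scn_support_triage"}   # scenario with its associated stack
comparison_target = {"type": "comparison", "targets": [model_target, stack_target]}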

LLM-as-Judge

Task-specific evaluations use LLM judges for automated assessment:

{
  "custom_eval": {
    "prompt_template": "Evaluate this response for accuracy...",
    "criteria": [
      {"name": "accuracy", "weight": 0.4},
      {"name": "completeness", "weight": 0.3},
      {"name": "clarity", "weight": 0.3}
    ],
    "judge_config": {
      "provider": "anthropic",
      "model": "claude-3-5-sonnet-20241022"
    }
  }
}
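
For intuition, the criterion weights presumably combine into a single overall score as a weighted sum of per-criterion judge scores. The helper below is a hypothetical sketch of that aggregation, not the framework's actual scoring code.

# Hypothetical aggregation: assumes the judge scores each criterion from 0 to 1 and
# that the overall score is the weight-normalized sum of those scores.
def aggregate_score(criteria, judge_scores):
    total_weight = sum(c["weight"] for c in criteria)
    weighted_sum = sum(c["weight"] * judge_scores[c["name"]] for c in criteria)
    return weighted_sum / total_weight if total_weight else 0.0

criteria = [
    {"name": "accuracy", "weight": 0.4},
    {"name": "completeness", "weight": 0.3},
    {"name": "clarity", "weight": 0.3},
]
print(aggregate_score(criteria, {"accuracy": 0.9, "completeness": 0.7, "clarity": 0.8}))  # 0.81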

Workflow

  1. Create: Define evaluation with targets and methodology
  2. Configure: Add test inputs and scoring criteria
  3. Run: Execute evaluation against targets
  4. Compare: View results with delta visualization (sketched end-to-end below)
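
As a rough end-to-end sketch of these four steps, the calls below use placeholder REST endpoints; the base URL, routes, and field names are assumptions for illustration, not documented API surface.

# Hypothetical REST sketch of the four workflow steps; endpoints and fields are assumed.
import requests

BASE = "https://api.example.com/v1"              # placeholder base URL
headers = {"Authorization": "Bearer <api-key>"}

# 1. Create: define the evaluation with a type and targets
evaluation = requests.post(f"{BASE}/evaluations", headers=headers, json={
    "name": "support-triage-comparison",
    "eval_type": "comparison",
    "targets": [
        {"type": "model", "provider": "anthropic", "model_id": "claude-3-5-sonnet-20241022"},
        {"type": "stack", "stack_id": "stack_prod_rag_v2"},
    ],
}).json()

# 2. Configure: attach test inputs and scoring criteria
requests.post(f"{BASE}/evaluations/{evaluation['id']}/inputs", headers=headers, json={
    "inputs": [{"prompt": "Summarize this support ticket ..."}],
})

# 3. Run: execute the evaluation against its targets
run = requests.post(f"{BASE}/evaluations/{evaluation['id']}/runs", headers=headers).json()

# 4. Compare: fetch results, including per-target deltas
results = requests.get(f"{BASE}/runs/{run['id']}/results", headers=headers).json()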