
Creating Your First Model Evaluation


When I ship an AI pipeline, I want to measure quality systematically, so I can catch regressions and justify improvements with data.

The Challenge

Your team shipped a RAG pipeline last month. Users report that some answers are “wrong” or “incomplete,” but without a systematic way to measure quality, you can’t tell if the problem is the retrieval, the model, or the prompt template. When someone suggests trying GPT-4o instead of Claude, you have no baseline to compare against.

This is the evaluation gap that plagues ML teams. You have intuitions about model quality—“Claude seems better at reasoning”—but no data to back them up. Running a few test prompts by hand doesn’t scale, and building custom evaluation harnesses takes weeks of engineering time.

How Lattice Helps

The Evaluation Framework lets you define evaluations once and run them repeatedly as models, prompts, and pipelines evolve. Instead of ad-hoc testing, you configure structured evaluations with clear targets, scoring methods, and success criteria.

This journey walks through creating your first evaluation, from opening the creation modal to configuring LLM-as-Judge scoring for a custom test set.

Step 1: Navigate to Evaluations

Open the Studio panel on the right side of the interface. Scroll to the Evaluations section—it shows any existing evaluations in your workspace.

Click the + Create Evaluation button to open the creation modal.

Step 2: Configure Basic Information

Name your evaluation:

RAG Pipeline Quality Assessment

Add a description:

Evaluate answer quality, factual accuracy, and
completeness for our internal documentation Q&A system

Clear naming matters when you accumulate dozens of evaluations.

Step 3: Select Evaluation Type

Type            When to Use
Benchmark       Measure general model capabilities
Task-Specific   Evaluate your specific use case
Comparison      Head-to-head model comparison
Operational     Profile latency, throughput, cost
Safety          Test for toxicity, bias, jailbreaks

For evaluating your RAG pipeline’s answer quality, select Task-Specific.

Step 4: Select Evaluation Targets

Choose what you’re evaluating:

  • Models: Individual LLM providers
  • Stacks: Complete infrastructure configurations
  • Scenarios: Workload definitions with SLOs

For a single-system evaluation: Click your current RAG stack to add it as the target.

For a comparison evaluation: Select multiple targets (2-4) to compare side-by-side.

Step 5: Define Test Inputs

Manual Entry:

[
  {
    "input": "What is our refund policy for enterprise customers?",
    "reference": "Enterprise customers receive full refunds within 90 days of purchase."
  },
  {
    "input": "How do I enable SSO with Okta?",
    "reference": "Navigate to Settings > Security > SSO. Click 'Add Provider' and select Okta."
  },
  {
    "input": "What are the rate limits for the API?",
    "reference": "Free tier: 100 requests/day. Pro tier: 10,000 requests/day."
  }
]

Bulk Import: Click Import to upload a JSON or CSV file with your test set.
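
If your test set lives in a spreadsheet, a small script can convert it to the JSON shape shown above before import. A minimal sketch, assuming a CSV with input and reference columns; the file names and column names here are illustrative, not part of Lattice:

# Convert a CSV test set (columns: input, reference) into the
# list-of-objects JSON format shown above. Names are examples.
import csv
import json

rows = []
with open("test_set.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        rows.append({"input": row["input"], "reference": row["reference"]})

with open("test_set.json", "w", encoding="utf-8") as f:
    json.dump(rows, f, indent=2)

print(f"Wrote {len(rows)} test inputs")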

Step 6: Configure Scoring Method

Method                Description
Exact Match           Binary correct/incorrect
Fuzzy Match           Levenshtein distance threshold
Semantic Similarity   Embedding cosine similarity
LLM-as-Judge          LLM evaluates response quality

For evaluating natural language quality, select LLM-as-Judge.
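
To build intuition for how the non-LLM methods differ, here is a rough sketch of each. This is not Lattice's implementation; the normalization, the similarity stand-in, and the threshold are illustrative:

# Sketches of the first three scoring methods. Thresholds and
# normalization are examples, not Lattice's defaults.
from difflib import SequenceMatcher
import math

def exact_match(output: str, reference: str) -> bool:
    # Binary correct/incorrect after trivial normalization.
    return output.strip().lower() == reference.strip().lower()

def fuzzy_match(output: str, reference: str, threshold: float = 0.8) -> bool:
    # Edit-distance-style similarity; difflib stands in for Levenshtein here.
    return SequenceMatcher(None, output, reference).ratio() >= threshold

def semantic_similarity(output_vec: list[float], reference_vec: list[float]) -> float:
    # Cosine similarity between embedding vectors (higher = more similar).
    dot = sum(a * b for a, b in zip(output_vec, reference_vec))
    norm = math.sqrt(sum(a * a for a in output_vec)) * math.sqrt(sum(b * b for b in reference_vec))
    return dot / norm if norm else 0.0

Exact and fuzzy match suit short, canonical answers; free-form RAG responses usually need semantic similarity or a judge.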

Step 7: Configure LLM-as-Judge

Scale Type:

Likert 5-point (1-5 rating scale)

Comparison Mode:

Reference (compare against reference answer)

Scoring Prompt:

You are evaluating an AI assistant's response quality.
Question: {{input}}
Response: {{output}}
Reference Answer: {{reference}}
Evaluate the response on these criteria:
1. Factual accuracy compared to reference
2. Completeness of information
3. Clarity and organization
Rate the response from 1-5.
Provide your rating as JSON: {"score": N, "reasoning": "..."}
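
Conceptually, a judge prompt like this is template filling plus parsing the judge's JSON reply. A minimal sketch, where the {{input}}-style placeholders become Python format fields and call_judge stands in for whatever LLM call you use; neither is Lattice's actual API:

import json
import re

# Mirrors the scoring prompt above, with {input}/{output}/{reference}
# as Python format fields instead of {{...}} template placeholders.
PROMPT_TEMPLATE = """You are evaluating an AI assistant's response quality.
Question: {input}
Response: {output}
Reference Answer: {reference}
Evaluate the response on these criteria:
1. Factual accuracy compared to reference
2. Completeness of information
3. Clarity and organization
Rate the response from 1-5.
Provide your rating as JSON: {{"score": N, "reasoning": "..."}}"""

def judge_one(item: dict, output: str, call_judge) -> dict:
    # Fill the template, ask the judge model, and pull the JSON verdict
    # out of its reply (judges sometimes wrap the JSON in extra text).
    prompt = PROMPT_TEMPLATE.format(input=item["input"], output=output,
                                    reference=item["reference"])
    reply = call_judge(prompt)  # placeholder: your judge-model call goes here
    match = re.search(r"\{.*\}", reply, re.DOTALL)
    return json.loads(match.group(0)) if match else {"score": None, "reasoning": reply}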

Rubric (Optional):

Criterion       Weight   Description
Accuracy        50%      Factual correctness vs reference
Completeness    30%      Covers all relevant points
Clarity         20%      Well-organized, easy to understand
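
When the judge scores each criterion separately, the weights above collapse them into a single number. A sketch of that weighted average; the per-criterion score format is an assumption, not Lattice's output schema:

# Combine per-criterion judge scores (1-5) into one weighted score.
# Weights mirror the rubric above; the input format is illustrative.
RUBRIC_WEIGHTS = {"accuracy": 0.5, "completeness": 0.3, "clarity": 0.2}

def weighted_score(criterion_scores: dict[str, float]) -> float:
    return sum(RUBRIC_WEIGHTS[name] * criterion_scores[name]
               for name in RUBRIC_WEIGHTS)

# Example: accuracy 5, completeness 4, clarity 3 -> 4.3
print(weighted_score({"accuracy": 5, "completeness": 4, "clarity": 3}))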

Judge Model:

claude-3-5-haiku-20241022 (Anthropic)

Haiku is recommended for judging—it’s cost-effective while maintaining evaluation quality.

Step 8: Configure Methodology

Setting            Recommended   Description
Sample Size        50-100        Number of inputs to evaluate
Confidence Level   95%           For confidence interval calculation
Concurrency        10            Parallel requests to LLM
Timeout            60s           Per-request timeout
Random Seed        42            For reproducibility
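
These settings map onto standard statistics: the seed fixes which inputs get sampled, and the confidence level sets the interval reported around the mean score. A rough sketch using a normal approximation; Lattice's exact method may differ:

import random
import statistics

def sample_inputs(test_inputs: list, sample_size: int = 50, seed: int = 42) -> list:
    # A fixed seed draws the same subset on every run (reproducibility).
    return random.Random(seed).sample(test_inputs, min(sample_size, len(test_inputs)))

def mean_with_ci(scores: list[float], z: float = 1.96) -> tuple[float, float, float]:
    # 95% confidence interval via the normal approximation (z = 1.96).
    mean = statistics.mean(scores)
    sem = statistics.stdev(scores) / len(scores) ** 0.5 if len(scores) > 1 else 0.0
    return mean, mean - z * sem, mean + z * sem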

Step 9: Review and Create

The final step shows a summary:

Evaluation: RAG Pipeline Quality Assessment
Type: Task-Specific
Targets: 1 (Production RAG Stack)
Test Inputs: 50
Scoring: LLM-as-Judge (Likert-5, Reference mode)
Judge: claude-3-5-haiku-20241022

Click Create Evaluation to save.

The evaluation is created in Pending status, ready to run when you are.

Real-World Scenarios

A product team validating prompt changes creates a task-specific evaluation with 100 production questions. Before deploying a new system prompt, they run the evaluation against both prompts and compare scores.

An ML engineer comparing embedding models creates a comparison evaluation targeting their RAG stack with two different embedding configurations. The evaluation runs identical questions against both.

A research scientist evaluating reasoning creates a benchmark evaluation with a GSM8K (grade-school math) subset. They target three models and compare accuracy.

What You’ve Accomplished

Your pending evaluation is ready to run with:

  • Clear targets and test inputs defined
  • LLM-as-Judge scoring configured
  • Statistical methodology set
  • Reproducibility ensured via random seed

What’s Next

The evaluation workflow continues with:

  • Run Evaluation: Execute and monitor progress
  • View Results: See scores, statistics, and breakdowns
  • Compare Results: Side-by-side visualization with statistical significance

The Evaluation Framework is available in Lattice. Build confidence in your AI systems with systematic measurement.

Ready to Try Lattice?

Get lifetime access to Lattice for confident AI infrastructure decisions.

Get Lattice for $99