Evaluate AI Models

This guide walks you through a systematic approach to evaluating AI models for production deployment using Lattice.

Model evaluation involves:

  1. Defining your requirements
  2. Gathering relevant documentation
  3. Comparing capabilities and limitations
  4. Testing with your specific use cases
  5. Making a data-driven decision

Before evaluating models, clarify what matters most:

  • Latency: What response time is acceptable?
  • Throughput: How many requests per second?
  • Context window: How much input do you need to process?
  • Budget: What monthly spend is acceptable?
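If a prototype or an existing service is already handling similar traffic, you can derive the latency and throughput numbers from measured data instead of guessing. Here is a minimal Python sketch; the request log and its fields are hypothetical placeholders, not anything Lattice provides:

import statistics
from datetime import datetime, timedelta

# Hypothetical request log as (timestamp, latency in milliseconds).
# In practice, load this from your own gateway or tracing logs.
requests = [
    (datetime(2024, 5, 1, 12, 0, 0) + timedelta(seconds=i * 0.02),
     640 + (i % 7) * 55)
    for i in range(5000)
]

latencies_ms = [latency for _, latency in requests]

# 95th percentile latency: quantiles(n=100) returns 99 cut points,
# so index 94 is the 95th percentile.
p95_ms = statistics.quantiles(latencies_ms, n=100)[94]

# Average throughput over the observed window.
window_s = (requests[-1][0] - requests[0][0]).total_seconds()
throughput_rps = len(requests) / window_s

print(f"P95 latency: {p95_ms:.0f} ms")
print(f"Throughput: {throughput_rps:.1f} requests/s")

The resulting P95 and requests-per-second figures map directly onto the p95_latency_ms and throughput_rps fields in the scenario below.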

Create a scenario in Lattice to capture these requirements:

name: Model Evaluation - RAG Chatbot
workload_type: rag
traffic_profile: medium_volume
slo_requirements:
  p95_latency_ms: 1000
  throughput_rps: 50
budget:
  monthly_limit_usd: 3000
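If you also keep a copy of the scenario as a local YAML file (a hypothetical scenario.yaml mirroring the fields above), a short script can check benchmark results against the SLOs before you commit to a model. This is a local convenience sketch, not a Lattice API:

import yaml  # PyYAML (pip install pyyaml)

# Hypothetical local copy of the scenario shown above.
with open("scenario.yaml") as f:
    scenario = yaml.safe_load(f)

slo = scenario["slo_requirements"]
budget = scenario["budget"]

# Example measurements; substitute your own benchmark results.
measured = {"p95_latency_ms": 820, "throughput_rps": 52, "monthly_cost_usd": 180}

checks = {
    "p95 latency": measured["p95_latency_ms"] <= slo["p95_latency_ms"],
    "throughput": measured["throughput_rps"] >= slo["throughput_rps"],
    "budget": measured["monthly_cost_usd"] <= budget["monthly_limit_usd"],
}

for name, ok in checks.items():
    print(f"{name}: {'PASS' if ok else 'FAIL'}")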

Apply relevant blueprints to populate your workspace:

  1. Go to Blueprints and apply vendor blueprints for models you’re considering
  2. Add any additional sources (benchmarks, blog posts, comparison articles)
  3. Wait for indexing to complete

For a Claude vs GPT-4 evaluation, you might apply:

  • Anthropic Claude Blueprint
  • OpenAI GPT-4 Blueprint
  • Recent benchmark URLs, added manually as additional sources

Ask Lattice to create a structured comparison:

Create a detailed comparison table of Claude Sonnet and GPT-4 Turbo
for RAG applications. Include:
- Context window size
- Input/output pricing
- Latency benchmarks
- Key strengths
- Known limitations

Save the generated table to Studio for reference.

With your scenario active, ask targeted questions:

Given my RAG chatbot scenario with P95 latency under 1000ms
and $3000/month budget, which model is the better choice?
Show your reasoning with citations.

Lattice will take your scenario's specific constraints into account when making its recommendation.

Investigate the areas that matter most to your use case. For example, to dig into latency:

What are the typical first-token and total response latencies
for Claude Sonnet vs GPT-4 Turbo? Include any regional variations.
To estimate cost:

Calculate the monthly cost for each model assuming:
- 1 million input tokens/day
- 200K output tokens/day
- 30 days
Include prompt caching benefits if available.
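As a sanity check on whatever Lattice returns, this cost arithmetic is simple enough to reproduce directly. The sketch below uses the list prices that appear in the comparison table at the end of this guide; treat them as assumptions and verify current pricing with each vendor:

# Rough monthly cost estimate for the token volumes in the prompt above.
PRICES = {  # (input USD per 1M tokens, output USD per 1M tokens)
    "claude-sonnet": (3.00, 15.00),
    "gpt-4-turbo": (10.00, 30.00),
}

INPUT_TOKENS_PER_DAY = 1_000_000
OUTPUT_TOKENS_PER_DAY = 200_000
DAYS = 30

for model, (in_price, out_price) in PRICES.items():
    input_cost = INPUT_TOKENS_PER_DAY * DAYS / 1_000_000 * in_price
    output_cost = OUTPUT_TOKENS_PER_DAY * DAYS / 1_000_000 * out_price
    print(f"{model}: ${input_cost + output_cost:,.2f}/month "
          f"(input ${input_cost:,.2f}, output ${output_cost:,.2f})")

# claude-sonnet: $180.00/month, gpt-4-turbo: $480.00/month -- both well
# under the $3,000 budget before any prompt caching savings.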
To compare output quality:

Compare Claude Sonnet and GPT-4 Turbo on:
- Instruction following accuracy
- Factual grounding
- Handling of ambiguous queries
Cite relevant benchmarks and studies.

Generate an executive memo summarizing your evaluation:

Write an executive memo recommending a model choice for our
RAG chatbot project. Include:
- Summary of options evaluated
- Key decision criteria
- Recommended choice with rationale
- Cost projections
- Risk considerations

Save this artifact to Studio as your decision record.

Here’s what a typical evaluation artifact might look like:

| Criterion        | Claude Sonnet | GPT-4 Turbo   | Winner |
|------------------|---------------|---------------|--------|
| Context Window   | 200K tokens   | 128K tokens   | Claude |
| Input Price      | $3/1M tokens  | $10/1M tokens | Claude |
| Output Price     | $15/1M tokens | $30/1M tokens | Claude |
| P95 Latency      | ~800ms        | ~1200ms       | Claude |
| Reasoning        | Excellent     | Excellent     | Tie    |
| Function Calling | Good          | Excellent     | GPT-4  |
| Safety           | Excellent     | Good          | Claude |