Build Stacks

Stacks translate your scenario requirements into concrete infrastructure configurations ready for deployment.

Scenario → Stack → Deployment
(What you need) → (What to build) → (How to run it)

A stack answers: “Given my requirements, exactly what should I deploy?”

  1. Start with a Scenario

    Ensure you have a well-defined scenario:

    name: Production Chat
    workload_type: chat
    traffic_profile: high_volume
    slo_requirements:
      p95_latency_ms: 500
      throughput_rps: 200
    budget:
      monthly_limit_usd: 5000
  2. Request Stack Generation

    Ask Lattice to recommend a stack:

    Generate an optimized stack for my Production Chat scenario
    that prioritizes latency while staying within budget.
  3. Review Recommendations

    Lattice will explain its choices, weighing each option against your SLOs and budget (a sketch of this check follows these steps):

    Recommended: Claude Haiku Speed Stack
    Model: Claude 3.5 Haiku
    - P95 latency: ~300ms (meets 500ms SLO)
    - Cost: ~$2,100/month at 200 RPS (within $5K budget)
    Alternative: Claude Sonnet Quality Stack
    - P95 latency: ~600ms (exceeds SLO)
    - Cost: ~$8,400/month (exceeds budget)
  4. Refine if Needed

    Adjust based on priorities:

    I'm willing to go to $6000/month if it improves quality.
    What stack would you recommend then?
  5. Save the Stack

    Save your chosen configuration for deployment reference.
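
Under the hood, the recommendation step amounts to checking each candidate stack's estimated metrics against the scenario's SLOs and budget. A minimal sketch of that comparison, using the Production Chat numbers from the steps above (the candidate figures are Lattice's estimates, not measurements):

# Filter candidate stacks against the scenario's SLO and budget.
scenario = {"p95_latency_ms": 500, "monthly_limit_usd": 5000}

candidates = [
    {"name": "Claude Haiku Speed Stack", "p95_latency_ms": 300, "monthly_cost_usd": 2100},
    {"name": "Claude Sonnet Quality Stack", "p95_latency_ms": 600, "monthly_cost_usd": 8400},
]

def meets_requirements(stack: dict) -> bool:
    return (
        stack["p95_latency_ms"] <= scenario["p95_latency_ms"]
        and stack["monthly_cost_usd"] <= scenario["monthly_limit_usd"]
    )

for stack in candidates:
    verdict = "meets requirements" if meets_requirements(stack) else "rejected"
    print(f"{stack['name']}: {verdict}")
# Claude Haiku Speed Stack: meets requirements
# Claude Sonnet Quality Stack: rejected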

Choose your inference settings:

model:
  provider: anthropic
  model_id: claude-3-5-haiku-20241022
  # Generation parameters
  temperature: 0.3    # Lower for consistency
  max_tokens: 1024    # Limit output length
  top_p: 0.9          # Nucleus sampling
  # Advanced options
  stream: true        # Enable streaming
  stop_sequences: []  # Custom stop tokens
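
These model settings map directly onto a provider SDK call. A minimal sketch, assuming the Anthropic Python SDK (anthropic) with an ANTHROPIC_API_KEY set in the environment; the prompt text is illustrative:

import anthropic

client = anthropic.Anthropic()

# Values taken from the model block above.
response = client.messages.create(
    model="claude-3-5-haiku-20241022",
    max_tokens=1024,
    temperature=0.3,
    top_p=0.9,
    stop_sequences=[],
    messages=[{"role": "user", "content": "Summarize our refund policy."}],
)
print(response.content[0].text)

# With stream: true, use the SDK's streaming helper instead:
with client.messages.stream(
    model="claude-3-5-haiku-20241022",
    max_tokens=1024,
    temperature=0.3,
    messages=[{"role": "user", "content": "Summarize our refund policy."}],
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)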

Select your orchestration layer:

framework:
  # Orchestration choice
  orchestration: langgraph   # langgraph | langchain | custom
  # Observability
  observability: langsmith   # langsmith | phoenix | custom
  # Logging configuration
  logging: structured        # structured | json | plaintext
  log_level: info            # debug | info | warn | error
  # Tracing
  tracing: enabled
  trace_sampling_rate: 0.1   # Sample 10% of requests
| Framework | Best For                          | Complexity |
|-----------|-----------------------------------|------------|
| LangGraph | Agentic workflows, complex state  | Higher     |
| LangChain | Standard chains, quick prototypes | Medium     |
| Custom    | Maximum control, minimal overhead | Variable   |
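
The trace_sampling_rate setting means only a fraction of requests emit full traces. A minimal sketch of that per-request decision; record_trace is a hypothetical stand-in for whatever exporter your observability tool provides:

import random

TRACE_SAMPLING_RATE = 0.1  # From the framework block: trace 10% of requests

def record_trace(payload: dict) -> None:
    """Hypothetical stand-in; replace with your tracer's span/trace API."""
    print("trace recorded:", payload)

def should_trace() -> bool:
    """Decide per request whether to record a full trace."""
    return random.random() < TRACE_SAMPLING_RATE

def handle_request(payload: dict) -> dict:
    if should_trace():
        record_trace(payload)
    # ... run the model call and orchestration here ...
    return {"ok": True}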

Define your infrastructure:

hardware:
  # Cloud provider
  cloud_provider: aws        # aws | gcp | azure
  region: us-east-1
  # Instance configuration
  instance_family: compute   # general | compute | memory
  # Scaling
  auto_scaling: true
  min_instances: 2
  max_instances: 10
  # Cost optimization
  spot_instances: false      # Risky for production
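
To see how min_instances and max_instances bound auto-scaling, here is an illustrative calculation; the per-instance capacity of 25 RPS is an assumption, not a measured figure:

import math

MIN_INSTANCES = 2
MAX_INSTANCES = 10
RPS_PER_INSTANCE = 25  # assumed per-instance capacity; measure your own

def desired_instances(current_rps: float) -> int:
    """Instances needed for the current load, clamped to the configured bounds."""
    needed = math.ceil(current_rps / RPS_PER_INSTANCE)
    return max(MIN_INSTANCES, min(MAX_INSTANCES, needed))

print(desired_instances(200))  # 8 instances at 200 RPS under these assumptions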

Add resilience with provider fallback:

fallback:
  enabled: true
  provider: openai
  model_id: gpt-4-turbo
  # Trigger conditions
  triggers:
    - error_rate > 5%
    - latency_p99 > 2000ms
    - provider_unavailable
  # Automatic retry
  auto_retry: true
  max_retries: 3
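
A minimal sketch of the retry-then-fall-back behavior this configures. The call_anthropic and call_openai helpers are hypothetical placeholders for your real provider clients:

import time

MAX_RETRIES = 3  # matches max_retries above

def call_anthropic(prompt: str) -> str:
    """Hypothetical primary-provider call; replace with your Anthropic client."""
    raise RuntimeError("primary unavailable")

def call_openai(prompt: str) -> str:
    """Hypothetical fallback-provider call; replace with your OpenAI client."""
    return "fallback response"

def generate_with_fallback(prompt: str) -> str:
    # Retry the primary provider with exponential backoff before falling back.
    for attempt in range(MAX_RETRIES):
        try:
            return call_anthropic(prompt)
        except Exception:
            time.sleep(2 ** attempt)
    # Primary exhausted: route the request to the fallback provider.
    return call_openai(prompt)

print(generate_with_fallback("Hello"))  # falls back after three failed attempts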

For latency-critical applications:

name: Speed Stack
description: Optimized for minimum latency
model:
  provider: anthropic
  model_id: claude-3-5-haiku-20241022
  temperature: 0.3
  max_tokens: 512          # Shorter outputs
  stream: true
framework:
  orchestration: custom    # Minimal overhead
  observability: langsmith
  logging: structured
hardware:
  cloud_provider: aws
  region: us-east-1        # Closest to users
  instance_family: compute
  auto_scaling: true
# Expected metrics
estimated_latency_p95: 250ms
estimated_monthly_cost: $2500

For accuracy-critical applications:

name: Quality Stack
description: Optimized for response quality
model:
  provider: anthropic
  model_id: claude-sonnet-4-20250514
  temperature: 0.7
  max_tokens: 4096
  stream: true
framework:
  orchestration: langgraph
  observability: langsmith
  logging: structured
  tracing: enabled
hardware:
  cloud_provider: aws
  region: us-east-1
  instance_family: general
  auto_scaling: true
# Expected metrics
estimated_latency_p95: 800ms
estimated_monthly_cost: $8000

For budget-constrained applications:

name: Cost Stack
description: Optimized for minimal spend
model:
  provider: anthropic
  model_id: claude-3-5-haiku-20241022
  temperature: 0.5
  max_tokens: 1024
  stream: true
framework:
  orchestration: custom
  observability: langsmith
  logging: json
  trace_sampling_rate: 0.01  # Minimal tracing
hardware:
  cloud_provider: aws
  region: us-west-2          # Sometimes cheaper
  instance_family: general
  spot_instances: true       # 60-90% savings
  auto_scaling: true
# Cost optimizations
prompt_caching: enabled      # Up to 90% input savings
batch_processing: enabled    # For non-urgent requests
# Expected metrics
estimated_latency_p95: 400ms
estimated_monthly_cost: $1200
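
Prompt caching works by reusing a large, stable prefix (such as a long system prompt) across requests, so you pay the full input price only when the cache is written. A minimal sketch, assuming the Anthropic Python SDK's cache_control option; the system prompt is illustrative and must exceed the model's minimum cacheable length to actually be cached:

import anthropic

client = anthropic.Anthropic()

LONG_SYSTEM_PROMPT = "You are a support assistant. <policies, FAQs, examples...>"

response = client.messages.create(
    model="claude-3-5-haiku-20241022",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,
            "cache_control": {"type": "ephemeral"},  # cache this prefix
        }
    ],
    messages=[{"role": "user", "content": "Where is my order?"}],
)
print(response.content[0].text)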

For mission-critical applications:

name: HA Stack
description: 99.99% availability target
model:
  provider: anthropic
  model_id: claude-sonnet-4-20250514
  temperature: 0.5
  max_tokens: 2048
  stream: true
fallback:
  enabled: true
  provider: openai
  model_id: gpt-4-turbo
  auto_retry: true
framework:
  orchestration: langgraph
  observability: langsmith
  logging: structured
  tracing: enabled
hardware:
  cloud_provider: aws
  region: us-east-1
  multi_az: true           # Cross-AZ redundancy
  instance_family: general
  auto_scaling: true
  min_instances: 3         # Always available
# Expected metrics
estimated_availability: 99.99%
estimated_latency_p95: 700ms
estimated_monthly_cost: $12000
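
The 99.99% figure comes from layering redundancy: with two independent providers, the naive combined availability is 1 - (1 - p1)(1 - p2), and the target sits below that to allow for correlated failures and failover gaps. A back-of-envelope check with illustrative (assumed) per-provider numbers:

primary = 0.999    # assumed primary provider availability, not a published SLA
fallback = 0.999   # assumed fallback provider availability, not a published SLA

# A total outage requires both to be down at once (assuming independent failures).
combined = 1 - (1 - primary) * (1 - fallback)
print(f"naive combined availability: {combined:.6%}")  # ~99.9999%

# Correlated failures, failover latency, and your own infrastructure pull the
# realistic figure down, which is why the stack targets 99.99% rather than
# the naive product.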

Ask Lattice to compare your options:

Compare Speed Stack vs Quality Stack for my Production Chat
scenario. Show:
- Latency tradeoffs
- Cost implications
- Quality differences
- Risk assessment

Once you have a stack configuration:

  1. Export the configuration as YAML or JSON (see the sketch after this list)
  2. Use it to configure your deployment (Docker, Kubernetes, etc.)
  3. Monitor against SLOs using your observability stack
  4. Iterate based on production data
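
A minimal sketch of step 1, assuming the stack was saved as stack.yaml and that PyYAML is installed; the file names are illustrative:

import json
import yaml  # pip install pyyaml

with open("stack.yaml") as f:
    stack = yaml.safe_load(f)

with open("stack.json", "w") as f:
    json.dump(stack, f, indent=2)

print(f"Exported {stack['name']} for deployment tooling")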