Configuring Spot Instance Strategy

Lattice Lab

When I need to run a multi-day training job on a budget, I want to configure spot instances with proper checkpointing, so I can cut costs by 50-60% without losing training progress to interruptions.

The Challenge

Your TCO analysis shows self-hosted training makes sense at your volume. The next question: how do you actually implement it without burning money on on-demand instances? Spot instances offer 60-70% savings, but the “spot” in spot instances means they can disappear with two minutes’ notice. For a 5-day training run, that’s not a minor inconvenience; without proper planning, it’s a project killer.

The naive approach is to just launch spot instances and hope for the best. The sophisticated approach involves checkpoint configuration, fleet diversification, fallback strategies, and interruption handling—all while keeping the economics favorable. Getting this right saves $100K+ annually on a typical training workload. Getting it wrong means lost compute, failed runs, and frustrated engineers.

This walkthrough shows how to use the Spot Instance Advisor to configure a production-ready spot strategy for training, from workload assessment through checkpoint configuration to fleet composition.

The Starting Point: Your Training Setup

You’re the ML infrastructure lead planning compute for a 70B model training run:

  • Model: 70B parameters, requiring 8x H100 GPUs
  • Duration: 5 days estimated (120 hours)
  • Checkpoint Size: 280 GB (full model state)
  • Budget Pressure: Leadership wants to cut the $57K compute bill
  • Constraint: Can’t afford to lose more than 4 hours of training progress

Your goal: reduce costs by 50%+ while maintaining reliable training completion.

Step 1: Assess Workload Suitability

Open the Spot Instance Advisor from the Tools section in the Studio panel.

[Screenshot: Spot Instance Advisor showing training workload configuration and strategy recommendation]

Select your workload type:

  1. Click Training in the workload selector
  2. The advisor marks training as “Excellent” suitability (95% base score)

Review the suitability reasoning:

Training suitability: 95%
- Highly checkpointable workload
- Can resume from last checkpoint on interruption
- Proven pattern in large-scale training runs
Note: Long duration (>7 days) or large clusters (>64 GPUs)
may reduce suitability due to cumulative interruption exposure.

Your 5-day run and 8-GPU cluster are well within the optimal range.
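
Lattice’s exact scoring is internal, but the logic in that note is easy to picture. A toy sketch, with invented penalty values purely for illustration:

def training_suitability(duration_hours, gpu_count, base=0.95):
    # Illustrative penalties only; the advisor's actual scoring is internal
    score = base
    if duration_hours > 7 * 24:   # long runs accumulate interruption exposure
        score -= 0.10
    if gpu_count > 64:            # large clusters multiply that exposure
        score -= 0.10
    return score

print(training_suitability(duration_hours=120, gpu_count=8))  # 0.95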

Step 2: Configure Risk Tolerance

Consider your constraints:

  • You can’t afford to lose >4 hours of progress
  • But you need significant cost savings
  • You have engineers who can monitor the run

Select your risk tolerance:

  1. Select Medium risk tolerance

The advisor explains what this means:

Medium Risk Tolerance:
- Expects occasional interruptions (1-2 per week)
- Willing to accept 5-15 minute recovery windows
- Prefers spot savings with reliability safety net
- Recommended strategy: Spot with Fallback (90/10)

Step 3: Configure GPU and Training Parameters

Set GPU configuration:

  1. Select Cloud Provider: AWS
  2. Select GPU Type: H100 80GB
  3. Set GPU Count: 8

Set training duration:

  1. Set Duration: 120 hours (5 days)

Set checkpoint configuration:

  1. Set Checkpoint Size: 280 GB
  2. Set Checkpoint Frequency: 30 minutes (your current setting)

The advisor calculates based on your inputs.

Step 4: Review the Strategy Recommendation

The advisor returns its recommendation:

Viability: Recommended

Strategy: Spot with Fallback (90/10)

Fleet Composition:
- Spot Instances: 90% (7.2 GPU equivalents)
- On-Demand Fallback: 10% (0.8 GPU equivalents)

Fallback Behavior:
- Grace period: 120 seconds
- Auto-failover on spot interruption
- Resume from checkpoint automatically

Why not pure spot? The advisor explains:

With medium risk tolerance and 120-hour duration,
pure spot (100%) carries 94% cumulative interruption
probability. Spot with fallback provides 90% of the
savings while guaranteeing compute availability
during checkpoint recovery.
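
You can sanity-check that figure yourself. If each hour carries an independent interruption probability p, the chance of at least one interruption over h hours is 1 - (1 - p)^h. Backing the hourly rate out of the advisor’s 94% (the per-hour rate is an inference, not a published AWS number):

def cumulative_interruption(p_hourly, hours):
    # P(at least one interruption) = 1 - P(zero interruptions every hour)
    return 1 - (1 - p_hourly) ** hours

# Back out the hourly rate implied by the advisor's 94% over 120 hours
p_hourly = 1 - (1 - 0.94) ** (1 / 120)
print(f"implied hourly rate: {p_hourly:.2%}")                        # ~2.32%
print(f"5-day check: {cumulative_interruption(p_hourly, 120):.0%}")  # 94%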

Step 5: Review Savings Analysis

Cost Comparison:

Metric            On-Demand    Spot Strategy
Hourly Cost       $786.56      $314.62
5-Day Cost        $47,194      $18,877
Monthly Cost      $574,186     $229,674
Annual Savings    -            $4.1M

Savings Breakdown:

Base H100 Spot Discount: 65%
Strategy Effectiveness: 90% (due to 10% fallback)
Effective Discount: 58.5%
Your savings: $28,317 on this training run
Annualized (if continuous training): $4.1M
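
The arithmetic behind those figures is straightforward to reproduce. A quick check using the table’s rates, assuming a 730-hour month for annualization:

# Effective discount: base spot discount scaled by the spot share of the fleet
effective_discount = 0.65 * 0.90                 # 58.5%

run_savings = 47_194 - 18_877                    # $28,317 on this 5-day run
hourly_savings = 786.56 - 314.62                 # $471.94/hour saved
annual_savings = hourly_savings * 730 * 12       # ~$4.13M at continuous usage

print(f"{effective_discount:.1%} | ${run_savings:,} | ${annual_savings:,.0f}")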

Step 6: Configure Checkpointing

The advisor provides checkpointing recommendations:

Checkpoint Configuration:

Recommended Interval: 30 minutes
(Matches your setting - optimal for spot interruption warning)
Storage: s3://your-bucket/checkpoints/
- S3 Standard for fast resume
- Lifecycle: Delete after 30 days
Resume Time: ~5 minutes
(280 GB at 1 GB/s restore speed)
Storage Cost: $65/month
(280 GB x multiple versions)
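
The storage estimate checks out against S3 list pricing if roughly ten checkpoint versions are retained at a time (the version count is an inference from the $65 figure; pricing varies by region):

checkpoint_gb = 280
s3_per_gb_month = 0.023      # S3 Standard, us-east-1 list price (varies by region)
retained_versions = 10       # assumption: what the $65 figure implies

print(f"${checkpoint_gb * retained_versions * s3_per_gb_month:.0f}/month")  # ~$64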

Why 30 minutes?

Spot interruption warning: 120 seconds
Checkpoint save time: ~3 minutes for 280 GB
Recovery time: ~5 minutes
With 30-minute intervals, maximum lost work is 30 minutes.
Your constraint (4 hours max loss) is easily met.
Shorter intervals (15 min) add checkpoint overhead.
Longer intervals (60 min) risk more lost work.
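
The trade-off tabulates neatly: shorter intervals bound lost work more tightly but spend more time saving. A quick sketch over the three intervals discussed:

save_minutes = 3  # ~3 minutes to write 280 GB

for interval in (15, 30, 60):
    saves_per_day = 24 * 60 // interval
    worst_case_loss = interval            # interruption just before next save
    sync_overhead = save_minutes / interval
    print(f"{interval:>2} min interval: {saves_per_day} saves/day, "
          f"<= {worst_case_loss} min lost, {sync_overhead:.0%} sync overhead")

The overhead column assumes checkpointing blocks training; the async upload pattern in Step 7 moves most of it off the critical path.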

Step 7: Review Interruption Risk

Risk Assessment:

Metric                             Value
Interruption Frequency             Occasional (1-2 per week)
Warning Time                       120 seconds
Spot Availability                  94% for H100
Expected Interruptions (5 days)    ~3

Mitigation Strategies:

  1. Automatic checkpoint on SIGTERM: Catch the 2-minute warning
  2. Multi-AZ deployment: Spread across availability zones
  3. CloudWatch alerts: Notify on interruption events
  4. Async checkpoint to S3: Don’t block training for checkpoint I/O (see the sketch below)
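
Mitigation 4 is the one that usually needs code. A minimal sketch of the pattern, assuming a trainer whose state_dict() can be snapshotted quickly in memory and a hypothetical upload_to_s3 helper:

import threading

def save_checkpoint_async(trainer, s3_path):
    # Snapshot on the training thread (fast, in-memory), upload in the background
    snapshot = trainer.state_dict()
    worker = threading.Thread(
        target=upload_to_s3,   # hypothetical helper that streams the snapshot to S3
        args=(snapshot, s3_path),
        daemon=True,
    )
    worker.start()
    return worker  # join() before exiting so an in-flight upload isn't lost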

Step 8: Review Fleet Configuration

Diversification Strategy: capacity-optimized

Instance Types: [p5.48xlarge]
- H100 instances for your GPU requirement
Availability Zones: [us-east-1a, us-east-1b, us-west-2a]
- Multi-AZ reduces correlated interruptions
Spot Pool Selection: capacity-optimized
- Prioritizes pools with highest availability
- Lower interruption rate than lowest-price

Why capacity-optimized over lowest-price?

For training workloads, consistency matters more than
marginal price differences. Capacity-optimized selects
from pools with the most available capacity, reducing
interruption frequency by 20-40% vs lowest-price strategy.

Step 9: Export and Implement

Save as Artifact:

  1. Click Save as Artifact in the modal header
  2. The artifact captures the complete recommendation

Copy Config: Click Copy Config to get the fleet configuration JSON for your infrastructure automation.
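
If that automation is Python, the copied JSON can go straight to the EC2 API. A sketch using boto3 (spot-fleet-config.json is an assumed filename for the config you copied; credentials and region come from your environment):

import json

import boto3

ec2 = boto3.client("ec2")

# spot-fleet-config.json: the fleet configuration copied from the advisor
with open("spot-fleet-config.json") as f:
    config = json.load(f)

response = ec2.request_spot_fleet(SpotFleetRequestConfig=config)
print(response["SpotFleetRequestId"])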

Step 10: Implement the Strategy

With your configuration exported, implement in your training infrastructure:

1. Configure checkpoint handler:

import signal
import sys

def checkpoint_on_sigterm(signum, frame):
    # The 2-minute spot warning arrives as SIGTERM; save and exit cleanly
    print("Received SIGTERM - saving checkpoint")
    trainer.save_checkpoint("s3://checkpoints/emergency/")
    sys.exit(0)

signal.signal(signal.SIGTERM, checkpoint_on_sigterm)
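
SIGTERM delivery depends on how your instances are orchestrated, so a common belt-and-braces addition is polling the EC2 instance metadata service, which starts returning 200 on the spot/instance-action path once a reclaim is scheduled. A sketch (IMDSv1 shown; IMDSv2 additionally requires a session token header):

import requests

IMDS_URL = "http://169.254.169.254/latest/meta-data/spot/instance-action"

def spot_interruption_pending() -> bool:
    # Returns 200 with an action document once a reclaim is scheduled
    try:
        return requests.get(IMDS_URL, timeout=1).status_code == 200
    except requests.RequestException:
        return False  # endpoint unreachable: treat as no notice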

2. Configure AWS Spot Fleet:

SpotFleetRequestConfig:
  AllocationStrategy: capacityOptimized
  TargetCapacity: 8
  OnDemandTargetCapacity: 1  # 10% fallback
  LaunchTemplateConfigs:
    - LaunchTemplateSpecification:
        LaunchTemplateId: lt-xxx
      Overrides:
        - InstanceType: p5.48xlarge
          AvailabilityZone: us-east-1a
        - InstanceType: p5.48xlarge
          AvailabilityZone: us-east-1b

3. Configure checkpoint schedule:

# In your training loop
if step % checkpoint_interval == 0:
    trainer.save_checkpoint(f"s3://checkpoints/step-{step}/")
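
The other half of the loop is resuming after a failover. On startup, find the newest checkpoint and continue from it; find_latest_checkpoint, load_checkpoint, and global_step here stand in for whatever your trainer actually exposes:

# On process start (fresh launch or failover after an interruption)
latest = find_latest_checkpoint("s3://checkpoints/")  # hypothetical helper
if latest is not None:
    trainer.load_checkpoint(latest)
    start_step = trainer.global_step + 1
else:
    start_step = 0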

Real-World Patterns

Pattern: Validating Before Long Runs

Before committing to a 5-day run:

  1. Run a 4-hour test with spot strategy
  2. Verify checkpoint save/restore works
  3. Confirm interruption handling triggers correctly (see the drill below)
  4. Monitor actual interruption frequency vs predicted
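
For step 3 you don’t need to wait for a real interruption; deliver the same signal AWS would. A sketch, where training_pid is hypothetical (e.g., recorded by your launcher):

import os
import signal

def simulate_interruption(training_pid: int) -> None:
    # Deliver the same signal AWS sends ~120 s before reclaiming the instance;
    # the Step 10 handler should save an emergency checkpoint and exit cleanly
    os.kill(training_pid, signal.SIGTERM)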

Pattern: Adjusting for High-Priority Runs

For critical deadline-driven training:

  1. Increase fallback percentage (70/30 mixed fleet)
  2. Reduce checkpoint interval to 15 minutes
  3. Accept 40% savings instead of 58% for more reliability

Pattern: Batch Inference Optimization

For embedding generation or batch inference:

  1. Select “Batch” workload type (98% suitability)
  2. Choose “High” risk tolerance (stateless, retryable)
  3. Use “Pure Spot” strategy (100%)
  4. No checkpointing needed; just retry failed batches (see the sketch below)
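
The retry loop stays trivial because batches are independent; process_batch is a hypothetical worker that returns True on success:

def run_with_retries(batches, max_attempts=3):
    # Each batch is independent, so failures are simply re-run
    pending = list(batches)
    for _ in range(max_attempts):
        pending = [b for b in pending if not process_batch(b)]  # keep failures
        if not pending:
            break
    return pending  # anything left failed every attempt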

What You’ve Accomplished

You now have a production-ready spot strategy:

  • Assessed workload suitability (95% for training)
  • Configured fleet composition (90/10 spot/on-demand)
  • Set checkpoint intervals optimized for interruption recovery
  • Estimated savings: $28K per 5-day run, $4.1M annually

What’s Next

Your spot strategy feeds into other Lattice tools:

  • TCO Calculator: Update self-hosted costs with actual spot pricing
  • Stack Configuration: Apply spot fleet settings to infrastructure stack
  • Training Scenarios: Save spot configuration as part of training scenario
  • Memory Calculator: Verify checkpoint memory doesn’t exceed GPU capacity

Spot Instance Advisor is available in Lattice. Cut your training costs in half without cutting corners.

Ready to Try Lattice?

Get lifetime access to Lattice for confident AI infrastructure decisions.

Get Lattice for $99