
Selecting Cloud GPU Instances


When I need to provision GPU instances for inference, I want to systematically compare options across clouds, so I can select the most cost-effective instance that meets my performance requirements.

The Challenge

You need to provision GPU instances for a new inference deployment. Your requirements: enough memory to run Llama 70B quantized, enough throughput to serve 100 requests/minute, and cost under $10K/month. Simple requirements, but the cloud pricing pages make it anything but simple.

AWS offers p5, p4d, p4de, g5, g6, and g4dn instances—each with different GPU models, counts, and pricing. GCP has a2 and a3 families with different naming conventions. Azure uses yet another scheme. And that’s just NVIDIA GPUs. What about TPUs for training? Trainium for cost-optimized inference? The comparison matrix becomes unmanageable.

This walkthrough shows how to use the Accelerator Registry to navigate cloud GPU options, compare specifications, and select the optimal instance for your workload.

The Starting Point: Your Requirements

You’re the platform lead planning inference infrastructure:

  • Model: Llama 70B with AWQ 4-bit quantization (~17.5 GB)
  • Serving: vLLM with tensor parallelism
  • Throughput: 100 requests/minute
  • Latency: P95 < 2 seconds
  • Budget: $10K/month maximum
  • Cloud: Flexible between AWS, GCP, Azure

Your goal: identify the most cost-effective instance that meets your requirements.
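
It can help to pin these requirements down as data before touching any pricing pages. A minimal sketch in Python (the class and field names are illustrative, not part of Lattice):

from dataclasses import dataclass

@dataclass
class InferenceRequirements:
    """The requirements above, captured as plain data for the checks that follow."""
    model_weights_gb: float = 17.5       # Llama 70B, AWQ 4-bit, as stated above
    requests_per_minute: float = 100
    avg_output_tokens: int = 300          # planning assumption used later in the throughput math
    p95_latency_s: float = 2.0
    monthly_budget_usd: float = 10_000

reqs = InferenceRequirements()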

Step 1: Open the Registry

Click the Registries icon in the logo bar of the Sources panel. The Registry Viewer modal opens.

Accelerator Registry showing instance comparison

Navigate to Accelerators:

  1. Click the Accelerators tab in the modal header
  2. The instance table loads with all cloud GPU/TPU instances

Step 2: Filter by Requirements

Your quantized model is 17.5 GB. With KV cache and overhead, you need ~24 GB per GPU minimum. Filter to relevant instances.

Search for specific accelerators:

  1. Type “A100” in the search box
  2. The table filters to A100 instances
  3. Clear search and try “L4”

Note the per-GPU memory for each type:

  • H100 80GB: 80 GB per GPU
  • A100 80GB: 80 GB per GPU
  • A100 40GB: 40 GB per GPU
  • L4: 24 GB per GPU

L4 at 24 GB is tight but workable for your 17.5 GB quantized model.
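
As a quick sanity check, the fit question reduces to weights plus KV cache plus runtime overhead versus per-GPU memory. A minimal sketch, using the 17.5 GB figure above and rough allowances for KV cache (~5 GB) and runtime overhead (~1.5 GB) that are planning assumptions, not Registry values:

def fits_on_gpu(weights_gb, gpu_memory_gb, kv_cache_gb=5.0, overhead_gb=1.5):
    # Rough per-GPU check: weights + KV cache + runtime overhead must fit in GPU memory.
    return weights_gb + kv_cache_gb + overhead_gb <= gpu_memory_gb

for name, mem_gb in [("H100 80GB", 80), ("A100 80GB", 80), ("A100 40GB", 40), ("L4", 24)]:
    print(name, "fits" if fits_on_gpu(17.5, mem_gb) else "does not fit")

With these allowances the L4 lands exactly at its 24 GB ceiling, which is why it is workable but leaves no headroom for longer contexts or larger batches.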

Step 3: Compare Instance Classes

Let’s compare three instance classes that could work:

Option A: Single L4 (g6.xlarge)

  • 1x L4, 24 GB memory, $0.81/hour
  • Monthly: $591

Option B: Single A100 40GB (a2-highgpu-1g)

  • 1x A100 40GB, 40 GB memory, $3.67/hour
  • Monthly: $2,679

Option C: A100 80GB (p4de)

  • Only available as 8-GPU instances (~$41/hour)
  • Too expensive for single-model inference
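
The monthly figures above are simply the hourly rate times a 730-hour month. A small helper makes that convention explicit (the rates are the on-demand prices quoted above and will drift over time):

HOURS_PER_MONTH = 730  # the monthly-hour convention used throughout this walkthrough

def monthly_cost(hourly_usd, hours=HOURS_PER_MONTH):
    return hourly_usd * hours

for name, hourly in [("g6.xlarge (1x L4)", 0.81),
                     ("a2-highgpu-1g (1x A100 40GB)", 3.67),
                     ("p4de.24xlarge (8x A100 80GB)", 41.00)]:
    print(f"{name}: ${monthly_cost(hourly):,.0f}/month")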

Step 4: Validate Throughput Capability

L4 is cheap, but can it hit 100 requests/minute?

Throughput calculation:

100 requests/minute = 1.67 requests/second
Average output: 300 tokens
Required throughput: 1.67 x 300 = 500 tokens/second
Single L4 capacity: ~20 tokens/second
GPUs needed: 500 / 20 = 25 L4 GPUs
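
The same arithmetic as a helper function, so the per-GPU throughput assumption is easy to swap out (the ~20 tokens/second L4 figure above is a rough planning number, not a measured benchmark):

import math

def gpus_needed(requests_per_minute, avg_output_tokens, tokens_per_sec_per_gpu):
    # Tokens that must be generated per minute vs. tokens one GPU generates per minute.
    tokens_per_minute_required = requests_per_minute * avg_output_tokens
    tokens_per_minute_per_gpu = tokens_per_sec_per_gpu * 60
    return math.ceil(tokens_per_minute_required / tokens_per_minute_per_gpu)

print(gpus_needed(100, 300, 20))  # L4 at ~20 tok/s -> 25 GPUs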

A single L4 won’t meet throughput. Let’s check A100.

Step 5: Analyze A100 Instance

A100 Inference Performance:

  • ~50-60 tokens/second for 70B quantized
  • 500 / 55 ≈ 9 A100 GPUs needed

Pricing (a2-highgpu-1g on GCP):

  Tier         Hourly    Monthly (730h)
  On-Demand    $3.67     $2,679
  Spot         $1.10     $803

9x A100 monthly cost:

  • On-Demand: 9 x $2,679 = $24,111 (over budget)
  • Spot: 9 x $803 = $7,227 (under budget!)
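
The budget check in code, using the per-instance monthly figures from the table above (spot prices fluctuate, so treat the spot number as indicative):

A2_HIGHGPU_1G_MONTHLY = {"on-demand": 2_679, "spot": 803}  # per instance, from the table above
GPU_COUNT = 9
BUDGET_USD = 10_000

for tier, per_instance in A2_HIGHGPU_1G_MONTHLY.items():
    total = GPU_COUNT * per_instance
    verdict = "under budget" if total <= BUDGET_USD else "over budget"
    print(f"{tier}: ${total:,}/month ({verdict})")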

Step 6: Finalize the Configuration

Recommended Configuration:

Instance: a2-highgpu-1g (GCP)
Count: 9 instances
GPU: 9x A100 40GB (one per instance, serving independently)
Pricing: Spot
Monthly Cost: $7,227
Throughput: ~500 tokens/second (9 x 55)
Latency: ~300ms (single request)

Alternatives considered:

  Option            Instances   GPUs             Monthly Cost   Fits Budget
  L4 spot           25          25x L4           $5,850         Yes, but complex
  A100 spot         9           9x A100 40GB     $7,227         Yes
  A100 8-GPU spot   3           24x A100 40GB    $19,314        No
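
The same comparison as data, filtering to options that fit the budget and sorting by cost (the cheapest option on paper, 25 L4 nodes, carries the operational complexity flagged in the table):

options = [
    # (name, instance_count, gpus, monthly_cost_usd)
    ("L4 spot",          25, "25x L4",          5_850),
    ("A100 spot",         9, "9x A100 40GB",    7_227),
    ("A100 8-GPU spot",   3, "24x A100 40GB",  19_314),
]

BUDGET_USD = 10_000
affordable = sorted((o for o in options if o[3] <= BUDGET_USD), key=lambda o: o[3])
for name, count, gpus, cost in affordable:
    print(f"{name}: {count} instances ({gpus}) at ${cost:,}/month")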

Step 7: Export and Document

Save as Artifact: Click Save as Artifact in the detail panel to capture your analysis with configuration, rationale, and trade-offs documented.

Real-World Patterns

Pattern: Balancing Latency and Cost

If P95 latency is critical:

  1. Filter for H100 (roughly 2-3x faster than A100 for LLM inference)
  2. Accept higher hourly cost for fewer instances
  3. Consider tensor parallelism across multi-GPU instances for single-request speedup

Pattern: Training vs Inference Instances

For training workloads:

  1. Prioritize memory bandwidth (A100/H100 HBM)
  2. Consider interconnect for multi-GPU (NVLink > PCIe)
  3. Look at total memory for large batch sizes

For inference:

  1. Prioritize INT8/FP16 TOPS over memory bandwidth
  2. L4 often best cost/performance for smaller models
  3. A10G is a middle ground between L4 and A100

Pattern: Cross-Cloud Arbitrage

Same GPU, different prices:

  1. Sort by hourly rate to find cheapest option
  2. Check spot availability across regions
  3. GCP often cheaper for A100
  4. AWS often cheaper for H100
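
In code, the arbitrage check is just a sort over a per-GPU price sheet. The non-GCP rates below are placeholders, not quotes; pull current numbers from the Registry before deciding:

a100_offers = [
    {"cloud": "GCP",   "instance": "a2-highgpu-1g",  "usd_per_gpu_hour": 3.67},  # from above
    {"cloud": "AWS",   "instance": "p4d.24xlarge",   "usd_per_gpu_hour": 4.10},  # placeholder
    {"cloud": "Azure", "instance": "NC A100 v4",     "usd_per_gpu_hour": 3.90},  # placeholder
]

for offer in sorted(a100_offers, key=lambda o: o["usd_per_gpu_hour"]):
    print(f'{offer["cloud"]:<6} {offer["instance"]:<15} ${offer["usd_per_gpu_hour"]:.2f}/GPU-hour')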

What You’ve Accomplished

You now have a systematic approach to GPU selection:

  • Filtered instances by memory requirements
  • Compared throughput capability against requirements
  • Calculated monthly costs with different pricing tiers
  • Identified optimal configuration within budget

What’s Next

Your instance selection flows into other Lattice tools:

  • Memory Calculator: Verify model fits with detailed memory breakdown
  • Spot Instance Advisor: Configure spot strategy for selected instances
  • TCO Calculator: Include selected instance in total cost analysis
  • Stack Configuration: Apply instance type to infrastructure stack

The Accelerator Registry is available in Lattice. Make hardware decisions with current, comparable data.

Ready to Try Lattice?

Get lifetime access to Lattice for confident AI infrastructure decisions.

Get Lattice for $99