How LLM Inference Actually Works
LLM inference has two distinct phases, each with different computational characteristics. Understanding this split is the foundation of every optimization decision.
Phase 1: Prefill
When a request arrives, the model processes all input tokens in parallel to compute their key-value (KV) pairs. This is the prefill phase. It is compute-intensive because attention is O(n²) in input length: every input token attends to every other input token. The GPU's tensor cores are fully utilized here.
The output of prefill is a KV cache: a set of key and value tensors for each layer and attention head, representing the "memory" of the input sequence. These tensors are stored in VRAM and reused for every subsequent output token.
Phase 2: Decode
After prefill, the model generates output tokens one at a time. Each decode step produces exactly one token. The new token attends to all previous tokens via the KV cache, then gets appended to it. This phase is memory-bandwidth bound: the GPU reads the entire model's weights from VRAM for each token but only does a small amount of computation per read.
This is the fundamental asymmetry. Prefill is compute-bound and parallelizable. Decode is sequential and memory-bound. The GPU spends most of its decode time waiting for data to arrive from VRAM, not doing math.
What this means in practice
For a 7B FP16 model, the GPU reads ~14 GB of weights from VRAM on every decode step. An A100 with 2 TB/s HBM2e bandwidth can do this in ~7ms. That puts a hard ceiling of ~140 tokens/second for a single sequence, regardless of how many CUDA cores you have. The only way to improve decode throughput is to process multiple sequences at once (batching), amortizing the weight-read cost across many tokens.
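The ceiling above can be reproduced in a few lines. The hardware numbers are the rough figures used in this text (A100 HBM2e at ~2 TB/s), not measurements:

```python
# Back-of-envelope decode ceiling: each decode step must stream all model
# weights from VRAM, so memory bandwidth sets the single-sequence token rate.
def decode_ceiling_tps(param_count: float, bytes_per_param: float,
                       bandwidth_gbps: float) -> float:
    """Upper bound on single-sequence tokens/second for bandwidth-bound decode."""
    weight_bytes = param_count * bytes_per_param        # bytes read per token
    seconds_per_token = weight_bytes / (bandwidth_gbps * 1e9)
    return 1.0 / seconds_per_token

# 7B FP16 model on an A100 (~2,000 GB/s):
print(f"{decode_ceiling_tps(7e9, 2, 2000):.0f} tokens/s ceiling")  # 143 tokens/s
```

Note that no term in the formula involves FLOPS: a faster compute unit would not raise this ceiling, only faster memory or a smaller weight read would.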
Training vs inference: a comparison
| Dimension | Training | Inference |
|---|---|---|
| Computation | Forward + backward pass | Forward pass only |
| Memory | Weights + gradients + optimizer states (~84 GB for 7B) | Weights + KV cache (~15-22 GB for 7B) |
| Bottleneck | Compute-bound (FLOPS) | Memory bandwidth-bound |
| Execution | Large fixed batches | Small, latency-sensitive requests |
| Parallelism | Full sequence processed at once | Sequential token generation |
| Fault tolerance | Checkpointing allows recovery | Failures impact live users |
The Metrics That Matter
Six metrics define LLM serving performance. Each measures a different aspect of the system, and optimizing for one often hurts another.
Time to First Token (TTFT)
The time from request arrival to the first output token. This is the user-perceived "thinking" time. For streaming chat applications, TTFT is the most important UX metric: everything after the first token feels like continuous output. Typical targets: <300ms for real-time chat, <1s for document QA, <100ms for code completion.
TTFT is dominated by prefill time plus queue wait. If 50 requests arrive at once and the server can only prefill one at a time, the 50th request waits for 49 prefills before its own starts.
Inter-Token Latency (ITL) and Time Per Output Token (TPOT)
ITL is the time between consecutive output tokens for a single request. TPOT is the average time per output token across the full generation. For a streaming chat, ITL determines reading speed. If ITL exceeds ~80ms, the output feels choppy.
End-to-End Latency
Total time from request arrival to final token. This is TTFT + (output_tokens x TPOT).
For non-streaming (batch) requests, this is the only latency metric that matters.
Example: TTFT = 300ms, TPOT = 20ms, 200 output tokens
Total latency = 300 + (200 x 20) = 4,300ms
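The arithmetic above as a helper, using the same figures:

```python
# End-to-end latency per the formula in the text: TTFT + output_tokens x TPOT.
def e2e_latency_ms(ttft_ms: float, tpot_ms: float, output_tokens: int) -> float:
    return ttft_ms + output_tokens * tpot_ms

print(e2e_latency_ms(300, 20, 200))  # 4300.0
```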
Throughput (tokens/second and requests/second)
Total tokens generated across all concurrent requests per second. This is the metric for system capacity and cost efficiency. Higher throughput = more users per GPU = lower cost per request.
Goodput
The subset of throughput that produces useful output. If a request times out after generating 200 of 256 tokens, those 200 tokens count toward throughput but zero toward goodput. Goodput penalizes wasted computation and is a better measure of real system value.
The TTFT vs Throughput Trade-off
This is the central tension in LLM serving. Larger batch sizes increase throughput (more tokens per weight-read) but increase TTFT (new requests wait longer to enter the batch). The sweet spot depends on your workload:
| Batch size | TTFT | Throughput | Use case |
|---|---|---|---|
| Small (1-4) | ~100ms | ~500 tok/s | Interactive chat, code completion |
| Medium (8-32) | ~250ms | ~2,000 tok/s | Production chat (typical sweet spot) |
| Large (64-512) | ~800ms+ | ~4,000 tok/s | Batch processing, async workloads |
Continuous batching (used by vLLM and TGI) dynamically adjusts this: new requests join in-progress batches without waiting for the entire batch to finish, getting better throughput without sacrificing TTFT as badly.
Latency degrades under load
LLM latency is not a constant. It degrades nonlinearly with concurrent users:
1 concurrent user: TTFT = 200ms
10 concurrent users: TTFT = 400ms
50 concurrent users: TTFT = 2,000ms
Always test at peak expected concurrency, not average. The p99 latency under load is what determines whether your SLO holds.
Memory Is the Bottleneck, Not Compute
The single most important concept in LLM serving: your throughput ceiling is set by how many KV caches fit in the VRAM left over after loading model weights. Not by GPU FLOPS, not by CUDA core count. VRAM.
Model weight memory
The memory for model weights scales linearly with parameter count and inversely with quantization level:
| Precision | Bytes/param | 7B model | 35B model | 70B model |
|---|---|---|---|---|
| FP32 | 4 | 28 GB | 140 GB | 280 GB |
| FP16 / BF16 | 2 | 14 GB | 70 GB | 140 GB |
| INT8 | 1 | 7 GB | 35 GB | 70 GB |
| INT4 (GPTQ/AWQ) | 0.5 | 3.5 GB | 17.5 GB | 35 GB |
| FP8 | 1 | 7 GB | 35 GB | 70 GB |
KV cache memory
Each sequence in flight needs a KV cache. The size depends on model architecture and sequence length:
KV cache per sequence = 2 x num_layers x num_heads x head_dim x seq_len x bytes_per_element
Example (Llama-2 7B: 32 layers, 32 heads, 128 head_dim, 2048 seq_len, FP16):
= 2 x 32 x 32 x 128 x 2048 x 2 bytes
= 1.07 GB per sequence
On an A100 80GB with a 7B FP16 model (~14 GB weights + ~3 GB overhead), you have roughly 63 GB for KV cache. At 1.07 GB per sequence, that is about 59 concurrent sequences. With INT4 quantization, the model drops to ~3.5 GB, freeing 10.5 GB more. That is 10 more concurrent sequences, which directly translates to higher throughput.
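The sizing above can be checked directly. Architecture numbers are Llama-2 7B; the VRAM figures are the rough assumptions from the text (small differences from the prose come from decimal-GB rounding):

```python
# KV cache size per the formula above, plus the resulting concurrency ceiling.
def kv_cache_bytes(layers: int, heads: int, head_dim: int,
                   seq_len: int, bytes_per_elem: int = 2) -> int:
    # factor of 2 covers keys AND values
    return 2 * layers * heads * head_dim * seq_len * bytes_per_elem

per_seq = kv_cache_bytes(32, 32, 128, 2048)   # Llama-2 7B, FP16
print(per_seq / 1e9)                          # ~1.07 GB per sequence

free_vram_gb = 80 - 14 - 3                    # A100 80GB minus weights and overhead
max_sequences = int(free_vram_gb * 1e9 // per_seq)
print(max_sequences)                          # 58 concurrent sequences
```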
The VRAM budget
Total VRAM = Model weights + KV cache + Activations + Framework overhead
Practical rule of thumb:
VRAM needed = (params x bytes_per_param x 1.2)
The 1.2 multiplier covers activations, CUDA context, and framework buffers.
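The rule of thumb as a helper (an estimate, not a guarantee — long contexts or large batches need explicit KV cache budgeting on top):

```python
# VRAM estimate per the 1.2 rule of thumb above.
def vram_needed_gb(params: float, bytes_per_param: float) -> float:
    return params * bytes_per_param * 1.2 / 1e9

print(vram_needed_gb(7e9, 2))  # 16.8 -- a 7B FP16 model wants ~17 GB, not just 14
```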
HBM vs DDR: why GPUs win
Since LLM decode is memory-bandwidth bound, the type of memory matters enormously.
| Feature | CPU (DDR5) | GPU (HBM2e / HBM3) |
|---|---|---|
| Bandwidth | ~50-100 GB/s | ~2 TB/s (A100) / ~3.35 TB/s (H100) |
| Capacity | 32 GB - 2+ TB | 16 - 80 GB |
| Latency | ~70-100 ns | ~100-200 ns |
| Design goal | Optimized for serial latency | Optimized for parallel throughput |
Reading 14 GB of weights (7B FP16) takes ~7ms on A100 HBM vs ~140ms on server DDR5. That is a 20x gap. This is why running LLMs on CPU is so much slower, even with more total RAM.
Quantization: The First Lever to Pull
Quantization represents model weights (and sometimes activations) using fewer bits. It is the highest-ROI optimization for LLM serving: minimal accuracy loss, massive memory and throughput gains, and it can be applied post-training in minutes.
The quantization landscape
| Method | Bits | Type | Key idea |
|---|---|---|---|
| GPTQ | 4 | PTQ, weight-only | Layer-wise optimal quantization using second-order (Hessian) information |
| AWQ | 4 | PTQ, weight-only | Activation-aware: identifies the 1% of weights that matter most and protects them |
| FP8 | 8 | Full (weights + activations) | Hardware-native on H100; near-zero accuracy loss vs FP16 |
| bitsandbytes LLM.int8() | 8 | Mixed precision | Outlier channels stay FP16; rest quantized to INT8 |
| GGUF (llama.cpp) | 4-8 | Weight-only | CPU-friendly format for edge and desktop deployment |
PTQ vs QAT
Post-Training Quantization (PTQ) quantizes a pre-trained model without retraining. GPTQ and AWQ both work this way. Download a pre-quantized checkpoint and serve it immediately. This is what you use in production 95% of the time.
Quantization-Aware Training (QAT) simulates quantization during training, allowing the model to learn to compensate for precision loss. Better INT4 accuracy but requires full training compute. Use QAT only when PTQ quality is insufficient for your specific task.
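To make the storage math concrete, here is a minimal sketch of weight-only quantization using symmetric absmax round-to-nearest INT8. This is far cruder than GPTQ or AWQ, which use calibration data and per-group scales to decide how to round, but the memory accounting is identical: one byte per weight plus a scale factor.

```python
# Minimal weight-only PTQ sketch: symmetric absmax round-to-nearest INT8.
# Not GPTQ/AWQ -- just the simplest possible scheme for illustration.
def quantize_absmax_int8(weights):
    scale = max(abs(w) for w in weights) / 127      # map largest weight to +/-127
    q = [round(w / scale) for w in weights]         # stored as INT8 (1 byte each)
    return q, scale

def dequantize(q, scale):
    return [x * scale for x in q]

w = [0.12, -0.54, 0.33, 1.27, -0.98]
q, s = quantize_absmax_int8(w)
print(q)                   # [12, -54, 33, 127, -98]
print(dequantize(q, s))    # close to w, with small rounding error
```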
Optimization order of operations
Apply optimizations in ROI order. Don't jump to complex techniques before exhausting simple ones:
- Quantization (INT8/INT4) - 2-4x memory reduction, minimal quality impact
- Continuous batching - 3-5x throughput, no quality change
- KV cache optimization - reduce TTFT for shared prefixes
- Smaller model + fine-tuning - if quantized large model does not meet SLO
- Distillation - if you need smaller than any available checkpoint
- Structured pruning - last resort, highest quality risk
Pruning and Distillation
Pruning
Pruning removes weights that contribute minimally to output quality. Two categories:
- Unstructured pruning: Zeros out individual weights with smallest magnitude. Can achieve 50-90% sparsity, but requires sparse tensor hardware support to see actual speedups. Without hardware support, a 50% sparse model is NOT 2x faster. SparseGPT achieves 50-60% sparsity on large models with minimal quality loss.
- Structured pruning: Removes entire attention heads, FFN neurons, or transformer layers. Results in a genuinely smaller dense model with immediate speedups on standard hardware. More impactful on accuracy since entire functional units are removed.
Distillation
Distillation trains a smaller "student" model to mimic a larger "teacher" model. The student learns from the teacher's soft probability distributions, not just hard labels:
Loss = alpha x CrossEntropy(student, ground_truth)
     + (1 - alpha) x T^2 x KL_Divergence(softmax(student_logits / T), softmax(teacher_logits / T))
Where T = temperature (higher = softer distributions = richer information). Both student and teacher logits are softened by the same T; the T^2 factor keeps the gradients of the two terms on a comparable scale.
The teacher's soft probabilities carry richer information than hard labels. If the teacher assigns 5% probability to a wrong answer, that tells the student something about the similarity structure of the output space.
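The loss can be sketched in pure Python to make the temperature-softening explicit. In practice you would use batched framework ops (e.g. PyTorch's KL divergence on log-probabilities); the logit values below are made up for illustration:

```python
import math

def softmax(logits, T=1.0):
    # subtract max for numerical stability
    m = max(logits)
    exps = [math.exp((z - m) / T) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distill_loss(student_logits, teacher_logits, true_idx, alpha=0.5, T=2.0):
    # hard-label term: cross-entropy against the ground-truth class
    ce = -math.log(softmax(student_logits)[true_idx])
    # soft-label term: KL(teacher || student), both softened by T
    p_t = softmax(teacher_logits, T)
    p_s = softmax(student_logits, T)
    kl = sum(pt * math.log(pt / ps) for pt, ps in zip(p_t, p_s))
    return alpha * ce + (1 - alpha) * T * T * kl

print(distill_loss([2.0, 0.5, -1.0], [3.0, 1.0, -2.0], true_idx=0))
```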
Landmark examples: DistilBERT (40% smaller, 60% faster, 97% of BERT performance), TinyLlama (1.1B from Llama-2), Microsoft Phi series (frontier knowledge in 3B parameters).
Combining techniques
A typical compression pipeline chains all three:

70B FP16 teacher
    | Distillation
    v
7B FP16 student (domain-specific)
    | Quantization (AWQ INT4)
    v
7B INT4 student (3.5 GB VRAM - fits on consumer GPU)
    | Structured pruning (10% of attention heads)
    v
6.3B INT4 (3.15 GB VRAM)
| Dimension | Pruning | Quantization | Distillation |
|---|---|---|---|
| What changes | Removes weights/units | Reduces bit precision | Trains new smaller model |
| Memory reduction | Moderate | 2-8x | 2-10x |
| Accuracy impact | Moderate-High | Low (INT8~0%, INT4~1-3%) | Low (DistilBERT 97%) |
| Training required | Fine-tuning recommended | No (PTQ) | Yes (days-weeks) |
| Time to implement | Hours-days | Minutes-hours | Days-weeks |
| Best use case | Over-parameterized models | Production budget constraints | Purpose-built deployable models |
Batching Strategies
Since decode is bandwidth-bound, the only way to increase throughput is to process multiple sequences per weight-read. Batching is how you do this. Three strategies, each with different trade-offs.
Static batching
Collect N requests, process them as a batch, return all results when the longest sequence finishes. Simple but wasteful: short sequences sit idle while waiting for long ones. GPU utilization is low. TTFT is high because new requests must wait for the current batch to complete.
Continuous batching
What vLLM, TGI, and modern serving frameworks use. New requests join the batch immediately without waiting for existing requests to finish. When a sequence completes, its slot opens for a new request. The batch size dynamically adjusts to demand.
Better GPU utilization, lower TTFT, higher throughput. The trade-off is implementation complexity: the framework must manage variable-length sequences in the same batch, which requires careful memory management. This is vLLM's PagedAttention innovation: it manages KV cache memory like an operating system manages virtual memory, allocating and freeing cache blocks dynamically.
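The admission logic can be sketched with a toy scheduler. There is no real model here — each "decode step" just decrements a remaining-token counter — but it shows the key property: slots free up and refill mid-flight instead of draining the whole batch:

```python
from collections import deque

def continuous_batching(requests, max_batch=4):
    """requests: list of (request_id, tokens_to_generate). Returns finish order."""
    waiting = deque(requests)
    running = {}                       # request_id -> tokens remaining
    finished = []
    while waiting or running:
        # admit new requests into any free slots (no waiting for batch drain)
        while waiting and len(running) < max_batch:
            rid, n = waiting.popleft()
            running[rid] = n
        # one decode step: every running sequence emits one token
        for rid in list(running):
            running[rid] -= 1
            if running[rid] == 0:
                del running[rid]       # slot frees immediately
                finished.append(rid)
    return finished

print(continuous_batching([("a", 2), ("b", 5), ("c", 1), ("d", 3)], max_batch=2))
# ['a', 'c', 'b', 'd'] -- short requests finish without waiting for long ones
```

Under static batching, "c" (1 token) would have waited for "b" (5 tokens) to finish before even starting.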
Chunked prefill
When a new request arrives in continuous batching, its prefill (compute-heavy) competes with ongoing decode steps (memory-heavy) for GPU resources. A long prefill can stall all decodes in the batch.
Chunked prefill splits the prefill into smaller chunks that interleave with decode steps. This makes both TTFT and ITL "tunable" by controlling chunk size: smaller chunks give smoother decode latency but higher TTFT; larger chunks give faster prefill but spikier decode. vLLM supports this with `--enable-chunked-prefill`.
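A toy schedule makes the interleaving visible. The token counts are arbitrary illustration values:

```python
# Chunked prefill illustration: a long prefill is split into chunks so
# pending decode steps interleave with it instead of stalling behind it.
def schedule_chunked_prefill(prefill_tokens: int, chunk_size: int, decode_steps: int):
    """Returns the interleaved schedule as a list of step labels."""
    schedule = []
    remaining = prefill_tokens
    d = 0
    while remaining > 0 or d < decode_steps:
        if remaining > 0:
            take = min(chunk_size, remaining)
            schedule.append(f"prefill:{take}")
            remaining -= take
        if d < decode_steps:
            schedule.append("decode")   # one token for every running sequence
            d += 1
    return schedule

print(schedule_chunked_prefill(2048, 512, 2))
# ['prefill:512', 'decode', 'prefill:512', 'decode', 'prefill:512', 'prefill:512']
```

With no chunking (chunk_size=2048), both decode steps would wait behind the full 2,048-token prefill.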
Operational Readiness
Serving an LLM is not just about model performance. The operational layer around the model determines whether it survives real traffic.
Metrics to monitor
Beyond TTFT and throughput, production LLM serving requires these additional metrics:
- GPU/VRAM utilization: Target 70-85% sustained VRAM. 100% means OOM crashes. Below 50% means wasted capacity.
- KV cache utilization: When this hits 100%, new requests queue. A leading indicator of throughput collapse.
- Request queue depth: Sustained depth > 10 means under-provisioned. A leading indicator of latency spikes.
- Error rate: Even 1% compounds under load because clients retry, amplifying traffic.
- Cost per 1K tokens: `(GPU_hourly_cost / 3600) / throughput_tps * 1000`. Example: an A100 at $3/hr doing 1,000 tok/s costs $0.00083 per 1K tokens.
- Batch size distribution: If consistently maxed, the server is saturated. If the average is 1, it is under-utilized.
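The cost formula as a helper, using the same example figures:

```python
# Cost per 1K generated tokens for a self-hosted GPU at a given throughput.
def cost_per_1k_tokens(gpu_hourly_usd: float, throughput_tps: float) -> float:
    return (gpu_hourly_usd / 3600) / throughput_tps * 1000

print(round(cost_per_1k_tokens(3.0, 1000), 5))  # 0.00083 -- A100 at $3/hr, 1,000 tok/s
```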
Key vLLM metrics to dashboard:
- `vllm:e2e_request_latency_seconds` - histogram with p50/p95/p99
- `vllm:time_to_first_token_seconds` - TTFT histogram
- `vllm:num_requests_running` - current batch size
- `vllm:num_requests_waiting` - queue depth
- `vllm:gpu_cache_usage_perc` - KV cache utilization
Load testing
Self-benchmarks with custom scripts produce misleading results. They use different concurrency patterns, request distributions, and timeout behaviors than real traffic. Use the same tool your production traffic will use: Locust, k6, or wrk. Match the concurrency pattern, request distribution, and timeout settings.
Infrastructure Patterns
System-level hardening
Before tuning any model parameters, fix the OS and hardware layer:
- File descriptor limits: `ulimit -n 65535`. The default 1024 causes "connection refused" errors under concurrent load.
- GPU persistence mode: `nvidia-smi -pm 1`. Without this, the GPU driver reinitializes on each process start.
- GPU clock locking: `nvidia-smi -lgc 1410,1410`. Without locked clocks, the GPU throttles under sustained load, causing gradual throughput degradation that looks like a software bug.
- Disk space: Model checkpoints are 25-40 GB each. Check `df -h` before anything else.
Scaling patterns
- Horizontal scaling: Multiple stateless vLLM instances behind a load balancer. Session state lives in Redis, not on the serving node.
- Queue-based autoscaling (KEDA): Traditional CPU/memory autoscaling does not work for LLMs. Scale on request queue depth instead.
- Load balancing: Use least-connections routing, not round-robin. LLM request durations vary wildly.
- Blue-green deployments: Deploy new model version to inactive environment, switch traffic atomically, roll back if metrics degrade.
- Circuit breakers: When an instance starts timing out, stop routing to it immediately instead of letting failures cascade.
Reliability patterns
- Auto-restart watchdog: A crash during production traffic is a permanent failure without one. Must include warmup, not just restart.
- Readiness probes: Don't route traffic until the model is loaded and warmed up. CUDA graph compilation can take minutes.
- Rate limiting: Token bucket or sliding window at the API gateway. Protect from burst abuse and runaway clients.
- Graceful degradation: Return 429 (Too Many Requests) with a `Retry-After` header when overloaded. Better than 500s or timeouts.
Self-Hosting vs API
Not every LLM workload needs self-hosting. The decision depends on three factors.
Volume
- < 1M tokens/day: API (infrastructure overhead not justified)
- 1M-50M tokens/day: evaluate case by case
- > 50M tokens/day: self-hosting likely more cost-effective
Example: 100M tokens/day
Via API ($0.001/1K): $100/day = $3,000/month
Self-hosted A100 (~$2/hr): ~$1,500/month with 7B model
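The break-even arithmetic, parameterized. All prices are the illustrative assumptions from the example above, not real quotes:

```python
# Monthly cost comparison: pay-per-token API vs renting a GPU around the clock.
def api_monthly_usd(tokens_per_day: float, usd_per_1k: float, days: int = 30) -> float:
    return tokens_per_day / 1000 * usd_per_1k * days

def selfhost_monthly_usd(gpu_hourly_usd: float, days: int = 30) -> float:
    return gpu_hourly_usd * 24 * days

print(api_monthly_usd(100e6, 0.001))   # 3000.0 -- 100M tok/day at $0.001/1K
print(selfhost_monthly_usd(2.0))       # 1440.0 -- GPU rent alone; ~$1,500 with overhead
```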
Data sensitivity
| | Low volume (<1M tok/day) | High volume (>50M tok/day) |
|---|---|---|
| Sensitive data | Self-host or private cloud | Self-host (mandatory) |
| Non-sensitive | API (cheapest) | Self-host (economics) |
Team capability
- No MLOps engineers? Use API. Avoid hidden operational debt.
- Small team, fast iteration? API. Focus on product, not infrastructure.
- Dedicated platform team? Evaluate self-hosting for high-volume workloads.
- Need specific open-source or fine-tuned model? Self-host (can't run custom weights on most APIs).
Most mature products use a hybrid: API for development, experimentation, and frontier model tasks. Self-hosted for high-volume production paths and fine-tuned models.