How LLM Inference Actually Works
LLM inference has two distinct phases, each with different computational characteristics. Understanding this split is the foundation of every optimization decision.
Phase 1: Prefill
When a request arrives, the model processes all input tokens in parallel to compute their key-value (KV) pairs. This is the prefill phase. It is compute-intensive because attention is O(n²) in input length: every input token attends to every other input token. The GPU's tensor cores are fully utilized here.
The output of prefill is a KV cache: a set of key and value tensors for each layer and attention head, representing the "memory" of the input sequence. These tensors are stored in VRAM and reused for every subsequent output token.
Phase 2: Decode
After prefill, the model generates output tokens one at a time. Each decode step produces exactly one token. The new token attends to all previous tokens via the KV cache, then gets appended to it. This phase is memory-bandwidth bound: the GPU reads the entire model's weights from VRAM for each token but only does a small amount of computation per read.
This is the fundamental asymmetry. Prefill is compute-bound and parallelizable. Decode is sequential and memory-bound. The GPU spends most of its decode time waiting for data to arrive from VRAM, not doing math.
What this means in practice
For a 7B FP16 model, the GPU reads ~14 GB of weights from VRAM on every decode step. An A100 with 2 TB/s HBM2e bandwidth can do this in ~7ms. That puts a hard ceiling of ~140 tokens/second for a single sequence, regardless of how many CUDA cores you have. The only way to improve decode throughput is to process multiple sequences at once (batching), amortizing the weight-read cost across many tokens.
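The ceiling above can be reproduced in a few lines. The hardware numbers are the rough figures used in this text (A100 HBM2e at ~2 TB/s), not measurements:

```python
# Back-of-envelope decode ceiling: each decode step must stream all model
# weights from VRAM, so memory bandwidth sets the single-sequence token rate.
def decode_ceiling_tps(param_count: float, bytes_per_param: float,
                       bandwidth_gbps: float) -> float:
    """Upper bound on single-sequence tokens/second for bandwidth-bound decode."""
    weight_bytes = param_count * bytes_per_param        # bytes read per token
    seconds_per_token = weight_bytes / (bandwidth_gbps * 1e9)
    return 1.0 / seconds_per_token

# 7B FP16 model on an A100 (~2,000 GB/s):
print(f"{decode_ceiling_tps(7e9, 2, 2000):.0f} tokens/s ceiling")  # 143 tokens/s
```

Note that no term in the formula involves FLOPS: a faster compute unit would not raise this ceiling, only faster memory or a smaller weight read would.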
Training vs inference: a comparison
| Dimension | Training | Inference |
|---|---|---|
| Computation | Forward + backward pass | Forward pass only |
| Memory | Weights + gradients + optimizer states (~84 GB for 7B) | Weights + KV cache (~15-22 GB for 7B) |
| Bottleneck | Compute-bound (FLOPS) | Memory bandwidth-bound |
| Execution | Large fixed batches | Small, latency-sensitive requests |
| Parallelism | Full sequence processed at once | Sequential token generation |
| Fault tolerance | Checkpointing allows recovery | Failures impact live users |
The Metrics That Matter
Six metrics define LLM serving performance. Each measures a different aspect of the system, and optimizing for one often hurts another.
Time to First Token (TTFT)
The time from request arrival to the first output token. This is the user-perceived "thinking" time. For streaming chat applications, TTFT is the most important UX metric: everything after the first token feels like continuous output. Typical targets: <300ms for real-time chat, <1s for document QA, <100ms for code completion.
TTFT is dominated by prefill time plus queue wait. If 50 requests arrive at once and the server can only prefill one at a time, the 50th request waits for 49 prefills before its own starts.
Inter-Token Latency (ITL) and Time Per Output Token (TPOT)
ITL is the time between consecutive output tokens for a single request. TPOT is the average time per output token across the full generation. For a streaming chat, ITL determines reading speed. If ITL exceeds ~80ms, the output feels choppy.
End-to-End Latency
Total time from request arrival to final token. This is TTFT + (output_tokens x TPOT).
For non-streaming (batch) requests, this is the only latency metric that matters.
Example: TTFT = 300ms, TPOT = 20ms, 200 output tokens
Total latency = 300 + (200 x 20) = 4,300ms
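The arithmetic above as a helper, using the same figures:

```python
# End-to-end latency per the formula in the text: TTFT + output_tokens x TPOT.
def e2e_latency_ms(ttft_ms: float, tpot_ms: float, output_tokens: int) -> float:
    return ttft_ms + output_tokens * tpot_ms

print(e2e_latency_ms(300, 20, 200))  # 4300.0
```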
Throughput (tokens/second and requests/second)
Total tokens generated across all concurrent requests per second. This is the metric for system capacity and cost efficiency. Higher throughput = more users per GPU = lower cost per request.
Goodput
The subset of throughput that produces useful output. If a request times out after generating 200 of 256 tokens, those 200 tokens count toward throughput but zero toward goodput. Goodput penalizes wasted computation and is a better measure of real system value.
The TTFT vs Throughput Trade-off
This is the central tension in LLM serving. Larger batch sizes increase throughput (more tokens per weight-read) but increase TTFT (new requests wait longer to enter the batch). The sweet spot depends on your workload:
| Batch size | TTFT | Throughput | Use case |
|---|---|---|---|
| Small (1-4) | ~100ms | ~500 tok/s | Interactive chat, code completion |
| Medium (8-32) | ~250ms | ~2,000 tok/s | Production chat (typical sweet spot) |
| Large (64-512) | ~800ms+ | ~4,000 tok/s | Batch processing, async workloads |
Continuous batching (used by vLLM and TGI) dynamically adjusts this: new requests join in-progress batches without waiting for the entire batch to finish, getting better throughput without sacrificing TTFT as badly.
Latency degrades under load
LLM latency is not a constant. It degrades nonlinearly with concurrent users:
1 concurrent user: TTFT = 200ms
10 concurrent users: TTFT = 400ms
50 concurrent users: TTFT = 2,000ms
Always test at peak expected concurrency, not average. The p99 latency under load is what determines whether your SLO holds.
Memory Is the Bottleneck, Not Compute
The single most important concept in LLM serving: your throughput ceiling is set by how many KV caches fit in the VRAM left over after loading model weights. Not by GPU FLOPS, not by CUDA core count. VRAM.
Model weight memory
The memory for model weights scales linearly with parameter count and inversely with quantization level:
| Precision | Bytes/param | 7B model | 35B model | 70B model |
|---|---|---|---|---|
| FP32 | 4 | 28 GB | 140 GB | 280 GB |
| FP16 / BF16 | 2 | 14 GB | 70 GB | 140 GB |
| INT8 | 1 | 7 GB | 35 GB | 70 GB |
| INT4 (GPTQ/AWQ) | 0.5 | 3.5 GB | 17.5 GB | 35 GB |
| FP8 | 1 | 7 GB | 35 GB | 70 GB |
KV cache memory
Each sequence in flight needs a KV cache. The size depends on model architecture and sequence length:
KV cache per sequence = 2 x num_layers x num_heads x head_dim x seq_len x bytes_per_element
Example (Llama-2 7B: 32 layers, 32 heads, 128 head_dim, 2048 seq_len, FP16):
= 2 x 32 x 32 x 128 x 2048 x 2 bytes
= 1.07 GB per sequence
On an A100 80GB with a 7B FP16 model (~14 GB weights + ~3 GB overhead), you have roughly 63 GB for KV cache. At 1.07 GB per sequence, that is about 59 concurrent sequences. With INT4 quantization, the model drops to ~3.5 GB, freeing 10.5 GB more. That is 10 more concurrent sequences, which directly translates to higher throughput.
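The sizing above can be checked directly. Architecture numbers are Llama-2 7B; the VRAM figures are the rough assumptions from the text (small differences from the prose come from decimal-GB rounding):

```python
# KV cache size per the formula above, plus the resulting concurrency ceiling.
def kv_cache_bytes(layers: int, heads: int, head_dim: int,
                   seq_len: int, bytes_per_elem: int = 2) -> int:
    # factor of 2 covers keys AND values
    return 2 * layers * heads * head_dim * seq_len * bytes_per_elem

per_seq = kv_cache_bytes(32, 32, 128, 2048)   # Llama-2 7B, FP16
print(per_seq / 1e9)                          # ~1.07 GB per sequence

free_vram_gb = 80 - 14 - 3                    # A100 80GB minus weights and overhead
max_sequences = int(free_vram_gb * 1e9 // per_seq)
print(max_sequences)                          # 58 concurrent sequences
```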
The VRAM budget
Total VRAM = Model weights + KV cache + Activations + Framework overhead
Practical rule of thumb:
VRAM needed = (params x bytes_per_param x 1.2)
The 1.2 multiplier covers activations, CUDA context, and framework buffers.
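The rule of thumb as a helper (an estimate, not a guarantee — long contexts or large batches need explicit KV cache budgeting on top):

```python
# VRAM estimate per the 1.2 rule of thumb above.
def vram_needed_gb(params: float, bytes_per_param: float) -> float:
    return params * bytes_per_param * 1.2 / 1e9

print(vram_needed_gb(7e9, 2))  # 16.8 -- a 7B FP16 model wants ~17 GB, not just 14
```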
HBM vs DDR: why GPUs win
Since LLM decode is memory-bandwidth bound, the type of memory matters enormously.
| Feature | CPU (DDR5) | GPU (HBM2e / HBM3) |
|---|---|---|
| Bandwidth | ~50-100 GB/s | ~2 TB/s (A100) / ~3.35 TB/s (H100) |
| Capacity | 32 GB - 2+ TB | 16 - 80 GB |
| Latency | ~70-100 ns | ~100-200 ns |
| Design goal | Optimized for serial latency | Optimized for parallel throughput |
Reading 14 GB of weights (7B FP16) takes ~7ms on A100 HBM vs ~140ms on server DDR5. That is a 20x gap. This is why running LLMs on CPU is so much slower, even with more total RAM.
Quantization: The First Lever to Pull
Quantization represents model weights (and sometimes activations) using fewer bits. It is the highest-ROI optimization for LLM serving: minimal accuracy loss, massive memory and throughput gains, and it can be applied post-training in minutes.
The quantization landscape
| Method | Bits | Type | Key idea |
|---|---|---|---|
| GPTQ | 4 | PTQ, weight-only | Layer-wise optimal quantization using second-order (Hessian) information |
| AWQ | 4 | PTQ, weight-only | Activation-aware: identifies the 1% of weights that matter most and protects them |
| FP8 | 8 | Full (weights + activations) | Hardware-native on H100; near-zero accuracy loss vs FP16 |
| bitsandbytes LLM.int8() | 8 | Mixed precision | Outlier channels stay FP16; rest quantized to INT8 |
| GGUF (llama.cpp) | 4-8 | Weight-only | CPU-friendly format for edge and desktop deployment |
PTQ vs QAT
Post-Training Quantization (PTQ) quantizes a pre-trained model without retraining. GPTQ and AWQ both work this way. Download a pre-quantized checkpoint and serve it immediately. This is what you use in production 95% of the time.
Quantization-Aware Training (QAT) simulates quantization during training, allowing the model to learn to compensate for precision loss. Better INT4 accuracy but requires full training compute. Use QAT only when PTQ quality is insufficient for your specific task.
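To make the storage math concrete, here is a minimal sketch of weight-only quantization using symmetric absmax round-to-nearest INT8. This is far cruder than GPTQ or AWQ, which use calibration data and per-group scales to decide how to round, but the memory accounting is identical: one byte per weight plus a scale factor.

```python
# Minimal weight-only PTQ sketch: symmetric absmax round-to-nearest INT8.
# Not GPTQ/AWQ -- just the simplest possible scheme for illustration.
def quantize_absmax_int8(weights):
    scale = max(abs(w) for w in weights) / 127      # map largest weight to +/-127
    q = [round(w / scale) for w in weights]         # stored as INT8 (1 byte each)
    return q, scale

def dequantize(q, scale):
    return [x * scale for x in q]

w = [0.12, -0.54, 0.33, 1.27, -0.98]
q, s = quantize_absmax_int8(w)
print(q)                   # [12, -54, 33, 127, -98]
print(dequantize(q, s))    # close to w, with small rounding error
```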
Optimization order of operations
Apply optimizations in ROI order. Don't jump to complex techniques before exhausting simple ones:
- Quantization (INT8/INT4) - 2-4x memory reduction, minimal quality impact
- Continuous batching - 3-5x throughput, no quality change
- KV cache optimization - reduce TTFT for shared prefixes
- Smaller model + fine-tuning - if quantized large model does not meet SLO
- Distillation - if you need smaller than any available checkpoint
- Structured pruning - last resort, highest quality risk
Pruning and Distillation
Pruning
Pruning removes weights that contribute minimally to output quality. Two categories:
- Unstructured pruning: Zeros out individual weights with smallest magnitude. Can achieve 50-90% sparsity, but requires sparse tensor hardware support to see actual speedups. Without hardware support, a 50% sparse model is NOT 2x faster. SparseGPT achieves 50-60% sparsity on large models with minimal quality loss.
- Structured pruning: Removes entire attention heads, FFN neurons, or transformer layers. Results in a genuinely smaller dense model with immediate speedups on standard hardware. More impactful on accuracy since entire functional units are removed.
Distillation
Distillation trains a smaller "student" model to mimic a larger "teacher" model. The student learns from the teacher's soft probability distributions, not just hard labels:
Loss = alpha x CrossEntropy(student, ground_truth)
     + (1 - alpha) x T^2 x KL_Divergence(softmax(student_logits / T), softmax(teacher_logits / T))
Where T = temperature (higher = softer distributions = richer information). Both student and teacher logits are softened by the same T; the T^2 factor keeps the gradients of the two terms on a comparable scale.
The teacher's soft probabilities carry richer information than hard labels. If the teacher assigns 5% probability to a wrong answer, that tells the student something about the similarity structure of the output space.
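The loss can be sketched in pure Python to make the temperature-softening explicit. In practice you would use batched framework ops (e.g. PyTorch's KL divergence on log-probabilities); the logit values below are made up for illustration:

```python
import math

def softmax(logits, T=1.0):
    # subtract max for numerical stability
    m = max(logits)
    exps = [math.exp((z - m) / T) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distill_loss(student_logits, teacher_logits, true_idx, alpha=0.5, T=2.0):
    # hard-label term: cross-entropy against the ground-truth class
    ce = -math.log(softmax(student_logits)[true_idx])
    # soft-label term: KL(teacher || student), both softened by T
    p_t = softmax(teacher_logits, T)
    p_s = softmax(student_logits, T)
    kl = sum(pt * math.log(pt / ps) for pt, ps in zip(p_t, p_s))
    return alpha * ce + (1 - alpha) * T * T * kl

print(distill_loss([2.0, 0.5, -1.0], [3.0, 1.0, -2.0], true_idx=0))
```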
Landmark examples: DistilBERT (40% smaller, 60% faster, 97% of BERT performance), TinyLlama (1.1B from Llama-2), Microsoft Phi series (frontier knowledge in 3B parameters).
Combining techniques
A typical compression pipeline chains all three:

70B FP16 teacher
    | Distillation
    v
7B FP16 student (domain-specific)
    | Quantization (AWQ INT4)
    v
7B INT4 student (3.5 GB VRAM - fits on consumer GPU)
    | Structured pruning (10% of attention heads)
    v
6.3B INT4 (3.15 GB VRAM)
| Dimension | Pruning | Quantization | Distillation |
|---|---|---|---|
| What changes | Removes weights/units | Reduces bit precision | Trains new smaller model |
| Memory reduction | Moderate | 2-8x | 2-10x |
| Accuracy impact | Moderate-High | Low (INT8~0%, INT4~1-3%) | Low (DistilBERT 97%) |
| Training required | Fine-tuning recommended | No (PTQ) | Yes (days-weeks) |
| Time to implement | Hours-days | Minutes-hours | Days-weeks |
| Best use case | Over-parameterized models | Production budget constraints | Purpose-built deployable models |
Batching Strategies
Since decode is bandwidth-bound, the only way to increase throughput is to process multiple sequences per weight-read. Batching is how you do this. Three strategies, each with different trade-offs.
Static batching
Collect N requests, process them as a batch, return all results when the longest sequence finishes. Simple but wasteful: short sequences sit idle while waiting for long ones. GPU utilization is low. TTFT is high because new requests must wait for the current batch to complete.
Continuous batching
What vLLM, TGI, and modern serving frameworks use. New requests join the batch immediately without waiting for existing requests to finish. When a sequence completes, its slot opens for a new request. The batch size dynamically adjusts to demand.
Better GPU utilization, lower TTFT, higher throughput. The trade-off is implementation complexity: the framework must manage variable-length sequences in the same batch, which requires careful memory management. This is vLLM's PagedAttention innovation: it manages KV cache memory like an operating system manages virtual memory, allocating and freeing cache blocks dynamically.
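The admission logic can be sketched with a toy scheduler. There is no real model here — each "decode step" just decrements a remaining-token counter — but it shows the key property: slots free up and refill mid-flight instead of draining the whole batch:

```python
from collections import deque

def continuous_batching(requests, max_batch=4):
    """requests: list of (request_id, tokens_to_generate). Returns finish order."""
    waiting = deque(requests)
    running = {}                       # request_id -> tokens remaining
    finished = []
    while waiting or running:
        # admit new requests into any free slots (no waiting for batch drain)
        while waiting and len(running) < max_batch:
            rid, n = waiting.popleft()
            running[rid] = n
        # one decode step: every running sequence emits one token
        for rid in list(running):
            running[rid] -= 1
            if running[rid] == 0:
                del running[rid]       # slot frees immediately
                finished.append(rid)
    return finished

print(continuous_batching([("a", 2), ("b", 5), ("c", 1), ("d", 3)], max_batch=2))
# ['a', 'c', 'b', 'd'] -- short requests finish without waiting for long ones
```

Under static batching, "c" (1 token) would have waited for "b" (5 tokens) to finish before even starting.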
Chunked prefill
When a new request arrives in continuous batching, its prefill (compute-heavy) competes with ongoing decode steps (memory-heavy) for GPU resources. A long prefill can stall all decodes in the batch.
Chunked prefill splits the prefill into smaller chunks that interleave with decode steps. This makes both TTFT and ITL "tunable" by controlling chunk size: smaller chunks give smoother decode latency but higher TTFT; larger chunks give faster prefill but spikier decode. vLLM supports this with `--enable-chunked-prefill`.
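A toy schedule makes the interleaving visible. The token counts are arbitrary illustration values:

```python
# Chunked prefill illustration: a long prefill is split into chunks so
# pending decode steps interleave with it instead of stalling behind it.
def schedule_chunked_prefill(prefill_tokens: int, chunk_size: int, decode_steps: int):
    """Returns the interleaved schedule as a list of step labels."""
    schedule = []
    remaining = prefill_tokens
    d = 0
    while remaining > 0 or d < decode_steps:
        if remaining > 0:
            take = min(chunk_size, remaining)
            schedule.append(f"prefill:{take}")
            remaining -= take
        if d < decode_steps:
            schedule.append("decode")   # one token for every running sequence
            d += 1
    return schedule

print(schedule_chunked_prefill(2048, 512, 2))
# ['prefill:512', 'decode', 'prefill:512', 'decode', 'prefill:512', 'prefill:512']
```

With no chunking (chunk_size=2048), both decode steps would wait behind the full 2,048-token prefill.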
Operational Readiness
Serving an LLM is not just about model performance. The operational layer around the model determines whether it survives real traffic.
Metrics to monitor
Beyond TTFT and throughput, production LLM serving requires these additional metrics:
- GPU/VRAM utilization: Target 70-85% sustained VRAM. 100% means OOM crashes. Below 50% means wasted capacity.
- KV cache utilization: When this hits 100%, new requests queue. A leading indicator of throughput collapse.
- Request queue depth: Sustained depth > 10 means under-provisioned. A leading indicator of latency spikes.
- Error rate: Even 1% compounds under load because clients retry, amplifying traffic.
- Cost per 1K tokens: `(GPU_hourly_cost / 3600) / throughput_tps * 1000`. Example: an A100 at $3/hr doing 1,000 tok/s costs $0.00083 per 1K tokens.
- Batch size distribution: If consistently maxed, the server is saturated. If the average is 1, it is under-utilized.
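The cost formula as a helper, using the same example figures:

```python
# Cost per 1K generated tokens for a self-hosted GPU at a given throughput.
def cost_per_1k_tokens(gpu_hourly_usd: float, throughput_tps: float) -> float:
    return (gpu_hourly_usd / 3600) / throughput_tps * 1000

print(round(cost_per_1k_tokens(3.0, 1000), 5))  # 0.00083 -- A100 at $3/hr, 1,000 tok/s
```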
Key vLLM metrics to dashboard:
- `vllm:e2e_request_latency_seconds` - histogram with p50/p95/p99
- `vllm:time_to_first_token_seconds` - TTFT histogram
- `vllm:num_requests_running` - current batch size
- `vllm:num_requests_waiting` - queue depth
- `vllm:gpu_cache_usage_perc` - KV cache utilization
Load testing
Self-benchmarks with custom scripts produce misleading results. They use different concurrency patterns, request distributions, and timeout behaviors than real traffic. Use the same tool your production traffic will use: Locust, k6, or wrk. Match the concurrency pattern, request distribution, and timeout settings.
Infrastructure Patterns
System-level hardening
Before tuning any model parameters, fix the OS and hardware layer:
- File descriptor limits: `ulimit -n 65535`. The default 1024 causes "connection refused" errors under concurrent load.
- GPU persistence mode: `nvidia-smi -pm 1`. Without this, the GPU driver reinitializes on each process start.
- GPU clock locking: `nvidia-smi -lgc 1410,1410`. Without locked clocks, the GPU throttles under sustained load, causing gradual throughput degradation that looks like a software bug.
- Disk space: Model checkpoints are 25-40 GB each. Check `df -h` before anything else.
Scaling patterns
- Horizontal scaling: Multiple stateless vLLM instances behind a load balancer. Session state lives in Redis, not on the serving node.
- Queue-based autoscaling (KEDA): Traditional CPU/memory autoscaling does not work for LLMs. Scale on request queue depth instead.
- Load balancing: Use least-connections routing, not round-robin. LLM request durations vary wildly.
- Blue-green deployments: Deploy new model version to inactive environment, switch traffic atomically, roll back if metrics degrade.
- Circuit breakers: When an instance starts timing out, stop routing to it immediately instead of letting failures cascade.
Reliability patterns
- Auto-restart watchdog: A crash during production traffic is a permanent failure without one. Must include warmup, not just restart.
- Readiness probes: Don't route traffic until the model is loaded and warmed up. CUDA graph compilation can take minutes.
- Rate limiting: Token bucket or sliding window at the API gateway. Protect from burst abuse and runaway clients.
- Graceful degradation: Return 429 (Too Many Requests) with a `Retry-After` header when overloaded. Better than 500s or timeouts.
Self-Hosting vs API
Not every LLM workload needs self-hosting. The decision depends on three factors.
Volume
- < 1M tokens/day: API (infrastructure overhead not justified)
- 1M-50M tokens/day: evaluate case by case
- > 50M tokens/day: self-hosting likely more cost-effective
Example: 100M tokens/day
Via API ($0.001/1K): $100/day = $3,000/month
Self-hosted A100 (~$2/hr): ~$1,500/month with 7B model
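The break-even arithmetic, parameterized. All prices are the illustrative assumptions from the example above, not real quotes:

```python
# Monthly cost comparison: pay-per-token API vs renting a GPU around the clock.
def api_monthly_usd(tokens_per_day: float, usd_per_1k: float, days: int = 30) -> float:
    return tokens_per_day / 1000 * usd_per_1k * days

def selfhost_monthly_usd(gpu_hourly_usd: float, days: int = 30) -> float:
    return gpu_hourly_usd * 24 * days

print(api_monthly_usd(100e6, 0.001))   # 3000.0 -- 100M tok/day at $0.001/1K
print(selfhost_monthly_usd(2.0))       # 1440.0 -- GPU rent alone; ~$1,500 with overhead
```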
Data sensitivity
| | Low volume (<1M tok/day) | High volume (>50M tok/day) |
|---|---|---|
| Sensitive data | Self-host or private cloud | Self-host (mandatory) |
| Non-sensitive | API (cheapest) | Self-host (economics) |
Team capability
- No MLOps engineers? Use API. Avoid hidden operational debt.
- Small team, fast iteration? API. Focus on product, not infrastructure.
- Dedicated platform team? Evaluate self-hosting for high-volume workloads.
- Need specific open-source or fine-tuned model? Self-host (can't run custom weights on most APIs).
Most mature products use a hybrid: API for development, experimentation, and frontier model tasks. Self-hosted for high-volume production paths and fine-tuned models.