LLMOps Inference Part 1 of 2

LLM Inference: The Theory You Need Before Deploying

Prefill vs decode. Why memory bandwidth matters more than FLOPS. How to calculate VRAM budgets. Which quantization method to pick. The three batching strategies and when each one applies.

March 16, 2026 · 15 min read

This is Part 1 of 2. This post covers the theory. Part 2 covers what happened when I applied (and failed to apply) this theory during a live LLM serving exercise on an A100 GPU.
Section 1

How LLM Inference Actually Works

Prefill · Decode · KV Cache

LLM inference has two distinct phases, each with different computational characteristics. Understanding this split is the foundation of every optimization decision.

Inference pipeline: Input Tokens (user prompt) -> Prefill (compute-bound) -> KV Cache (stored in VRAM) -> Decode (memory-bound) -> Output Tokens (one at a time)

Phase 1: Prefill

When a request arrives, the model processes all input tokens in parallel to compute their key-value (KV) pairs. This is the prefill phase. It is compute-intensive because attention is O(n²) in input length: every input token attends to every other input token. The GPU's tensor cores are fully utilized here.

The output of prefill is a KV cache: a set of key and value tensors for each layer and attention head, representing the "memory" of the input sequence. These tensors are stored in VRAM and reused for every subsequent output token.

Phase 2: Decode

After prefill, the model generates output tokens one at a time. Each decode step produces exactly one token. The new token attends to all previous tokens via the KV cache, then gets appended to it. This phase is memory-bandwidth bound: the GPU reads the entire model's weights from VRAM for each token but only does a small amount of computation per read.

This is the fundamental asymmetry. Prefill is compute-bound and parallelizable. Decode is sequential and memory-bound. The GPU spends most of its decode time waiting for data to arrive from VRAM, not doing math.

What this means in practice

For a 7B FP16 model, the GPU reads ~14 GB of weights from VRAM on every decode step. An A100 with 2 TB/s HBM2e bandwidth can do this in ~7ms. That puts a hard ceiling of ~140 tokens/second for a single sequence, regardless of how many CUDA cores you have. The only way to improve decode throughput is to process multiple sequences at once (batching), amortizing the weight-read cost across many tokens.
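This ceiling is easy to sanity-check. A minimal sketch, using only the bandwidth and model-size figures above:

```python
def decode_ceiling_tok_s(params_billions: float, bytes_per_param: float,
                         bandwidth_gb_s: float) -> float:
    """Upper bound on single-sequence decode speed.

    Every decode step must stream the full weights from VRAM once,
    so tokens/second can never exceed bandwidth / model size.
    """
    weights_gb = params_billions * bytes_per_param
    seconds_per_token = weights_gb / bandwidth_gb_s
    return 1.0 / seconds_per_token

# 7B params at FP16 (2 bytes/param) on an A100 (~2,000 GB/s) -> ~143 tok/s
ceiling = decode_ceiling_tok_s(7, 2, 2000)
```

Real throughput lands below this bound once attention reads the KV cache too, so treat it as an optimistic ceiling.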

At a glance: 2 TB/s A100 HBM bandwidth · ~7ms to read 7B FP16 weights · ~140 tok/s ceiling for a single sequence.
Key insight: Training uses ~6N FLOPs per token (forward + backward pass). Inference uses ~2N FLOPs per token (forward only). But training processes tokens in large parallel batches, while inference generates them one at a time. This is why inference is bandwidth-bound and training is compute-bound.
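The bandwidth-bound claim can be verified with a back-of-the-envelope roofline comparison. A sketch, assuming the A100's published peaks (~312 TFLOPS FP16 tensor-core compute, ~2 TB/s HBM bandwidth):

```python
# Roofline check: single-sequence decode on a 7B FP16 model.
N = 7e9                          # parameters
flops_per_token = 2 * N          # forward pass only (~2N rule of thumb)
bytes_per_token = 2 * N          # each FP16 weight (2 bytes) read once

# Arithmetic intensity: FLOPs performed per byte moved from VRAM
intensity = flops_per_token / bytes_per_token    # 1.0 FLOP/byte

# A100 balance point: peak compute / peak bandwidth
balance_point = 312e12 / 2e12                    # 156 FLOPs/byte

# Intensity far below the balance point -> decode is memory-bound:
# the tensor cores could do ~156 FLOPs in the time each byte arrives,
# but decode only has ~1 FLOP of work per byte.
assert intensity < balance_point
```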

Training vs inference: a comparison

Training vs Inference

Dimension       | Training                                                | Inference
Computation     | Forward + backward pass                                 | Forward pass only
Memory          | Weights + gradients + optimizer states (~84 GB for 7B)  | Weights + KV cache (~15-22 GB for 7B)
Bottleneck      | Compute-bound (FLOPS)                                   | Memory bandwidth-bound
Execution       | Large fixed batches                                     | Small, latency-sensitive requests
Parallelism     | Full sequence processed at once                         | Sequential token generation
Fault tolerance | Checkpointing allows recovery                           | Failures impact live users
Section 2

The Metrics That Matter

TTFT · Throughput · Latency

Six metrics define LLM serving performance. Each measures a different aspect of the system, and optimizing for one often hurts another.

Time to First Token (TTFT)

The time from request arrival to the first output token. This is the user-perceived "thinking" time. For streaming chat applications, TTFT is the most important UX metric: everything after the first token feels like continuous output. Typical targets: <300ms for real-time chat, <1s for document QA, <100ms for code completion.

TTFT is dominated by prefill time plus queue wait. If 50 requests arrive at once and the server can only prefill one at a time, the 50th request waits for 49 prefills before its own starts.

Inter-Token Latency (ITL) and Time Per Output Token (TPOT)

ITL is the time between consecutive output tokens for a single request. TPOT is the average time per output token across the full generation. For a streaming chat, ITL determines reading speed. If ITL exceeds ~80ms, the output feels choppy.

End-to-End Latency

Total time from request arrival to final token. This is TTFT + (output_tokens x TPOT). For non-streaming (batch) requests, this is the only latency metric that matters.

Example: TTFT = 300ms, TPOT = 20ms, 200 output tokens
Total latency = 300 + (200 x 20) = 4,300ms
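The arithmetic above generalizes to a one-liner:

```python
def e2e_latency_ms(ttft_ms: float, tpot_ms: float, output_tokens: int) -> float:
    """End-to-end latency = time to first token + per-token time for the rest."""
    return ttft_ms + output_tokens * tpot_ms

# Matches the worked example: 300 + (200 x 20) = 4,300ms
latency = e2e_latency_ms(300, 20, 200)
```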

Throughput (tokens/second and requests/second)

Total tokens generated across all concurrent requests per second. This is the metric for system capacity and cost efficiency. Higher throughput = more users per GPU = lower cost per request.

Goodput

The subset of throughput that produces useful output. If a request times out after generating 200 of 256 tokens, those 200 tokens count toward throughput but zero toward goodput. Goodput penalizes wasted computation and is a better measure of real system value.
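A sketch of the distinction, counting a timed-out request's tokens toward throughput but not goodput:

```python
def throughput_and_goodput(requests: list[tuple[int, bool]]) -> tuple[int, int]:
    """requests: (tokens_generated, completed_successfully) per request.

    Returns total tokens (throughput) and useful tokens (goodput).
    """
    throughput_tokens = sum(tokens for tokens, _ in requests)
    goodput_tokens = sum(tokens for tokens, ok in requests if ok)
    return throughput_tokens, goodput_tokens

# One request finished 256 tokens; another timed out after 200 of 256.
# All 456 tokens count toward throughput, only 256 toward goodput.
tp, gp = throughput_and_goodput([(256, True), (200, False)])
```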

The TTFT vs Throughput Trade-off

This is the central tension in LLM serving. Larger batch sizes increase throughput (more tokens per weight-read) but increase TTFT (new requests wait longer to enter the batch). The sweet spot depends on your workload:

Batch Size Trade-offs

Batch size     | TTFT    | Throughput   | Use case
Small (1-4)    | ~100ms  | ~500 tok/s   | Interactive chat, code completion
Medium (8-32)  | ~250ms  | ~2,000 tok/s | Production chat (typical sweet spot)
Large (64-512) | ~800ms+ | ~4,000 tok/s | Batch processing, async workloads

Continuous batching (used by vLLM and TGI) dynamically adjusts this: new requests join in-progress batches without waiting for the entire batch to finish, getting better throughput without sacrificing TTFT as badly.

Latency degrades under load

LLM latency is not a constant. It degrades nonlinearly with concurrent users:

1 concurrent user:   TTFT = 200ms
10 concurrent users: TTFT = 400ms
50 concurrent users: TTFT = 2,000ms

Always test at peak expected concurrency, not average. The p99 latency under load is what determines whether your SLO holds.

Section 3

Memory Is the Bottleneck, Not Compute

VRAM · KV Cache · OOM

The single most important concept in LLM serving: your throughput ceiling is set by how many KV caches fit in the VRAM left over after loading model weights. Not by GPU FLOPS, not by CUDA core count. VRAM.

Model weight memory

The memory for model weights scales linearly with parameter count and inversely with quantization level:

Weight Memory by Precision

Precision       | Bytes/param | 7B model | 35B model | 70B model
FP32            | 4           | 28 GB    | 140 GB    | 280 GB
FP16 / BF16     | 2           | 14 GB    | 70 GB     | 140 GB
INT8            | 1           | 7 GB     | 35 GB     | 70 GB
INT4 (GPTQ/AWQ) | 0.5         | 3.5 GB   | 17.5 GB   | 35 GB
FP8             | 1           | 7 GB     | 35 GB     | 70 GB

KV cache memory

Each sequence in flight needs a KV cache. The size depends on model architecture and sequence length:

KV cache per sequence = 2 x num_layers x num_heads x head_dim x seq_len x bytes_per_element

Example (Llama-2 7B: 32 layers, 32 heads, 128 head_dim, 2048 seq_len, FP16):
= 2 x 32 x 32 x 128 x 2048 x 2 bytes
= 1.07 GB per sequence
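The formula as code, reproducing the Llama-2 7B figure:

```python
def kv_cache_gb_per_seq(num_layers: int, num_heads: int, head_dim: int,
                        seq_len: int, bytes_per_element: int = 2) -> float:
    """KV cache size for one sequence; the leading 2 covers keys AND values."""
    total_bytes = 2 * num_layers * num_heads * head_dim * seq_len * bytes_per_element
    return total_bytes / 1e9

# Llama-2 7B: 32 layers, 32 heads, 128 head_dim, 2048 tokens, FP16 -> ~1.07 GB
per_seq = kv_cache_gb_per_seq(32, 32, 128, 2048)
```

Note that models using grouped-query attention (GQA) store KV for fewer heads than they use for queries, which shrinks this figure substantially; the formula above assumes standard multi-head attention as in Llama-2 7B.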

On an A100 80GB with a 7B FP16 model (~14 GB weights + ~3 GB overhead), you have roughly 63 GB for KV cache. At 1.07 GB per sequence, that is about 59 concurrent sequences. With INT4 quantization, the model drops to ~3.5 GB, freeing 10.5 GB more. That is 10 more concurrent sequences, which directly translates to higher throughput.
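Putting the budget together. A sketch using the same figures; the 2.8 GB framework overhead is an assumption in line with the estimates above:

```python
def max_concurrent_seqs(vram_gb: float, weights_gb: float,
                        overhead_gb: float, kv_per_seq_gb: float) -> int:
    """How many full-length KV caches fit after weights and overhead."""
    free_gb = vram_gb - weights_gb - overhead_gb
    return int(free_gb // kv_per_seq_gb)

# A100 80GB serving a 7B model at 2048 context, 1.07 GB KV cache per sequence:
fp16 = max_concurrent_seqs(80, 14.0, 2.8, 1.07)  # FP16 weights -> 59 sequences
int4 = max_concurrent_seqs(80, 3.5, 2.8, 1.07)   # INT4 weights -> 68 sequences
```

Quantizing the weights alone buys roughly ten extra concurrent sequences, which is exactly the compounding effect described below.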

[Interactive VRAM calculator: 7B model at FP16 on an 80 GB GPU -> 14.0 GB weights + 2.8 GB framework overhead leaves 63.2 GB for KV cache, or ~59 concurrent sequences at 1.07 GB each. Model fits with room for KV cache.]

The VRAM budget

Total VRAM = Model weights + KV cache + Activations + Framework overhead

Practical rule of thumb:
  VRAM needed = (params x bytes_per_param x 1.2)
  The 1.2 multiplier covers activations, CUDA context, and framework buffers.
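The rule of thumb in code:

```python
def vram_rule_of_thumb_gb(params_billions: float, bytes_per_param: float) -> float:
    """Minimum serving VRAM estimate: weights x 1.2.

    The 1.2 multiplier covers activations, CUDA context, and framework
    buffers. KV cache is NOT included -- it scales with concurrency and
    must be budgeted separately.
    """
    return params_billions * bytes_per_param * 1.2

# 7B at FP16 -> ~16.8 GB before any KV cache
estimate = vram_rule_of_thumb_gb(7, 2)
```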
This is why quantization matters so much for serving. Quantization does not just make inference faster per token. It frees VRAM for more KV cache, which allows more concurrent sequences, which increases throughput. The memory savings compound.

HBM vs DDR: why GPUs win

Since LLM decode is memory-bandwidth bound, the type of memory matters enormously.

Memory Bandwidth Comparison

Feature     | CPU (DDR5)                   | GPU (HBM2e / HBM3)
Bandwidth   | ~50-100 GB/s                 | ~2 TB/s (A100) / ~3.35 TB/s (H100)
Capacity    | 32 GB - 2+ TB                | 16 - 80 GB
Latency     | ~70-100 ns                   | ~100-200 ns
Design goal | Optimized for serial latency | Optimized for parallel throughput

Reading 14 GB of weights (7B FP16) takes ~7ms on A100 HBM vs ~140ms on server DDR5. That is a 20x gap. This is why running LLMs on CPU is so much slower, even with more total RAM.

Section 4

Quantization: The First Lever to Pull

GPTQ · AWQ · FP8 · INT8

Quantization represents model weights (and sometimes activations) using fewer bits. It is the highest-ROI optimization for LLM serving: minimal accuracy loss, massive memory and throughput gains, and it can be applied post-training in minutes.

The quantization landscape

Quantization Methods

Method                  | Bits | Type                         | Key idea
GPTQ                    | 4    | PTQ, weight-only             | Layer-wise optimal quantization using second-order (Hessian) information
AWQ                     | 4    | PTQ, weight-only             | Activation-aware: identifies the ~1% of weights that matter most and protects them
FP8                     | 8    | Full (weights + activations) | Hardware-native on H100; near-zero accuracy loss vs FP16
bitsandbytes LLM.int8() | 8    | Mixed precision              | Outlier channels stay FP16; the rest is quantized to INT8
GGUF (llama.cpp)        | 4-8  | Weight-only                  | CPU-friendly format for edge and desktop deployment
Quantization impact vs the FP16 baseline: INT4 cuts memory to ~0.25x and yields a 2-3x throughput gain; FP8 shows ~0% accuracy drop.

PTQ vs QAT

Post-Training Quantization (PTQ) quantizes a pre-trained model without retraining. GPTQ and AWQ both work this way. Download a pre-quantized checkpoint and serve it immediately. This is what you use in production 95% of the time.

Quantization-Aware Training (QAT) simulates quantization during training, allowing the model to learn to compensate for precision loss. Better INT4 accuracy but requires full training compute. Use QAT only when PTQ quality is insufficient for your specific task.
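To make "weight-only PTQ" concrete, here is a minimal sketch of symmetric per-tensor INT8 quantization. Real methods like GPTQ and AWQ operate per-group and use calibration data; this toy version only shows the core round-and-rescale idea:

```python
def quantize_int8(weights: list[float]) -> tuple[list[int], float]:
    """Symmetric per-tensor PTQ: one scale for the whole tensor, no retraining."""
    scale = max(abs(w) for w in weights) / 127.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale

def dequantize(quantized: list[int], scale: float) -> list[float]:
    """Recover approximate FP weights at inference time."""
    return [q * scale for q in quantized]

weights = [0.5, -1.0, 0.25, 0.01]
q, scale = quantize_int8(weights)
max_error = max(abs(a - b) for a, b in zip(dequantize(q, scale), weights))
```

The reconstruction error here is bounded by half the scale. Outlier weights inflate the scale and hurt every other weight, which is precisely the problem AWQ's outlier protection and LLM.int8()'s mixed precision address.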

Optimization order of operations

Apply optimizations in ROI order. Don't jump to complex techniques before exhausting simple ones:

  1. Quantization (INT8/INT4) - 2-4x memory reduction, minimal quality impact
  2. Continuous batching - 3-5x throughput, no quality change
  3. KV cache optimization - reduce TTFT for shared prefixes
  4. Smaller model + fine-tuning - if quantized large model does not meet SLO
  5. Distillation - if you need smaller than any available checkpoint
  6. Structured pruning - last resort, highest quality risk
Section 5

Pruning and Distillation

Pruning · Distillation · SparseGPT

Pruning

Pruning removes weights that contribute minimally to output quality. Two categories:

  • Unstructured pruning: Zeros out individual weights with smallest magnitude. Can achieve 50-90% sparsity, but requires sparse tensor hardware support to see actual speedups. Without hardware support, a 50% sparse model is NOT 2x faster. SparseGPT achieves 50-60% sparsity on large models with minimal quality loss.
  • Structured pruning: Removes entire attention heads, FFN neurons, or transformer layers. Results in a genuinely smaller dense model with immediate speedups on standard hardware. More impactful on accuracy since entire functional units are removed.

Distillation

Distillation trains a smaller "student" model to mimic a larger "teacher" model. The student learns from the teacher's soft probability distributions, not just hard labels:

Loss = alpha x CrossEntropy(student, ground_truth)
     + (1 - alpha) x KL_Divergence(softmax(teacher_logits / T), softmax(student_logits / T))

Where T = temperature (higher = softer distributions = richer information).
Both student and teacher logits are divided by T before the softmax.

The teacher's soft probabilities carry richer information than hard labels. If the teacher assigns 5% probability to a wrong answer, that tells the student something about the similarity structure of the output space.
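A toy version of the loss over a small vocabulary. This is a sketch of the common convention (both logit vectors temperature-scaled, KL taken teacher-to-student); production implementations also rescale the soft term by T²:

```python
import math

def softmax(logits: list[float], temperature: float = 1.0) -> list[float]:
    exps = [math.exp(z / temperature) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits: list[float], teacher_logits: list[float],
                      true_index: int, alpha: float = 0.5,
                      temperature: float = 2.0) -> float:
    # Hard-label term: cross-entropy against the ground-truth class
    hard = -math.log(softmax(student_logits)[true_index])
    # Soft-label term: KL(teacher || student) at temperature T
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    soft = sum(pt * math.log(pt / ps) for pt, ps in zip(p_teacher, p_student))
    return alpha * hard + (1 - alpha) * soft
```

When the student's distribution matches the teacher's exactly, the KL term vanishes and only the hard-label term remains; any divergence from the teacher adds loss.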

Landmark examples: DistilBERT (40% smaller, 60% faster, 97% of BERT performance), TinyLlama (1.1B from Llama-2), Microsoft Phi series (frontier knowledge in 3B parameters).

Combining techniques

70B FP16 teacher
    | Distillation
    v
7B FP16 student (domain-specific)
    | Quantization (AWQ INT4)
    v
7B INT4 student (3.5 GB VRAM - fits on consumer GPU)
    | Structured pruning (10% attention heads)
    v
6.3B INT4 (3.15 GB VRAM)
Technique Comparison

Dimension         | Pruning                   | Quantization                  | Distillation
What changes      | Removes weights/units     | Reduces bit precision         | Trains new smaller model
Memory reduction  | Moderate                  | 2-8x                          | 2-10x
Accuracy impact   | Moderate-high             | Low (INT8 ~0%, INT4 ~1-3%)    | Low (DistilBERT: 97%)
Training required | Fine-tuning recommended   | No (PTQ)                      | Yes (days-weeks)
Time to implement | Hours-days                | Minutes-hours                 | Days-weeks
Best use case     | Over-parameterized models | Production budget constraints | Purpose-built deployable models
Section 6

Batching Strategies

Static · Continuous · Chunked Prefill

Since decode is bandwidth-bound, the only way to increase throughput is to process multiple sequences per weight-read. Batching is how you do this. Three strategies, each with different trade-offs.

Static vs continuous batching, in one line each:

  • Static batching: all requests wait for the longest to finish.
  • Continuous batching: requests join and leave dynamically.

Static batching

Collect N requests, process them as a batch, return all results when the longest sequence finishes. Simple but wasteful: short sequences sit idle while waiting for long ones. GPU utilization is low. TTFT is high because new requests must wait for the current batch to complete.

Continuous batching

What vLLM, TGI, and modern serving frameworks use. New requests join the batch immediately without waiting for existing requests to finish. When a sequence completes, its slot opens for a new request. The batch size dynamically adjusts to demand.

Better GPU utilization, lower TTFT, higher throughput. The trade-off is implementation complexity: the framework must manage variable-length sequences in the same batch, which requires careful memory management. This is vLLM's PagedAttention innovation: it manages KV cache memory like an operating system manages virtual memory, allocating and freeing cache blocks dynamically.
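A toy step-count simulation shows why the slot-refill effect matters. This is a sketch: real schedulers also juggle prefill, KV cache memory, and priorities, but the core advantage survives the simplification.

```python
def static_batch_steps(output_lengths: list[int], batch_size: int) -> int:
    """Decode steps when each batch holds until its longest sequence finishes."""
    steps = 0
    for i in range(0, len(output_lengths), batch_size):
        steps += max(output_lengths[i:i + batch_size])
    return steps

def continuous_batch_steps(output_lengths: list[int], batch_size: int) -> int:
    """Decode steps when a finished sequence's slot is refilled immediately."""
    pending, active, steps = list(output_lengths), [], 0
    while pending or active:
        while pending and len(active) < batch_size:
            active.append(pending.pop(0))   # new request joins mid-flight
        steps += 1                          # one decode step for the batch
        active = [n - 1 for n in active if n > 1]  # drop finished sequences
    return steps

# Mixed short/long requests, batch size 2:
lengths = [10, 2, 10, 2]
# static: 20 steps (short sequences idle behind long ones)
# continuous: 12 steps (slots stay busy)
```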

Chunked prefill

When a new request arrives in continuous batching, its prefill (compute-heavy) competes with ongoing decode steps (memory-heavy) for GPU resources. A long prefill can stall all decodes in the batch.

Chunked prefill splits the prefill into smaller chunks that interleave with decode steps. This makes both TTFT and ITL "tunable" by controlling chunk size. Smaller chunks = smoother decode latency but higher TTFT. Larger chunks = faster prefill but spikier decode. vLLM supports this with --enable-chunked-prefill.

Not every deployment needs chunked prefill. For short inputs and high-throughput batch workloads, standard continuous batching is sufficient. Chunked prefill shines when you have long input sequences (RAG with 4K+ context) mixed with latency-sensitive decode.
Section 7

Operational Readiness

Monitoring · Prometheus · Grafana

Serving an LLM is not just about model performance. The operational layer around the model determines whether it survives real traffic.

Metrics to monitor

Beyond TTFT and throughput, production LLM serving requires these additional metrics:

  • GPU/VRAM utilization: Target 70-85% sustained VRAM. 100% means OOM crashes. Below 50% means wasted capacity.
  • KV cache utilization: When this hits 100%, new requests queue. A leading indicator of throughput collapse.
  • Request queue depth: Sustained depth > 10 means under-provisioned. A leading indicator of latency spikes.
  • Error rate: Even 1% compounds under load because clients retry, amplifying traffic.
  • Cost per 1K tokens: (GPU_hourly_cost / 3600) / throughput_tps * 1000. Example: A100 at $3/hr doing 1,000 tok/s = $0.00083 per 1K tokens.
  • Batch size distribution: If consistently maxed: saturated. If avg=1: under-utilized.
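The cost formula from the list above, as code:

```python
def cost_per_1k_tokens_usd(gpu_hourly_usd: float, throughput_tps: float) -> float:
    """Dollar cost per 1,000 generated tokens at a sustained throughput."""
    return gpu_hourly_usd / 3600.0 / throughput_tps * 1000.0

# A100 at $3/hr sustaining 1,000 tok/s -> ~$0.00083 per 1K tokens
cost = cost_per_1k_tokens_usd(3.0, 1000.0)
```

Note this assumes the GPU is fully utilized; at 20% utilization the effective cost per token is 5x higher, which is why throughput tuning is also cost tuning.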
Observability stack: vLLM (/metrics endpoint) -> Prometheus (scrapes every 5-15s) -> Grafana (dashboards) -> Alertmanager (PagerDuty / Slack)

Key vLLM metrics to dashboard:

  • vllm:e2e_request_latency_seconds - histogram with p50/p95/p99
  • vllm:time_to_first_token_seconds - TTFT histogram
  • vllm:num_requests_running - current batch size
  • vllm:num_requests_waiting - queue depth
  • vllm:gpu_cache_usage_perc - KV cache utilization

Load testing

Self-benchmarks with custom scripts produce misleading results. They use different concurrency patterns, request distributions, and timeout behaviors than real traffic. Use the same tool your production traffic will use: Locust, k6, or wrk. Match the concurrency pattern, request distribution, and timeout settings.
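As an illustration, a minimal Locust user exercising a vLLM OpenAI-compatible endpoint. The model name, prompt, wait times, and timeout here are placeholders: match them to your real traffic's distribution, not to what makes the benchmark look good.

```python
from locust import HttpUser, task, between

class ChatUser(HttpUser):
    # Wait 1-3s between requests, mimicking a human chat cadence (placeholder)
    wait_time = between(1, 3)

    @task
    def completion(self):
        self.client.post(
            "/v1/completions",
            json={
                "model": "meta-llama/Llama-2-7b-hf",  # placeholder model name
                "prompt": "Summarize the following paragraph: ...",
                "max_tokens": 128,
            },
            timeout=30,  # match your production client's timeout, not infinity
        )
```

Run it against the serving host (e.g. `locust -f loadtest.py --host http://gpu-node:8000`) and ramp concurrency past your expected peak while watching the vLLM metrics above.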

Section 8

Infrastructure Patterns

Scaling · Reliability · Hardening

System-level hardening

Before tuning any model parameters, fix the OS and hardware layer:

  • File descriptor limits: ulimit -n 65535. The default 1024 causes "connection refused" under concurrent load.
  • GPU persistence mode: nvidia-smi -pm 1. Without this, the GPU driver reinitializes on each process start.
  • GPU clock locking: nvidia-smi -lgc 1410,1410. Without locked clocks, the GPU throttles under sustained load, causing gradual throughput degradation that looks like a software bug.
  • Disk space: Model checkpoints are 25-40 GB each. Check df -h before anything else.
At a glance: 407x cold-vs-hot warmup difference · ulimit -n 65535 recommended · 25-40 GB per model checkpoint.

Scaling patterns

  • Horizontal scaling: Multiple stateless vLLM instances behind a load balancer. Session state lives in Redis, not on the serving node.
  • Queue-based autoscaling (KEDA): Traditional CPU/memory autoscaling does not work for LLMs. Scale on request queue depth instead.
  • Load balancing: Use least-connections routing, not round-robin. LLM request durations vary wildly.
  • Blue-green deployments: Deploy new model version to inactive environment, switch traffic atomically, roll back if metrics degrade.
  • Circuit breakers: When an instance starts timing out, stop routing to it immediately instead of letting failures cascade.

Reliability patterns

  • Auto-restart watchdog: A crash during production traffic is a permanent failure without one. Must include warmup, not just restart.
  • Readiness probes: Don't route traffic until the model is loaded and warmed up. CUDA graph compilation can take minutes.
  • Rate limiting: Token bucket or sliding window at the API gateway. Protect from burst abuse and runaway clients.
  • Graceful degradation: Return 429 (Too Many Requests) with Retry-After header when overloaded. Better than 500s or timeouts.
Section 9

Self-Hosting vs API

Cost · Data Privacy · Team Size

Not every LLM workload needs self-hosting. The decision depends on three factors.

Volume

< 1M tokens/day       -  API (infrastructure overhead not justified)
1M - 50M tokens/day   -  Evaluate case by case
> 50M tokens/day      -  Self-hosting likely more cost-effective

Example: 100M tokens/day
  Via API ($0.001/1K):        $100/day = $3,000/month
  Self-hosted A100 (~$2/hr):  ~$1,500/month with 7B model
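The break-even arithmetic from the example, as a sketch (it assumes a single ~$2/hr A100 running a 7B model can absorb the whole load; heavier models or traffic need more GPUs):

```python
def api_monthly_usd(tokens_per_day: float, usd_per_1k: float = 0.001) -> float:
    """API cost: pay per token, scales linearly with volume."""
    return tokens_per_day / 1000.0 * usd_per_1k * 30

def self_host_monthly_usd(gpu_hourly_usd: float = 2.0, num_gpus: int = 1) -> float:
    """Self-hosting cost: pay for GPU-hours, flat regardless of volume."""
    return gpu_hourly_usd * 24 * 30 * num_gpus

# 100M tokens/day: $3,000/month via API vs ~$1,440/month on one A100
api_cost = api_monthly_usd(100e6)
hosted_cost = self_host_monthly_usd()
```

The crossover is the whole point: API cost scales with tokens while self-hosting is a step function in GPUs, so high steady volume favors self-hosting and spiky or low volume favors the API.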

Data sensitivity

Decision Matrix

               | Low volume (<1M tok/day)   | High volume (>50M tok/day)
Sensitive data | Self-host or private cloud | Self-host (mandatory)
Non-sensitive  | API (cheapest)             | Self-host (economics)

Team capability

  • No MLOps engineers? Use API. Avoid hidden operational debt.
  • Small team, fast iteration? API. Focus on product, not infrastructure.
  • Dedicated platform team? Evaluate self-hosting for high-volume workloads.
  • Need specific open-source or fine-tuned model? Self-host (can't run custom weights on most APIs).

Most mature products use a hybrid: API for development, experimentation, and frontier model tasks. Self-hosted for high-volume production paths and fine-tuned models.