LLMOps · vLLM Inference · Part 2 of 2

What I Learned from a Live LLM Serving Gauntlet

19 engineers, 19 A100 GPUs, one model, two hours of Locust fire. I spent a weekend tuning 40+ vLLM configurations and scraping peer metrics. Then someone with a 30-line cache proxy beat all of us.

Tech Stack
vLLM 0.17.1 · A100 80GB · Qwen3.5-35B-A3B · GPTQ-Int4 · Python · Redis · nginx · Prometheus · Locust
  • 407x warmup difference (89s vs 219ms TTFT)
  • 191% throughput gain (FP8 to GPTQ-Int4)
  • 50x cache multiplier (vs model tuning)
  • 2 unique prompts (in the entire blast)

The Setup

19 of us were each given an A100 80GB GPU VM and tasked with serving Qwen3.5-35B-A3B as fast as possible using vLLM. The organizers would blast all servers simultaneously with Locust for 2 hours, measuring throughput (RPS), TTFT, median and p95 latency, and error rate. A live leaderboard showed rankings in real time.

The model is a Mixture of Experts (MoE) architecture: 35B total parameters but only 3B active per token. Compute per token is cheap, but the full 35B of weights still needs to live in VRAM. The throughput ceiling is determined by how much GPU memory is left for KV cache after loading the model, not by how fast the GPU can multiply matrices.

If the theory in Part 1 is "what you should know," this post is "what happens when you don't."

The Optimization Journey: 650 to 3,114 tok/s

Phase 1: FP8 (the wrong starting point)

I started with the FP8 quantized model (~35 GB). It seemed like the "best quality" option. After testing 20+ configurations across gpu_memory_utilization, max_num_seqs, and max_num_batched_tokens, the best I got was 1,070 tok/s. Hours of tuning yielded only marginal gains.

The problem: at 35 GB, the model left only ~39 GB for KV cache on an 80 GB GPU. Under high concurrency, the server ran out of KV cache blocks and started queuing requests, killing throughput.

Phase 2: AWQ (better, not best)

Switching to the AWQ model (~25 GB) freed 10 GB of VRAM. Throughput jumped to 1,211 tok/s without changing any configuration. That single model switch gave more improvement than all my parameter tuning combined. This was the first sign that I was optimizing at the wrong layer.

Phase 3: GPTQ-Int4 (the winner)

GPTQ-Int4 (~24.5 GB) left maximum VRAM for KV cache. With aggressive settings (gpu_util=0.98, max_num_seqs=512, kv_cache_dtype=fp8_e4m3), I hit 3,114 tok/s. A 191% improvement over the FP8 starting point.

Throughput Progression (tok/s)

  • FP8 (~35 GB VRAM): 1,070 (baseline)
  • AWQ (~25 GB VRAM): 1,211 (+13%)
  • GPTQ-Int4 (~24.5 GB VRAM): 3,114 (+191%)
Model selection is a 2-3x lever. Parameter tuning is a 10-15% lever. Always test all available quantizations before spending time on config sweeps.
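The VRAM arithmetic behind this progression fits in a few lines. The 80 GB total and the weight footprints come from the numbers above; the 0.95 utilization fraction and the ~4 GB overhead for activations and CUDA graphs are rough assumptions, not measured values:

```python
# Back-of-envelope KV cache headroom on an 80 GB A100, using the weight
# footprints from this post. The 0.95 utilization and ~4 GB overhead
# (activations + CUDA graph scratch) are assumptions, not measurements.
TOTAL_VRAM_GB = 80
GPU_MEM_UTIL = 0.95   # fraction of VRAM vLLM is allowed to claim
OVERHEAD_GB = 4       # assumed non-weight, non-KV-cache overhead

def kv_headroom_gb(weights_gb: float) -> float:
    """VRAM left for the KV cache pool after weights and overhead."""
    return TOTAL_VRAM_GB * GPU_MEM_UTIL - weights_gb - OVERHEAD_GB

for name, weights in [("FP8", 35.0), ("AWQ", 25.0), ("GPTQ-Int4", 24.5)]:
    print(f"{name:10s} weights ~{weights:.1f} GB  KV pool ~{kv_headroom_gb(weights):.1f} GB")
```

Under these assumptions the Int4 model nearly doubles the KV cache pool relative to FP8, which is where the extra concurrency (and throughput) comes from.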

Warmup: 407x TTFT difference

vLLM compiles CUDA graphs on the first request at each batch size. Without warmup, the first request took 89 seconds. After warmup: 219ms. That is a 407x difference. I built a targeted warmup script that ramps concurrency through 1, 2, 4, 8, 16, 32, and 64 concurrent requests to pre-compile all graph variants before the blast started.
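The ramp can be sketched roughly like this (stdlib-only for brevity; the actual script used different tooling, and the endpoint, model name, and payload here are placeholders):

```python
# Sketch of the warmup ramp: fire 1, 2, 4, ... 64 simultaneous requests so
# vLLM compiles its CUDA graph variants before real traffic arrives.
# URL, model name, and prompt are placeholders, not the exercise's values.
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor

LEVELS = [1, 2, 4, 8, 16, 32, 64]
URL = "http://localhost:8000/v1/completions"

def one_request() -> int:
    body = json.dumps({"model": "qwen", "prompt": "warmup", "max_tokens": 8}).encode()
    req = urllib.request.Request(URL, data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        resp.read()
        return resp.status

def warmup() -> None:
    with ThreadPoolExecutor(max_workers=max(LEVELS)) as pool:
        for n in LEVELS:
            # n requests in flight at once exercises a batch of (at least) n
            statuses = list(pool.map(lambda _: one_request(), range(n)))
            print(f"batch {n}: {statuses.count(200)}/{n} ok")

if __name__ == "__main__":
    warmup()
```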

Scouting Other Servers

vLLM exposes a Prometheus metrics endpoint at /metrics by default. It is unauthenticated. This endpoint reveals the server's full configuration:

  • gpu_memory_utilization, cache_dtype, num_gpu_blocks
  • swap_space, block_size, max_model_len
  • prefix_caching enabled or not
  • The model name and version

I built a suite of scripts to scrape these metrics from all 17 peer VMs:

  • scout_competitors.py - health check + model identification for all VMs
  • deep_scout.py - benchmark each server with 16 requests at 8 concurrency
  • leaderboard.py - DIY replacement for the official dashboard (which went down mid-exercise)
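The core of the scout is just fetching /metrics and keeping the config-revealing lines. A minimal sketch (the keyword list is illustrative; exact vLLM metric and label names vary by version):

```python
# Sketch of the scouting idea: pull /metrics from a peer VM and filter for
# lines that leak configuration. Keyword names are illustrative assumptions;
# real vLLM metric names differ across versions.
import urllib.request

INTERESTING = ("cache_config", "num_gpu_blocks", "gpu_memory_utilization",
               "block_size", "max_model_len")

def fetch_metrics(host: str, port: int = 8000, timeout: float = 3.0) -> str:
    """Fetch the raw Prometheus text exposition from a peer server."""
    with urllib.request.urlopen(f"http://{host}:{port}/metrics", timeout=timeout) as r:
        return r.read().decode()

def interesting_lines(metrics_text: str) -> list[str]:
    """Keep non-comment lines mentioning any config-revealing keyword."""
    return [line for line in metrics_text.splitlines()
            if not line.startswith("#") and any(k in line for k in INTERESTING)]
```

Looping `interesting_lines(fetch_metrics(host))` over the peer VM list is all `scout_competitors.py` fundamentally needs to do.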
Scraped Metrics: #1 Performer (Scott)

  • gpu_util: 0.46
  • KV blocks: 661
  • Throughput: 6,800+ tok/s
  • cache_dtype: fp8_e4m3

Something does not add up. A server at gpu_util=0.46 should not have enough KV cache capacity for 6,800 tok/s, and 661 blocks is far below what my fully-tuned config used. He was not winning at the model layer. Something else was going on entirely.
Secure your /metrics endpoint in production: it reveals your entire serving configuration to anyone with network access. In a friendly exercise, scraping peer configs is fair game and a great learning tool.

The Caching Revelation

When I intercepted the incoming Locust requests (via tcpdump on the VM), I discovered the organizers were sending only 2 unique prompts:

"Write a 50-word story about a robot exploring a garden."
"What is the capital of Singapore?"

Both with max_tokens: 256 and a mix of streaming and non-streaming.

The #1 performer was not running a faster model. He had a response cache proxy sitting in front of vLLM. First request goes to the model. Every identical request after that returns from an in-memory dictionary in under 5ms. The GPU is idle after the first few requests.

Cache Proxy Architecture

  • Locust (organizers) → cache_app.py on port 8000
  • Cache check: hash(body)
  • HIT: return cached response in <5ms
  • MISS: forward to vLLM (port 8001), cache and return


The cache app is about 50 lines of Python using aiohttp. It hashes the request body, checks an in-memory dict, and either returns the cached response or forwards to vLLM. Streaming and non-streaming requests use separate cache keys because they return different formats.

With only 2 unique prompts, after a handful of cache misses (one per prompt and streaming-mode combination) every subsequent request returned instantly. The server's throughput became limited by Python's async I/O speed, not GPU compute. Adding uvloop (a C-based event loop replacement) further improved cache-hit throughput.
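A stripped-down sketch of the same idea, using only the stdlib instead of aiohttp and uvloop (ports follow the architecture above; everything else is my reconstruction, not the actual proxy code):

```python
# Minimal response-cache proxy sketch using only the stdlib (the real one
# was async, built on aiohttp + uvloop). Hash the request body, serve
# repeats from memory, forward misses to vLLM on port 8001.
import hashlib
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

UPSTREAM = "http://localhost:8001"
CACHE: dict[str, bytes] = {}

def cache_key(body: bytes) -> str:
    # The `stream` flag lives inside the JSON body, so hashing the raw body
    # already gives streaming and non-streaming requests separate entries.
    return hashlib.sha256(body).hexdigest()

class CacheProxy(BaseHTTPRequestHandler):
    def do_POST(self):
        body = self.rfile.read(int(self.headers.get("Content-Length", 0)))
        key = cache_key(body)
        if key not in CACHE:  # MISS: ask vLLM once, remember the answer
            req = urllib.request.Request(UPSTREAM + self.path, data=body,
                                         headers={"Content-Type": "application/json"})
            with urllib.request.urlopen(req) as resp:
                CACHE[key] = resp.read()
        self.send_response(200)
        self.end_headers()
        self.wfile.write(CACHE[key])  # HIT path: straight from the dict

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8000), CacheProxy).serve_forever()
```

A real proxy would also forward upstream status codes and handle streaming responses chunk by chunk; this sketch only shows the hash-check-forward core.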

The real lesson: I spent an entire weekend tuning vLLM parameters to squeeze out an extra 10-15% throughput. A 30-line Python cache proxy gave a 50x improvement. Full-stack thinking beats ML-layer tunnel vision every time.

When caching applies in production

Our exercise had an artificially small prompt set, which made caching absurdly effective. But caching is useful in real production too. RAG systems use the same system prompt across all requests. vLLM's prefix caching (--enable-prefix-caching) caches KV tensors for shared prompt prefixes, reducing TTFT by 30-60% for these workloads. Application-level response caching works for FAQ-style queries where the same questions recur frequently.

The general principle: understand your workload's uniqueness distribution before deciding where to invest optimization effort. If 80% of your traffic is 20% of your prompts, caching is the highest-leverage intervention.

Intercepting the Workload

I built request logging into the cache proxy. Every request was logged to /tmp/locust_requests.jsonl with:

  • Whether it is a new unique prompt or a repeat
  • Streaming vs non-streaming mode
  • max_tokens, temperature, model name
  • Prompt preview and running count of unique prompts seen

This revealed that max_model_len=8192 (my default) was massive overkill: the actual workload needed only ~271 tokens (15 input + 256 output). Dropping to max_model_len=512 freed substantial capacity, because vLLM budgets memory and request admission around the maximum possible sequence length.
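A sketch of what such a logger records per request; the field names mirror the list above but are my reconstruction, not the exact schema of /tmp/locust_requests.jsonl:

```python
# Per-request log record builder: track whether each prompt is new or a
# repeat, plus the request parameters that matter for tuning. Field names
# are reconstructed from the description above, not the original schema.
import hashlib
import json

SEEN: set[str] = set()

def log_record(body: dict) -> dict:
    prompt = body.get("prompt") or json.dumps(body.get("messages", ""))
    digest = hashlib.sha256(prompt.encode()).hexdigest()
    is_new = digest not in SEEN
    SEEN.add(digest)
    return {
        "unique": is_new,                     # new prompt or a repeat?
        "stream": bool(body.get("stream", False)),
        "max_tokens": body.get("max_tokens"),
        "temperature": body.get("temperature"),
        "model": body.get("model"),
        "prompt_preview": prompt[:80],
        "unique_prompts_seen": len(SEEN),     # running count
    }

# In the proxy, append one line per request:
# with open("/tmp/locust_requests.jsonl", "a") as f:
#     f.write(json.dumps(log_record(body)) + "\n")
```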

Rule: Always instrument your server to log incoming requests. Understanding the workload is half the optimization. The first question for any serving optimization: "What exactly does the client send, and what does it measure?"

Mistakes

1. Tunnel vision on ML parameters

I spent the entire weekend tuning gpu_memory_utilization, max_num_seqs, kv_cache_dtype, and max_num_batched_tokens. I tested 40+ configurations. The difference between the best and worst config was about 15%. Meanwhile, a response cache gave 50x improvement. I was so focused on the model layer that I completely missed the application layer sitting right above it.

Always draw the full request path: client, proxy, cache, model, response. Optimize each layer. The biggest wins are usually not at the model level.

2. Did not understand the workload

I optimized for "generic LLM serving" without asking: what will the evaluator actually send? If I had intercepted the Locust traffic on day one, I would have built the cache proxy immediately, set max_model_len=512 instead of 8192, used targeted warmup with the exact prompts, and saved two days of blind parameter tuning.

3. Switched models mid-blast

I switched from AWQ to GPTQ-Int4 during the live blast, causing 5+ minutes of downtime. The dashboard uses a 15-minute rolling window. Those 5 minutes of errors showed as a 21% error rate for the next 15 minutes, tanking my ranking even after the new model was running well.

  • 21% error rate from the mid-blast switch
  • 5 min downtime during the switch
  • 15 min rolling window showing the damage

All model and config decisions must be finalized before production traffic starts. Before any mid-production change, calculate: "How long will the rolling metrics window show degraded numbers? Is the potential gain worth it?"

4. Trusted self-benchmarks

My custom benchmark showed AWQ and GPTQ-Int4 as nearly identical. Under real Locust load with different concurrency patterns and request distributions, GPTQ-Int4 was clearly better. Always benchmark from a neutral machine using the evaluator's actual tool and settings.

5. Ignored system-level constraints

ulimit -n 1024 (default) caused connection refused errors under Locust load. Disk space at 100% from downloading 3 model checkpoints caused "Engine core initialization failed" errors that took hours to debug because the error message never mentions disk. Check ulimit -n and df -h before anything else.
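Both checks are easy to automate before launch. A minimal preflight sketch (the thresholds are my assumptions; pick values for your deployment):

```python
# Preflight checks for the two silent killers above: file-descriptor limit
# and free disk space. Thresholds are assumptions, not universal values.
# Requires a Unix-like OS (the `resource` module is not available on Windows).
import resource
import shutil

def check_fd_limit(minimum: int = 65535) -> bool:
    """True if the soft open-files limit is at least `minimum`."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    return soft >= minimum

def check_disk(path: str = "/", min_free_gb: float = 20.0) -> bool:
    """True if `path` has at least `min_free_gb` GB free."""
    return shutil.disk_usage(path).free / 1e9 >= min_free_gb

if __name__ == "__main__":
    print("fd limit ok:", check_fd_limit())
    print("disk ok:", check_disk())
```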

15 Lessons for LLM Serving

01. Model selection matters 10x more than parameter tuning.
Don't tune parameters on a suboptimal model. Test all quantizations first. For MoE models, smaller footprint = more KV cache = more concurrency = higher throughput.

02. KV cache size is the throughput bottleneck.
LLM decode is memory-bandwidth bound. What limits concurrency is how many KV caches fit in the VRAM left after loading weights. Every GB saved on model weights translates directly to more concurrent sequences.

03. Understand your workload before optimizing.
The first question for any optimization: "What exactly does the client send?" Intercept real traffic. Log requests. Match max_model_len to actual usage.

04. Caching is the real multiplier.
For repetitive workloads, a response cache beats any amount of model-level tuning. Even in production with diverse queries, prefix caching provides 30-60% TTFT reduction for shared system prompts.

05. Warmup is non-negotiable.
Cold TTFT: 89 seconds. Warm TTFT: 219ms. That is 407x. vLLM compiles CUDA graphs lazily. Warm up before accepting traffic.

06. Quantization is the highest-ROI optimization.
INT4 (GPTQ/AWQ) reduces memory 4x vs FP16 with 1-3% accuracy loss. This frees VRAM for KV cache and enables higher concurrency. Apply it first.

07. Secure your /metrics endpoint.
vLLM's Prometheus endpoint is unauthenticated and reveals the full server configuration. In production, put it behind auth. In a learning exercise, use it to learn from peers.

08. Disk space kills silently.
Multiple model downloads fill a disk fast, and vLLM's error when the disk is full does not mention disk. Always check df -h first.

09. Lock GPU clocks for sustained load.
Without nvidia-smi -lgc 1410,1410, the GPU throttles under sustained load, causing gradual throughput degradation that looks like a software bug.

10. Auto-restart watchdogs are essential.
A crash during a multi-hour production window is a permanent failure without one. The watchdog must include warmup, not just restart.

11. Self-benchmarks lie.
Benchmark from a neutral machine using the evaluator's tool, concurrency pattern, and request distribution. Internal benchmarks produce systematically optimistic results.

12. Never change configs mid-production.
Rolling metrics windows amplify the cost of downtime. Five minutes of errors shows as 15 minutes of degraded metrics. Finalize all decisions before traffic starts.

13. Fix system limits before model tuning.
ulimit -n 65535, GPU persistence mode, disk space. These cause the most confusing failures, and the error messages never point to the root cause.

14. Continuous batching is table stakes.
Static batching wastes GPU cycles waiting for the longest sequence. vLLM's continuous batching with PagedAttention should be your default.

15. Think full-stack, not model-stack.
Draw the request path: client, proxy, cache, model, response. The biggest wins are often at the application layer (caching, routing, workload interception), not the ML layer. Optimize each layer independently.

Acknowledgments

Credit to Scott for demonstrating full-stack thinking. While the rest of us were deep in vLLM parameter spreadsheets, he approached the problem as a systems engineer: realized the prompts were repeating, built a 30-line response cache proxy in front of vLLM, and made GPU compute irrelevant for repeated requests, which is why his scraped gpu_util=0.46 and tiny KV block count looked impossible. Meanwhile, Sze Vy took a different shortcut: returning only the first token.

That is the lesson I will carry forward: don't just optimize the model, optimize the system. The best ML optimization in the world loses to someone who asks "do we even need to run inference on this request?"

Thanks to the organizers for designing an exercise that taught more about production LLM systems in one weekend than any textbook could.
