What I Learned from a Live LLM Serving Gauntlet
19 engineers, 19 A100 GPUs, one model, two hours of Locust fire. I spent a weekend tuning 40+ vLLM configurations and scraping peer metrics. Then someone with a 30-line cache proxy beat all of us.
The Setup
19 of us were each given an A100 80GB GPU VM and tasked with serving
Qwen3.5-35B-A3B as fast as possible using vLLM.
The organizers would blast all servers simultaneously with Locust for 2 hours, measuring throughput (RPS),
TTFT, median and p95 latency, and error rate. A live leaderboard showed rankings in real time.
The model is a Mixture of Experts (MoE) architecture: 35B total parameters but only 3B active per token. Compute per token is cheap, but the full 35B of weights still needs to live in VRAM. The throughput ceiling is determined by how much GPU memory is left for KV cache after loading the model, not by how fast the GPU can multiply matrices.
If the theory in Part 1 is "what you should know," this post is "what happens when you don't."
The Optimization Journey: 650 to 3,114 tok/s
Phase 1: FP8 (the wrong starting point)
I started with the FP8 quantized model (~35 GB). It seemed like the "best quality" option.
After testing 20+ configurations across gpu_memory_utilization, max_num_seqs,
and max_num_batched_tokens, the best I got was 1,070 tok/s.
Hours of tuning yielded only marginal gains.
The problem: at 35 GB, the model left only ~39 GB for KV cache on an 80 GB GPU. Under high concurrency, the server ran out of KV cache blocks and started queuing requests, killing throughput.
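The arithmetic behind this bottleneck is simple enough to sketch. The gpu_util values below are illustrative assumptions chosen to match the numbers in the text, not measured settings, and runtime overhead (activation buffers, CUDA graphs) is ignored for simplicity:

```python
def kv_budget_gb(total_gb, weights_gb, gpu_util):
    """VRAM vLLM may claim (gpu_memory_utilization * total) minus model
    weights = what is left for the KV cache, ignoring runtime overhead."""
    return total_gb * gpu_util - weights_gb

# FP8 checkpoint (~35 GB weights) on an 80 GB A100, assumed gpu_util=0.92
fp8_budget = kv_budget_gb(80, 35.0, gpu_util=0.92)    # ~38.6 GB for KV cache

# GPTQ-Int4 checkpoint (~24.5 GB weights) with the aggressive gpu_util=0.98
int4_budget = kv_budget_gb(80, 24.5, gpu_util=0.98)   # ~53.9 GB for KV cache
```

The quantization switch buys roughly 15 GB of extra KV cache headroom, which is why it dwarfed every parameter-tuning gain.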
Phase 2: AWQ (better, not best)
Switching to the AWQ model (~25 GB) freed 10 GB of VRAM. Throughput jumped to 1,211 tok/s without changing any configuration. That single model switch gave more improvement than all my parameter tuning combined. This was the first sign that I was optimizing at the wrong layer.
Phase 3: GPTQ-Int4 (the winner)
GPTQ-Int4 (~24.5 GB) left maximum VRAM for KV cache. With aggressive settings
(gpu_util=0.98, max_num_seqs=512, kv_cache_dtype=fp8_e4m3),
I hit 3,114 tok/s. A 191% improvement over the FP8 starting point.
Warmup: 407x TTFT difference
vLLM compiles CUDA graphs on the first request at each batch size. Without warmup, the first request took 89 seconds. After warmup: 219ms. That is a 407x difference. I built a targeted warmup script that ramps concurrency through 1, 2, 4, 8, 16, 32, and 64 concurrent requests to pre-compile all graph variants before the blast started.
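A minimal sketch of that ramp logic, assuming `send_request` is a placeholder for an HTTP POST to the server's completions endpoint with the exact blast prompts:

```python
from concurrent.futures import ThreadPoolExecutor

def warmup_levels(max_concurrency=64):
    """Powers-of-two ramp: 1, 2, 4, ... 64 — one step per batch size
    we want vLLM to have a compiled CUDA graph for."""
    level = 1
    while level <= max_concurrency:
        yield level
        level *= 2

def run_warmup(send_request, max_concurrency=64):
    """Fire `level` simultaneous requests at each step so every graph
    variant is captured before real traffic arrives."""
    results = []
    for level in warmup_levels(max_concurrency):
        with ThreadPoolExecutor(max_workers=level) as pool:
            replies = list(pool.map(lambda _: send_request(), range(level)))
        results.append((level, len(replies)))
    return results
```

The total cost is 127 requests up front; every request after that skips graph compilation entirely.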
Scouting Other Servers
vLLM exposes a Prometheus metrics endpoint at /metrics by default. It is unauthenticated.
This endpoint reveals the server's full configuration:
- gpu_memory_utilization, cache_dtype, num_gpu_blocks
- swap_space, block_size, max_model_len
- prefix_caching enabled or not
- The model name and version
I built a suite of scripts to scrape these metrics from all 17 peer VMs:
- scout_competitors.py - health check + model identification for all VMs
- deep_scout.py - benchmark each server with 16 requests at 8 concurrency
- leaderboard.py - DIY replacement for the official dashboard (which went down mid-exercise)
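The scraping core is just Prometheus text parsing. A sketch of the gauge extraction (the metric names shown are the ones recent vLLM versions expose, e.g. vllm:num_requests_running; treat the exact names as assumptions to verify against your vLLM version):

```python
import re

def parse_vllm_gauges(metrics_text, names):
    """Extract gauge values by name from a Prometheus /metrics dump,
    tolerating an optional {label="..."} set after the metric name."""
    found = {}
    for name in names:
        pattern = rf"^{re.escape(name)}(?:\{{[^}}]*\}})?\s+([-+0-9.eE]+)\s*$"
        match = re.search(pattern, metrics_text, re.MULTILINE)
        if match:
            found[name] = float(match.group(1))
    return found
```

Pointed at each peer VM's /metrics port, this is enough to recover gpu_memory_utilization, cache usage, and running-request counts in one pass.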
Scouting surfaced an anomaly. One server was reporting roughly 6,800 tok/s despite gpu_util=0.46, which should not leave anywhere near enough KV cache capacity for that throughput. Its 661 KV cache blocks were far below what my fully tuned config used. That server was not winning at the model layer. Something else was going on entirely.
The Caching Revelation
When I intercepted the incoming Locust requests (via tcpdump on the VM), I discovered the organizers were sending only 2 unique prompts:
"Write a 50-word story about a robot exploring a garden."
"What is the capital of Singapore?"
Both with max_tokens: 256 and a mix of streaming and non-streaming.
The #1 performer was not running a faster model. He had a response cache proxy sitting in front of vLLM. The first request goes to the model. Every identical request after that returns from an in-memory dictionary in under 5ms. The GPU is idle after the first two requests.
The cache app is about 50 lines of Python using aiohttp. It hashes the request body, checks an in-memory dict, and either returns the cached response or forwards to vLLM. Streaming and non-streaming requests use separate cache keys because they return different formats.
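The cache logic itself can be sketched in a few lines. This is a reconstruction of the idea, not the actual competition code, and the aiohttp server wiring is omitted; only the key scheme and lookup are shown:

```python
import hashlib
import json

class ResponseCache:
    """Hash the request body to a cache key. Streaming and non-streaming
    requests get distinct keys because the wire formats differ
    (SSE chunks vs a single JSON body)."""

    def __init__(self):
        self._store = {}

    def key(self, body: bytes) -> str:
        mode = "stream" if json.loads(body).get("stream") else "plain"
        return mode + ":" + hashlib.sha256(body).hexdigest()

    def get_or_forward(self, body: bytes, forward):
        """Return (response, hit). On a miss, call `forward` (the vLLM
        upstream) once and memoize the result."""
        k = self.key(body)
        if k in self._store:
            return self._store[k], True   # hit: no GPU work at all
        resp = forward(body)              # miss: ask vLLM once
        self._store[k] = resp
        return resp, False
```

In the real proxy, `forward` would be an async HTTP call to vLLM; here it is any callable so the logic stands alone.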
With only 2 unique prompts, after 2 cache misses every subsequent request returned instantly. The server's throughput became limited by Python's async I/O speed, not GPU compute. Adding uvloop (a C-based event loop replacement) further improved cache-hit throughput.
When caching applies in production
Our exercise had an artificially small prompt set, which made caching absurdly effective.
But caching is useful in real production too. RAG systems use the same system prompt across all requests.
vLLM's prefix caching (--enable-prefix-caching) caches KV tensors for shared prompt prefixes,
reducing TTFT by 30-60% for these workloads. Application-level response caching works for FAQ-style queries
where the same questions recur frequently.
The general principle: understand your workload's uniqueness distribution before deciding where to invest optimization effort. If 80% of your traffic is 20% of your prompts, caching is the highest-leverage intervention.
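Measuring that distribution is a one-liner worth running before any tuning. A sketch of the upper bound on cache hit rate, assuming every unique prompt is computed once and cached forever:

```python
from collections import Counter

def cache_hit_fraction(prompts):
    """Best-case fraction of requests served from cache: every request
    beyond the first occurrence of each unique prompt is a hit."""
    total = len(prompts)
    if total == 0:
        return 0.0
    return (total - len(Counter(prompts))) / total
```

For our blast, 1,000 requests over 2 unique prompts gives a 99.8% ceiling; for a workload of all-unique prompts it gives 0%, and caching is the wrong investment.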
Intercepting the Workload
I built request logging into the cache proxy. Every request was logged to
/tmp/locust_requests.jsonl with:
- Whether it is a new unique prompt or a repeat
- Streaming vs non-streaming mode
- max_tokens, temperature, model name
- Prompt preview and running count of unique prompts seen
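A sketch of such a logger (field names and the payload shape are my illustrative choices, not the exact competition code):

```python
import hashlib
import json
import time

_seen_prompts = set()

def log_request(payload, path):
    """Append one JSON line describing an incoming completion request."""
    prompt = str(payload.get("prompt", ""))
    digest = hashlib.sha256(prompt.encode()).hexdigest()
    is_new = digest not in _seen_prompts
    _seen_prompts.add(digest)
    record = {
        "ts": round(time.time(), 3),
        "new_prompt": is_new,                    # first time seeing this prompt?
        "stream": bool(payload.get("stream")),
        "max_tokens": payload.get("max_tokens"),
        "temperature": payload.get("temperature"),
        "model": payload.get("model"),
        "prompt_preview": prompt[:60],
        "unique_seen": len(_seen_prompts),       # running unique-prompt count
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")
    return record
```

Tailing the resulting .jsonl during the blast shows the unique-prompt count flatline almost immediately, which is the whole story.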
This revealed that max_model_len=8192 (my default) was massive overkill. The actual workload
needed only ~271 tokens (15 input + 256 output). Dropping to max_model_len=512 freed
enormous KV cache capacity because vLLM pre-allocates cache blocks for the maximum possible sequence length.
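The per-sequence arithmetic makes the effect concrete (block_size=16 is vLLM's default; the exact allocation behavior varies by version, so treat this as a worst-case sketch):

```python
import math

def worst_case_blocks(max_model_len, block_size=16):
    """KV cache blocks a single sequence can claim when it runs out to
    the configured maximum length."""
    return math.ceil(max_model_len / block_size)

# max_model_len=8192 -> 512 blocks per sequence worst case
# max_model_len=512  ->  32 blocks per sequence worst case
```

A 16x smaller worst case per sequence means the scheduler can safely admit far more concurrent requests from the same block pool.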
Mistakes
1. Tunnel vision on ML parameters
I spent the entire weekend tuning gpu_memory_utilization, max_num_seqs,
kv_cache_dtype, and max_num_batched_tokens. I tested 40+ configurations.
The difference between the best and worst config was about 15%.
Meanwhile, a response cache gave 50x improvement. I was so focused on the model layer that I completely
missed the application layer sitting right above it.
Always draw the full request path: client, proxy, cache, model, response. Optimize each layer. The biggest wins are usually not at the model level.
2. Did not understand the workload
I optimized for "generic LLM serving" without asking: what will the evaluator actually send?
If I had intercepted the Locust traffic on day one, I would have built the cache proxy immediately,
set max_model_len=512 instead of 8192, used targeted warmup with the exact prompts,
and saved two days of blind parameter tuning.
3. Switched models mid-blast
I switched from AWQ to GPTQ-Int4 during the live blast, causing 5+ minutes of downtime. The dashboard uses a 15-minute rolling window. Those 5 minutes of errors showed as a 21% error rate for the next 15 minutes, tanking my ranking even after the new model was running well.
All model and config decisions must be finalized before production traffic starts. Before any mid-production change, calculate: "How long will the rolling metrics window show degraded numbers? Is the potential gain worth it?"
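The rolling-window math is worth internalizing before making that call. A sketch with illustrative per-minute numbers (not the actual 21% incident):

```python
def rolling_error_rate(per_minute, window=15):
    """per_minute: chronological (requests, errors) tuples, one per minute.
    Returns the error rate over the trailing `window` minutes."""
    recent = per_minute[-window:]
    requests = sum(r for r, _ in recent)
    errors = sum(e for _, e in recent)
    return errors / requests if requests else 0.0

# 10 healthy minutes, then 5 minutes of total downtime:
minutes = [(100, 0)] * 10 + [(100, 100)] * 5
# The window now shows ~33% errors even though the server is back up,
# and it stays degraded until all 5 bad minutes roll out of the window.
```

Five bad minutes in a fifteen-minute window dominate the displayed rate long after recovery, so a mid-traffic switch has to be worth that penalty.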
4. Trusted self-benchmarks
My custom benchmark showed AWQ and GPTQ-Int4 as nearly identical. Under real Locust load with different concurrency patterns and request distributions, GPTQ-Int4 was clearly better. Always benchmark from a neutral machine using the evaluator's actual tool and settings.
5. Ignored system-level constraints
ulimit -n 1024 (default) caused connection refused errors under Locust load.
Disk space at 100% from downloading 3 model checkpoints caused "Engine core initialization failed"
errors that took hours to debug because the error message never mentions disk.
Check ulimit -n and df -h before anything else.
15 Lessons for LLM Serving
- Right-size max_model_len to actual usage.
- Check df -h first.
- Without nvidia-smi -lgc 1410,1410, the GPU throttles under sustained load, causing gradual throughput degradation that looks like a software bug.
- Set system-level limits up front: ulimit -n 65535, GPU persistence mode, disk space. These cause the most confusing failures, and the error messages never point to the root cause.

Acknowledgments
Credit to Scott for demonstrating full-stack thinking.
While the rest of us were deep in vLLM parameter spreadsheets, he approached the problem as a
systems engineer: realized the prompts were repeating, built a 30-line response cache proxy
in front of vLLM, and made GPU compute irrelevant for repeated requests.
Meanwhile, Sze Vy took a different shortcut: returning only the first token,
which explains her gpu_util=0.46 and low KV block usage.
That is the lesson I will carry forward: don't just optimize the model, optimize the system. The best ML optimization in the world loses to someone who asks "do we even need to run inference on this request?"
Thanks to the organizers for designing an exercise that taught more about production LLM systems in one weekend than any textbook could.