Required reading is now three articles, about 40 minutes total. The fourth item is an optional transformer refresher; read it only if the attention and KV-cache basics feel stale.
Required - 12 min
A current, practical explanation of prefill, decode, KV cache, TTFT, and inter-token latency. It gives you the vocabulary interviewers expect when discussing LLM serving behavior.
Extract: prefill is the prompt-processing phase; decode is the token-by-token phase that repeatedly reuses cached K/V state.
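To make the prefill/decode split concrete, here is a minimal sketch in plain Python. It is a toy, not how any serving engine is implemented: `ToyKVCache`, `prefill`, `decode_step`, and the identity-style "projections" are all hypothetical names and simplifications chosen for illustration. The point it tries to show is that prefill fills the cache for every prompt token, while each decode step adds exactly one new K/V entry and reads the rest from the cache.

```python
import math

def attend(q, keys, values):
    """Single-head softmax attention of one query over all cached keys/values."""
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q)) for k in keys]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total for i in range(dim)]

class ToyKVCache:
    """Hypothetical cache: stores one K and one V vector per token seen so far."""
    def __init__(self):
        self.keys, self.values = [], []
        self.projections = 0  # number of tokens whose K/V were computed

    def append(self, x):
        # Assumption: stand-in projections (K = x, V = 2x) instead of learned weights.
        self.keys.append(list(x))
        self.values.append([2.0 * xi for xi in x])
        self.projections += 1

def prefill(cache, prompt_embeddings):
    # Real engines process the whole prompt in one batched pass; the loop is a stand-in.
    for x in prompt_embeddings:
        cache.append(x)
    return attend(prompt_embeddings[-1], cache.keys, cache.values)

def decode_step(cache, x):
    # Only the newest token needs fresh K/V; earlier entries are reused from the cache.
    cache.append(x)
    return attend(x, cache.keys, cache.values)

def uncached_projections(prompt_len, new_tokens):
    """K/V computations if every step recomputed the full sequence instead of caching."""
    return sum(prompt_len + i for i in range(new_tokens + 1))
```

With a 3-token prompt and 2 decode steps, the cached path computes K/V for 5 tokens in total, while `uncached_projections(3, 2)` counts 12. That gap grows with sequence length, which is why decode leans on the cache.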
Required - 18 min
The best fit for Course 1 because it derives KV caching, chunked prefill, and continuous batching from the user-visible behavior of chat systems. It is more directly useful here than a broad optimization survey.
Extract: continuous batching is iteration-level scheduling; it keeps the running batch full, and the GPU busy, while requests enter and leave at different times.
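A small simulation can make "iteration-level scheduling" tangible. The sketch below is an assumption-laden toy: it ignores prefill cost and memory limits, treats every decode iteration as one time step, and uses hypothetical function names. It compares a static batch, which waits for its longest request, with a scheduler that admits waiting requests whenever a slot frees up.

```python
from collections import deque

def continuous_batching(requests, max_batch):
    """requests: list of (name, tokens_to_generate). Returns (steps, finish_order)."""
    waiting = deque(requests)
    active = {}  # name -> tokens still to generate
    steps, finished = 0, []
    while waiting or active:
        # Admit new requests at every iteration, as soon as slots are free.
        while waiting and len(active) < max_batch:
            name, remaining = waiting.popleft()
            active[name] = remaining
        # One decode iteration: every active request produces one token.
        steps += 1
        for name in list(active):
            active[name] -= 1
            if active[name] <= 0:
                del active[name]  # the slot is reusable on the next iteration
                finished.append(name)
    return steps, finished

def static_batching(requests, max_batch):
    """Each fixed batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(requests), max_batch):
        batch = requests[i:i + max_batch]
        steps += max(remaining for _, remaining in batch)
    return steps

requests = [("a", 2), ("b", 8), ("c", 2), ("d", 2)]
print(static_batching(requests, max_batch=2))      # 10
print(continuous_batching(requests, max_batch=2))  # (8, ['a', 'c', 'd', 'b'])
```

In this example the short requests `c` and `d` ride along in the slot that `a` vacates instead of waiting for `b` to finish, so the total drops from 10 iterations to 8.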
Required - 10 min
The canonical vLLM introduction. It explains why KV cache fragmentation limits batching and how PagedAttention applies virtual-memory ideas to LLM serving.
Extract: PagedAttention is a memory-management design; better cache packing lets the serving layer batch more active sequences.
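The memory-management idea can be sketched as a block allocator with a per-sequence block table. This is an illustrative toy under stated assumptions, not vLLM's actual implementation or API: `PagedKVAllocator`, its methods, and the block sizes are hypothetical. It shows sequences claiming fixed-size blocks only as they grow, instead of reserving a contiguous maximum-length region up front.

```python
class PagedKVAllocator:
    """Toy allocator: KV memory is split into fixed-size blocks handed out on demand."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # seq_id -> list of physical block ids (need not be contiguous)
        self.lengths = {}       # seq_id -> number of tokens stored

    def append_token(self, seq_id):
        length = self.lengths.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        # A new block is needed only when the sequence has none or its last one is full.
        if length == len(table) * self.block_size:
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.pop())
        self.lengths[seq_id] = length + 1

    def free(self, seq_id):
        # A finished sequence returns its blocks for immediate reuse.
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

def contiguous_capacity(total_slots, max_len):
    """Sequences that fit if each one reserves max_len token slots up front."""
    return total_slots // max_len
```

With 8 blocks of 4 tokens (32 slots), reserving a contiguous 16-token region per sequence fits only 2 sequences, whatever their real length. Under the paged scheme a 5-token sequence holds 2 blocks and a 9-token sequence holds 3, leaving 3 blocks free for further sequences. Waste is bounded by the unused tail of each sequence's last block.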
Optional - 8 min
Use only as a quick refresher if Q/K/V, attention, and autoregressive decoding are not fresh. It is background, not a required AI infra article.
Extract: why decoding produces one token at a time even though each forward pass uses highly parallel matrix operations.
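The data dependency behind one-token-at-a-time decoding is easy to see in a greedy loop. The sketch below uses a hypothetical `toy_forward` in place of a real model; its scoring rule is made up purely so the example is deterministic. What matters is the loop shape: step t+1 cannot start until step t has chosen its token, however parallel each forward pass is internally.

```python
def toy_forward(tokens, vocab_size=5):
    """Hypothetical stand-in for a forward pass: returns one logit per vocab entry.

    Assumption: the made-up rule favours (last_token + 1) mod vocab_size.
    """
    last = tokens[-1]
    return [-((last + 1 - v) % vocab_size) for v in range(vocab_size)]

def greedy_decode(prompt, max_new_tokens):
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        logits = toy_forward(tokens)  # parallel inside, but needs all previous tokens
        next_token = max(range(len(logits)), key=logits.__getitem__)
        tokens.append(next_token)     # the next step depends on this choice
    return tokens

print(greedy_decode([0, 1], max_new_tokens=4))  # [0, 1, 2, 3, 4, 0]
```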
Selection Note
The original first-course list was too broad. Redis, Hugging Face, and the vLLM blog are the strongest required set for a one-hour serving-foundations session. Jay Alammar is useful but optional. NVIDIA's inference optimization article is high-quality, but it is better for Course 2 because it introduces too many knobs at once. The Anthropic and OpenAI agent articles move to the agent architecture week.