AI Learning Ramp | Course 1

Program Strategy

This ramp prioritizes interview-critical systems judgment before breadth. The first phase builds the mental model for how LLM requests move through serving infrastructure. The second connects that infrastructure to retrieval, query engines, tools, agents, evals, reliability, security, and cost. The final phase turns the material into system-design answers, tradeoff trees, and mock interview prompts.

Course 1: LLM Serving Foundations

One-hour objective: after this session, you should be able to whiteboard the end-to-end path of a production LLM request, explain prefill vs decode, and name the core latency, throughput, and memory bottlenecks.

0-5 min

Set the system-design frame.

Anchor on a single user request: prompt assembly, routing, inference, streaming, post-processing, and observability.

5-17 min

Separate prefill from decode.

Read the concise lifecycle article and capture the difference between TTFT and inter-token latency.
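
To make the two metrics concrete, here is a minimal sketch that derives TTFT and mean inter-token latency from token arrival timestamps. The function name and the sample timestamps are illustrative assumptions, not taken from the article.

```python
from statistics import mean


def latency_metrics(request_sent_s, token_times_s):
    """Return (ttft, mean_itl) in seconds from token arrival timestamps.

    TTFT covers queueing plus prefill up to the first streamed token.
    Mean ITL averages the gaps between consecutive decode tokens.
    """
    if not token_times_s:
        raise ValueError("need at least one token timestamp")
    ttft = token_times_s[0] - request_sent_s
    gaps = [b - a for a, b in zip(token_times_s, token_times_s[1:])]
    mean_itl = mean(gaps) if gaps else 0.0
    return ttft, mean_itl


# Hypothetical timestamps (seconds) for a four-token response.
ttft, itl = latency_metrics(0.0, [0.5, 0.75, 1.0, 1.25])
print(f"TTFT={ttft:.3f}s mean ITL={itl:.3f}s")
```

TTFT mostly exposes prefill (plus any queueing), while ITL exposes the decode loop.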

17-35 min

Understand batching and KV cache pressure.

Read the selected continuous-batching sections and connect request length, concurrency, GPU utilization, and memory pressure.
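
A back-of-envelope estimate helps connect request length and concurrency to memory pressure. The sketch below uses the standard per-token KV cache size (keys plus values, per layer, per KV head). The model shape and workload numbers are assumptions chosen for illustration, roughly a 7B-class model without grouped-query attention in fp16.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, bytes_per_elem,
                   tokens_per_seq, num_seqs):
    """Estimate total KV cache size in bytes for a set of active sequences."""
    # The factor of 2 accounts for storing both keys and values.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
    return per_token * tokens_per_seq * num_seqs


# Assumed shape: 32 layers, 32 KV heads, head_dim 128, 2 bytes per element.
per_token = kv_cache_bytes(32, 32, 128, 2, tokens_per_seq=1, num_seqs=1)
total = kv_cache_bytes(32, 32, 128, 2, tokens_per_seq=4096, num_seqs=16)
print(f"{per_token / 2**20:.2f} MiB per token")
print(f"{total / 2**30:.0f} GiB for 16 sequences of 4096 tokens")
```

Under these assumptions the cache scales linearly in both sequence length and the number of concurrent sequences, which is why either one can exhaust GPU memory.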

35-45 min

Learn why vLLM matters.

Use the canonical vLLM blog to connect PagedAttention to virtual memory and higher-throughput serving.

45-52 min

Draw the request path.

Sketch tokenize, prefill, KV cache allocation, decode loop, streaming, finish condition, cache release, metrics, and logs.

52-60 min

Interview synthesis.

Answer: "Why can a long schema prompt hurt time-to-first-token, and why can many short follow-up queries still pressure decode throughput?"

Course 1 Reading List

Required reading is now three articles, about 40 minutes total. The fourth item is an optional transformer refresher, needed only if the attention and KV-cache basics feel stale.

Required - 12 min

Prefill vs Decode: LLM Inference Phases Explained - Redis

A current, practical explanation of prefill, decode, KV cache, TTFT, and inter-token latency. It gives you the vocabulary interviewers expect when discussing LLM serving behavior.

Extract: prefill is the prompt-processing phase; decode is the token-by-token phase that repeatedly reuses cached K/V state.
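
The two phases can be sketched as a toy generation loop. This is not a real model: the cache holds plain tokens as stand-ins for per-token K/V tensors, and `next_token_fn` is a hypothetical placeholder for the forward pass.

```python
def prefill(prompt_tokens):
    """Process the whole prompt in one pass; one cache entry per prompt token."""
    # Stand-in for per-token K/V tensors: here just the token itself.
    return list(prompt_tokens)


def decode_step(kv_cache, next_token_fn):
    """Produce one token from the cached state, then cache that token's entry."""
    token = next_token_fn(kv_cache)
    kv_cache.append(token)
    return token


def generate(prompt_tokens, max_new_tokens, next_token_fn, eos=None):
    kv_cache = prefill(prompt_tokens)      # one pass over all prompt tokens
    output = []
    for _ in range(max_new_tokens):        # one pass per generated token
        token = decode_step(kv_cache, next_token_fn)
        output.append(token)
        if token == eos:
            break
    return output, len(kv_cache)


# Hypothetical "model": the next token is the current cache length.
out, cache_len = generate(["a", "b", "c"], 4, lambda cache: len(cache))
print(out, cache_len)
```

The cache ends at prompt length plus generated length, and each decode step reuses every earlier entry rather than recomputing it.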

Required - 18 min

Continuous Batching from First Principles - Hugging Face

This is the best fit for Course 1 because it derives KV caching, chunked prefill, and continuous batching from the user-visible behavior of chat systems. It is more directly useful here than a broad optimization survey.

Extract: continuous batching is iteration-level scheduling; it keeps the GPU batch full while requests enter and leave at different times.
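
A toy simulation shows why iteration-level scheduling helps. It counts decode iterations only, ignores prefill, and assumes every request is already queued at time zero. The function names and the workload are illustrative assumptions, not the article's code.

```python
from collections import deque


def static_batching_steps(output_lens, batch_size):
    """Each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(output_lens), batch_size):
        steps += max(output_lens[i:i + batch_size])
    return steps


def continuous_batching_steps(output_lens, batch_size):
    """Free slots are refilled from the queue at every decode iteration."""
    waiting = deque(output_lens)
    active = []  # remaining tokens for each in-flight request
    steps = 0
    while waiting or active:
        while waiting and len(active) < batch_size:
            active.append(waiting.popleft())
        steps += 1
        # Every active request emits one token; finished requests leave.
        active = [r - 1 for r in active if r - 1 > 0]
    return steps


# One long request and three short ones, two batch slots.
lens = [8, 2, 2, 2]
print(static_batching_steps(lens, 2), continuous_batching_steps(lens, 2))
```

In the static case the short request sharing a batch with the long one leaves its slot idle for six iterations; the continuous scheduler hands that slot to the next waiting request instead.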

Required - 10 min

vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention

The canonical vLLM introduction. It explains why KV cache fragmentation limits batching and how PagedAttention applies virtual-memory ideas to LLM serving.

Extract: PagedAttention is a memory-management design; better cache packing lets the serving layer batch more active sequences.
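
The virtual-memory analogy can be sketched as a block allocator with a per-sequence block table. This is a simplified illustration of the idea, not vLLM's implementation or API; the class name, method names, and block size are hypothetical.

```python
class PagedKVAllocator:
    """Toy allocator: fixed-size KV blocks handed out on demand per sequence."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # seq_id -> list of physical block ids
        self.seq_lens = {}      # seq_id -> number of tokens stored

    def append_token(self, seq_id):
        n = self.seq_lens.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        # Allocate a new block only when the last one is full (or none exists).
        if n == len(table) * self.block_size:
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = n + 1

    def release(self, seq_id):
        """Return a finished sequence's blocks to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)


alloc = PagedKVAllocator(num_blocks=4, block_size=16)
for _ in range(20):
    alloc.append_token("a")   # 20 tokens need 2 blocks
for _ in range(5):
    alloc.append_token("b")   # 5 tokens need 1 block
print(len(alloc.block_tables["a"]), len(alloc.block_tables["b"]), len(alloc.free_blocks))
alloc.release("a")
print(len(alloc.free_blocks))
```

Because blocks are allocated as tokens arrive rather than reserved up front for the maximum length, wasted space per sequence is bounded by one partially filled block, which leaves room for more active sequences in the batch.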

Optional - 8 min

The Illustrated Transformer - Jay Alammar

Use only as a quick refresher if Q/K/V, attention, and autoregressive decoding are not fresh. It is background, not a required AI infra article.

Extract: why decoding produces one token at a time even though each forward pass uses highly parallel matrix operations.

Selection Note

The original first-course list was too broad. Redis, Hugging Face, and the vLLM blog are the strongest required set for a one-hour serving-foundations session. Jay Alammar is useful but optional. NVIDIA's inference optimization article is high-quality, but it is better for Course 2 because it introduces too many knobs at once. The Anthropic and OpenAI agent articles move to the agent architecture week.

Course 1 Interview Drill

Use this as the 10-minute close. Speak out loud, as if an OpenAI or Anthropic interviewer asked you to design the system.

Prompt: A BigQuery-style GenAI query assistant has p95 TTFT above target when users ask questions over large schemas. Walk through the likely serving bottlenecks.
Start with the serving path: client request, auth and tenancy, prompt/context assembly, model router, inference server, token streaming, post-processing, metrics.
Name the prefill risks: long schemas, retrieved docs, few-shot examples, policy text, and tool definitions all increase prompt-processing cost and KV cache allocation.
Name the decode risks: long explanations, streaming follow-up analysis, high concurrency, and batch fragmentation can hurt throughput and inter-token latency.
Make the BigQuery connection: schema linking and dry-run validation are product-critical, but you need context selection and prompt caching so they do not explode prefill cost.
Close with tradeoffs: context pruning, prompt/KV caching, continuous batching, model routing, smaller draft models for speculative decoding, SLO-aware admission control, and better eval data.

You Are Ready For Course 2 When You Can Explain

The difference between TTFT and inter-token latency, and which part of inference each one exposes.
Why KV cache grows with request length and concurrency.
How continuous batching differs from traditional request batching.
Why PagedAttention feels like virtual memory for GPU KV cache blocks.
Why long schemas and retrieved context can be a prefill problem before they become a quality problem.
How your query-engine experience maps to AI infra design: planning, routing, validation, observability, and cost.

8-Week Roadmap

Cadence: three sessions per week. The 24-session core runs from now into mid-July, leaving the second half of July for mock interviews, weak-spot repair, and company-specific prep.

Week 1 - Serving fundamentals

LLM request lifecycle: prefill, decode, KV cache, streaming.
Serving engines: vLLM, TGI, TensorRT-LLM, SGLang, routing.
Latency and cost: TTFT, inter-token latency (ITL), throughput, SLOs, capacity planning.

Week 2 - Retrieval and context systems

RAG architecture: ingestion, chunking, embeddings, hybrid search.
Context engineering: ranking, compression, caching, provenance.
Warehouse-aware generation: schema linking, SQL validation, dry runs.

Week 3 - Agent architecture

Workflows vs agents: router, evaluator-optimizer, orchestrator-worker.
Tool use: function calling, MCP, permissions, sandboxing, retries.
Durable execution: state machines, checkpoints, queues, human approval.

Week 4 - Evals and reliability

Eval-driven development: golden sets, LLM judges, human review.
Agent observability: traces, tool receipts, state diffs, replay.
Production failure loops: drift, regressions, red-team cases, rollout gates.

Week 5 - Scaling AI infra

Load balancing, model routing, prompt caches, semantic caches.
GPU scheduling, multi-region tradeoffs, backpressure, quota design.
Long context and multimodal workloads: memory, storage, and streaming.

Week 6 - Data and query intelligence

Text-to-SQL and semantic layers for enterprise analytics.
Query planner analogies for agents: decomposition, cost, execution.
Correctness under uncertainty: grounding, citations, lineage, rollback.

Week 7 - Security, safety, and enterprise controls

Prompt injection, data exfiltration, tool abuse, and least privilege.
Policy and governance: audit logs, approval flows, retention, zero data retention (ZDR).
Safe coding and data agents: sandboxes, secrets, workspace isolation.

Week 8 - Interview execution

Mock: design ChatGPT Enterprise over private data.
Mock: design a Claude/OpenAI coding-agent platform.
Mock: design a BigQuery-native GenAI query and analysis assistant.

AI infra and agentic systems for frontier interviews.