Phase 11 - Lesson 02
Few-Shot, Chain-of-Thought, Tree-of-Thought
Telling a model what to do is prompting. Showing it how to think is engineering. The gap between 78% and 91% accuracy on the same model, same task, same data is not a better model. It is a better reasoning strategy.
Type: Build | Languages: Python | Prerequisites: Lesson 11.01 (Prompt Engineering) | Time: ~45 minutes
Learning Objectives
- Implement few-shot prompting by selecting and formatting example demonstrations that maximize task accuracy
- Apply chain-of-thought (CoT) reasoning to improve accuracy on multi-step problems like math word problems
- Build a tree-of-thought prompt that explores multiple reasoning paths and selects the best one
- Measure the accuracy improvement from zero-shot vs few-shot vs CoT on a standard benchmark
The Problem
You build a math tutoring app. Your prompt says: "Solve this word problem." GPT-5 gets it right 94% of the time on GSM8K, the standard grade-school math benchmark. You think you have already peaked. You have not: chain-of-thought still adds 3-4 points.
Add five words -- "Let's think step by step" -- and accuracy rises to 97%. Add a few worked examples and it reaches 98%. Same model. Same temperature. The only extra cost is a few hundred tokens, and the only difference is that you gave the model scratch paper.
This is not a hack. It is how reasoning works. Humans do not solve multi-step problems in one mental leap. Neither do transformers. When you force a model to generate intermediate tokens, those tokens become part of the context for the next token. Each reasoning step feeds the next. The model literally computes its way to the answer.
But "think step by step" is the beginning, not the end. What if you sampled five reasoning paths and took a majority vote? What if you let the model explore a tree of possibilities, evaluating and pruning branches? What if you interleaved reasoning with tool use? These are not hypotheticals. They are published techniques with measured improvements, and you will build all of them in this lesson.
The Concept
Zero-Shot vs Few-Shot: When Examples Beat Instructions
Zero-shot prompting gives the model a task and nothing else. Few-shot prompting gives it examples first.
Wei et al. (2022) measured this across 8 benchmarks. For simple tasks like sentiment classification, zero-shot and few-shot performed within 2% of each other. For complex tasks like multi-step arithmetic and symbolic reasoning, few-shot improved accuracy by 10-25%.
The intuition: examples are compressed instructions. Instead of describing the output format, you show it. Instead of explaining the reasoning process, you demonstrate it. The model pattern-matches on the examples more reliably than it interprets abstract instructions.
graph TD
subgraph Comparison["Zero-Shot vs Few-Shot"]
direction LR
Z["Zero-Shot\n'Solve this word problem'\nModel guesses format\n78% on GSM8K"]
F["Few-Shot\n'Here are 3 worked examples...\nNow solve this problem'\nModel matches pattern\n85% on GSM8K"]
end
Z ~~~ F
style Z fill:#1a1a2e,stroke:#e94560,color:#fff
style F fill:#1a1a2e,stroke:#51cf66,color:#fff
When few-shot wins: format-sensitive tasks, classification, structured extraction, domain-specific jargon, any task where the model needs to match a specific pattern.
When zero-shot wins: simple factual questions, creative tasks where examples constrain creativity, tasks where finding good examples is harder than writing good instructions.
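A minimal sketch of what "showing instead of describing" looks like in code. The function name `build_few_shot_prompt`, the `Input:`/`Label:` layout, and the sample reviews are illustrative assumptions, not part of the lesson's code file.

```python
def build_few_shot_prompt(examples, query, instruction=""):
    """Format demonstrations, then leave the final label blank for the model."""
    parts = [instruction] if instruction else []
    for ex in examples:
        parts.append(f"Input: {ex['input']}\nLabel: {ex['label']}")
    # Ending on "Label:" invites the model to continue the demonstrated pattern.
    parts.append(f"Input: {query}\nLabel:")
    return "\n\n".join(parts)


# Hypothetical demonstrations: one per label, so the model sees every category.
demo_examples = [
    {"input": "Battery died within a week.", "label": "negative"},
    {"input": "Setup took two minutes and it just works.", "label": "positive"},
    {"input": "It arrived on Tuesday.", "label": "neutral"},
]

few_shot_prompt = build_few_shot_prompt(
    demo_examples,
    "The screen cracked on day one.",
    instruction="Classify the sentiment of each review as positive, negative, or neutral.",
)
print(few_shot_prompt)
```

Every demonstration uses the same layout as the final, unanswered block. That consistency is what lets the model infer the output format without being told.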
Example Selection: Similar Beats Random
Not all examples are equal. Choosing examples similar to the target input outperforms random selection by 5-15% on classification tasks (Liu et al., 2022). Three principles:
- Semantic similarity: pick examples closest to the input in embedding space
- Label diversity: cover all output categories in your examples
- Difficulty matching: match the complexity level of the target problem
The optimal number of examples for most tasks is 3-5. Below 3, the model does not have enough signal to extract the pattern. Above 5, you hit diminishing returns and waste context window tokens. For classification with many labels, use one example per label.
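Similarity-based selection can be sketched as follows. To stay dependency-free, this sketch swaps in bag-of-words cosine similarity as a stand-in for a real embedding model. That substitution, the helper names, and the sample pool are all assumptions for illustration.

```python
import math
import re
from collections import Counter


def vectorize(text):
    """Bag-of-words term counts; a crude stand-in for an embedding vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_examples(pool, query, k=3):
    """Return the k pool entries whose questions are most similar to the query."""
    query_vec = vectorize(query)
    ranked = sorted(
        pool,
        key=lambda ex: cosine(vectorize(ex["question"]), query_vec),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical example pool.
pool = [
    {"question": "Sam has 5 apples and eats 2. How many are left?", "answer": "3"},
    {"question": "A car travels 40 miles per hour for 2 hours. How far does it travel?", "answer": "80"},
    {"question": "A recipe needs 3 cups of flour per cake. How much flour for 4 cakes?", "answer": "12"},
]

query = "A train travels 60 miles per hour for 3 hours. How far does it go?"
closest = select_examples(pool, query, k=1)
print(closest[0]["question"])
```

This covers only the semantic-similarity principle. Label diversity and difficulty matching would need extra constraints layered on top of the ranking.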
Chain-of-Thought: Giving Models Scratch Paper
Chain-of-Thought (CoT) prompting was introduced by Wei et al. (2022) at Google Brain. The idea is simple: instead of asking the model for just the answer, ask it to show its reasoning steps first.
graph LR
subgraph Standard["Standard Prompting"]
Q1["Q: Roger has 5 balls.\nHe buys 2 cans of 3.\nHow many balls?"] --> A1["A: 11"]
end
subgraph CoT["Chain-of-Thought Prompting"]
Q2["Q: Roger has 5 balls.\nHe buys 2 cans of 3.\nHow many balls?"] --> R2["Roger starts with 5.\n2 cans of 3 = 6.\n5 + 6 = 11."] --> A2["A: 11"]
end
style Q1 fill:#1a1a2e,stroke:#e94560,color:#fff
style A1 fill:#1a1a2e,stroke:#e94560,color:#fff
style Q2 fill:#1a1a2e,stroke:#51cf66,color:#fff
style R2 fill:#1a1a2e,stroke:#ffa500,color:#fff
style A2 fill:#1a1a2e,stroke:#51cf66,color:#fff
Why does this work mechanically? Each token a transformer generates becomes context for the next token. Without CoT, the model must compress all reasoning into the hidden state of a single forward pass. With CoT, the model externalizes intermediate computations as tokens. Each reasoning token extends the effective computation depth.
GSM8K benchmarks (grade-school math, 8.5K problems):
| Model | Zero-Shot | Zero-Shot CoT | Few-Shot CoT |
|---|---|---|---|
| GPT-4o | 78% | 91% | 95% |
| GPT-5 | 94% | 97% | 98% |
| o4-mini (reasoning) | 97% | — | — |
| Claude Opus 4.7 | 93% | 97% | 98% |
| Gemini 3 Pro | 92% | 96% | 98% |
| Llama 4 70B | 80% | 89% | 94% |
| DeepSeek-V3.1 | 89% | 94% | 96% |
Note on reasoning models. Models like OpenAI's o-series (o3, o4-mini) and DeepSeek-R1 run chain-of-thought internally before emitting their answer. Adding "Let's think step by step" to a reasoning model is redundant and sometimes counterproductive — they have already done it.
Two flavors of CoT:
Zero-shot CoT: append "Let's think step by step" to the prompt. No examples needed. Kojima et al. (2022) showed this single sentence improves accuracy across arithmetic, commonsense, and symbolic reasoning tasks.
Few-shot CoT: provide examples that include reasoning steps. More effective than zero-shot CoT because the model sees the exact reasoning format you expect.
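The two flavors differ only in how the prompt is assembled. Here is a rough sketch, with the assumption that the final answer is the last number in the completion. The function names and the `reasoning`/`answer` example fields are hypothetical.

```python
import re

COT_TRIGGER = "Let's think step by step."


def zero_shot_cot_prompt(question):
    """Zero-shot CoT: no examples, just the trigger phrase after the question."""
    return f"Q: {question}\nA: {COT_TRIGGER}"


def few_shot_cot_prompt(examples, question):
    """Few-shot CoT: each demonstration shows its reasoning before the answer."""
    blocks = [
        f"Q: {ex['question']}\nA: {ex['reasoning']} The answer is {ex['answer']}."
        for ex in examples
    ]
    blocks.append(f"Q: {question}\nA:")
    return "\n\n".join(blocks)


def extract_answer(completion):
    """Return the last number in the completion as a string, or None.

    Assumes the final answer is the last number the model writes.
    """
    numbers = re.findall(r"-?\d+(?:\.\d+)?", completion.replace(",", ""))
    return numbers[-1] if numbers else None


roger = {
    "question": "Roger has 5 balls. He buys 2 cans of 3. How many balls?",
    "reasoning": "Roger starts with 5. 2 cans of 3 = 6. 5 + 6 = 11.",
    "answer": "11",
}
cot_prompt = few_shot_cot_prompt([roger], "A shelf holds 4 rows of 6 books. How many books?")
print(cot_prompt)
print(extract_answer("Roger starts with 5. 2 cans of 3 = 6. 5 + 6 = 11. The answer is 11."))
```

Because a CoT completion contains many intermediate numbers, the parser has to be explicit about which one counts as the answer. Taking the last number is a simple convention that fails if the model keeps talking after stating its answer.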
When CoT hurts: simple factual recall ("What is the capital of France?"), single-step classification, tasks where speed matters more than accuracy. CoT adds 50-200 tokens of reasoning overhead per query. For high-throughput, low-complexity tasks, that is wasted cost.
Self-Consistency: Sample Many, Vote Once
Wang et al. (2023) introduced self-consistency. The insight: a single CoT path might contain reasoning errors. But if you sample N independent reasoning paths (using temperature > 0) and take the majority vote on the final answer, errors cancel out.
graph TD
P["Problem: 'A store has 48 apples.\nThey sell 1/3 on Monday\nand 1/4 of the rest on Tuesday.\nHow many are left?'"]
P --> Path1["Path 1: 48 - 16 = 32\n32 - 8 = 24\nAnswer: 24"]
P --> Path2["Path 2: 1/3 of 48 = 16\nRemaining: 32\n1/4 of 32 = 8\n32 - 8 = 24\nAnswer: 24"]
P --> Path3["Path 3: 48/3 = 16 sold\n48 - 16 = 32\n32/4 = 8 sold\n32 - 8 = 24\nAnswer: 24"]
P --> Path4["Path 4: Sell 1/3: 48 - 12 = 36\nSell 1/4: 36 - 9 = 27\nAnswer: 27"]
P --> Path5["Path 5: Monday: 48 * 2/3 = 32\nTuesday: 32 * 3/4 = 24\nAnswer: 24"]
Path1 --> V["Majority Vote\n24: 4 votes\n27: 1 vote\nFinal: 24"]
Path2 --> V
Path3 --> V
Path4 --> V
Path5 --> V
style P fill:#1a1a2e,stroke:#ffa500,color:#fff
style Path1 fill:#1a1a2e,stroke:#51cf66,color:#fff
style Path2 fill:#1a1a2e,stroke:#51cf66,color:#fff
style Path3 fill:#1a1a2e,stroke:#51cf66,color:#fff
style Path4 fill:#1a1a2e,stroke:#e94560,color:#fff
style Path5 fill:#1a1a2e,stroke:#51cf66,color:#fff
style V fill:#1a1a2e,stroke:#51cf66,color:#fff
Self-consistency improved GSM8K accuracy from 56.5% (single CoT) to 74.4% with N=40 on the original PaLM 540B experiments. On GPT-5 the improvement is small (97% to 98%) because base accuracy is already saturated. The technique shines most on models with 60-85% base CoT accuracy -- the sweet spot where single-path errors are frequent but not systematic. For reasoning models (o-series, R1) self-consistency is subsumed by the built-in internal sampling.
The tradeoff: N samples means Nx the API cost and latency. In practice, N=5 captures most of the benefit. N=3 is the minimum for a meaningful vote. N > 10 has diminishing returns for most tasks.
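The voting step itself is a few lines. In this sketch, `sample_fn` is a hypothetical callable standing in for an API call at temperature > 0. Here it is replaced by a scripted function that replays the five paths from the diagram above, so the example runs without a model.

```python
import itertools
import re
from collections import Counter


def extract_answer(completion):
    """Assumption: the final answer is the last number in the completion."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", completion.replace(",", ""))
    return numbers[-1] if numbers else None


def self_consistency(sample_fn, prompt, n=5):
    """Sample n reasoning paths and majority-vote on the extracted answers."""
    answers = []
    for _ in range(n):
        answer = extract_answer(sample_fn(prompt))
        if answer is not None:  # skip paths with no parseable answer
            answers.append(answer)
    if not answers:
        return None, Counter()
    votes = Counter(answers)
    return votes.most_common(1)[0][0], votes


# Scripted stand-in for a sampled model: four correct paths, one faulty one.
_paths = itertools.cycle([
    "48 - 16 = 32. 32 - 8 = 24. Answer: 24",
    "1/3 of 48 = 16. Remaining: 32. 1/4 of 32 = 8. 32 - 8 = 24. Answer: 24",
    "48/3 = 16 sold. 48 - 16 = 32. 32/4 = 8 sold. 32 - 8 = 24. Answer: 24",
    "Sell 1/3: 48 - 12 = 36. Sell 1/4: 36 - 9 = 27. Answer: 27",
    "Monday: 48 * 2/3 = 32. Tuesday: 32 * 3/4 = 24. Answer: 24",
])


def scripted_sample(prompt):
    return next(_paths)


winner, votes = self_consistency(scripted_sample, "How many apples are left?", n=5)
print(winner, dict(votes))
```

Voting happens on the extracted final answer, not on the reasoning text, so two paths that reason differently but agree on 24 count as the same vote.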
Tree-of-Thought: Branching Exploration
Yao et al. (2023) introduced Tree-of-Thought (ToT). Where CoT follows one linear reasoning path, ToT explores multiple branches and evaluates which are most promising before continuing.
graph TD
Root["Problem"] --> B1["Thought 1a"]
Root --> B2["Thought 1b"]
Root --> B3["Thought 1c"]
B1 --> E1["Eval: 0.8"]
B2 --> E2["Eval: 0.3"]
B3 --> E3["Eval: 0.9"]
E1 -->|Continue| B1a["Thought 2a"]
E1 -->|Continue| B1b["Thought 2b"]
E3 -->|Continue| B3a["Thought 2a"]
E3 -->|Continue| B3b["Thought 2b"]
E2 -->|Prune| X["X"]
B1a --> E4["Eval: 0.7"]
B3a --> E5["Eval: 0.95"]
E5 -->|Best path| Final["Solution"]
style Root fill:#1a1a2e,stroke:#ffa500,color:#fff
style E2 fill:#1a1a2e,stroke:#e94560,color:#fff
style X fill:#1a1a2e,stroke:#e94560,color:#fff
style E5 fill:#1a1a2e,stroke:#51cf66,color:#fff
style Final fill:#1a1a2e,stroke:#51cf66,color:#fff
style B1 fill:#1a1a2e,stroke:#808080,color:#fff
style B2 fill:#1a1a2e,stroke:#808080,color:#fff
style B3 fill:#1a1a2e,stroke:#808080,color:#fff
style B1a fill:#1a1a2e,stroke:#808080,color:#fff
style B1b fill:#1a1a2e,stroke:#808080,color:#fff
style B3a fill:#1a1a2e,stroke:#808080,color:#fff
style B3b fill:#1a1a2e,stroke:#808080,color:#fff
style E1 fill:#1a1a2e,stroke:#808080,color:#fff
style E3 fill:#1a1a2e,stroke:#808080,color:#fff
style E4 fill:#1a1a2e,stroke:#808080,color:#fff
ToT has three components:
- Thought generation: produce multiple candidate next-steps
- State evaluation: score each candidate (can use the LLM itself as evaluator)
- Search algorithm: BFS or DFS through the tree, pruning low-scoring branches
On the Game of 24 task (combine 4 numbers using arithmetic to make 24), GPT-4 with standard prompting solves 7.3% of problems. With CoT, 4.0% (CoT actually hurts here because the search space is wide). With ToT, 74%.
ToT is expensive. Each node in the tree requires an LLM call. A tree with branching factor 3 and depth 3 has 3 + 9 + 27 = 39 nodes, so it requires up to 39 LLM calls, and more if each evaluation is a separate call. Use it only for problems where the search space is large but evaluatable -- planning, puzzle solving, creative problem-solving with constraints.
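The three components map onto a small breadth-first search with pruning. This is a sketch under stated assumptions: `generate`, `evaluate`, and `is_solution` are hypothetical callables that would wrap LLM calls in practice. Here they are plain arithmetic on a toy puzzle (reach 14 from 1 using "+3" or "*2") so the search is deterministic.

```python
def tree_of_thought_bfs(root, generate, evaluate, is_solution, beam_width=2, max_depth=3):
    """Breadth-first ToT: expand every kept state, score children, keep the best few."""
    frontier = [root]
    for _ in range(max_depth):
        candidates = []
        for state in frontier:
            for child in generate(state):      # thought generation
                if is_solution(child):
                    return child
                candidates.append(child)
        if not candidates:
            return None
        candidates.sort(key=evaluate, reverse=True)  # state evaluation
        frontier = candidates[:beam_width]           # prune low-scoring branches
    return None


def max_generation_calls(branching, depth):
    """Upper bound on node count (one call per node) for an unpruned tree."""
    return sum(branching ** level for level in range(1, depth + 1))


# Toy stand-ins for the LLM: a state is (current value, operations applied so far).
TARGET = 14


def generate(state):
    value, path = state
    return [(value + 3, path + ("+3",)), (value * 2, path + ("*2",))]


def evaluate(state):
    return -abs(TARGET - state[0])  # closer to the target scores higher


def is_solution(state):
    return state[0] == TARGET


solution = tree_of_thought_bfs((1, ()), generate, evaluate, is_solution)
print(solution)
print(max_generation_calls(3, 3))
```

The `beam_width` parameter is what keeps cost below the unpruned bound: only the top-scoring states at each level are expanded further.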
ReAct: Thinking + Doing
Yao et al. (2022) combined reasoning traces with actions. The model alternates between thinking (generating reasoning) and acting (calling tools, searching, computing).
graph LR
Q["Question:\nWhat is the\npopulation of the\ncountry where\nthe Eiffel Tower\nis located?"]
T1["Thought: I need to\nfind which country\nhas the Eiffel Tower"]
A1["Action: search\n'Eiffel Tower location'"]
O1["Observation:\nParis, France"]
T2["Thought: Now I need\nFrance's population"]
A2["Action: search\n'France population 2024'"]
O2["Observation:\n68.4 million"]
T3["Thought: I have\nthe answer"]
F["Answer:\n68.4 million"]
Q --> T1 --> A1 --> O1 --> T2 --> A2 --> O2 --> T3 --> F
style Q fill:#1a1a2e,stroke:#ffa500,color:#fff
style T1 fill:#1a1a2e,stroke:#51cf66,color:#fff
style A1 fill:#1a1a2e,stroke:#e94560,color:#fff
style O1 fill:#1a1a2e,stroke:#808080,color:#fff
style T2 fill:#1a1a2e,stroke:#51cf66,color:#fff
style A2 fill:#1a1a2e,stroke:#e94560,color:#fff
style O2 fill:#1a1a2e,stroke:#808080,color:#fff
style T3 fill:#1a1a2e,stroke:#51cf66,color:#fff
style F fill:#1a1a2e,stroke:#51cf66,color:#fff
ReAct can outperform pure CoT on knowledge-intensive tasks because it grounds its reasoning in real data. In the original PaLM 540B experiments on HotpotQA (multi-hop question answering), the best configuration, ReAct with a fallback to self-consistency CoT, reached 35.1% exact match vs 29.4% for CoT alone. The real power is that reasoning errors get corrected by observations -- the model can update its plan mid-execution.
ReAct is the foundation of modern AI agents. Every agent framework (LangChain, CrewAI, AutoGen) implements some variant of the Thought-Action-Observation loop. You will build full agents in Phase 14. This lesson covers the prompting pattern.
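The Thought-Action-Observation loop can be sketched in a few lines. The `Action: tool[argument]` syntax, the `react_loop` name, and the scripted model and search tool below are assumptions for illustration. A real agent would replace `model_fn` with an LLM call and `tools` with real functions.

```python
import re

ACTION_RE = re.compile(r"Action:\s*(\w+)\[(.*?)\]")
ANSWER_RE = re.compile(r"Answer:\s*(.+)")


def react_loop(model_fn, tools, question, max_steps=5):
    """Alternate model steps with tool calls until the model emits an answer."""
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        step = model_fn(transcript)
        transcript += step + "\n"
        answer = ANSWER_RE.search(step)
        if answer:
            return answer.group(1).strip(), transcript
        action = ACTION_RE.search(step)
        if action:
            name, argument = action.groups()
            tool = tools.get(name)
            observation = tool(argument) if tool else f"unknown tool '{name}'"
            # The observation is appended so the next step can reason over it.
            transcript += f"Observation: {observation}\n"
    return None, transcript


# Scripted stand-ins replaying the Eiffel Tower trace from the diagram above.
SCRIPT = [
    "Thought: I need to find which country has the Eiffel Tower.\n"
    "Action: search[Eiffel Tower location]",
    "Thought: Now I need France's population.\n"
    "Action: search[France population 2024]",
    "Thought: I have the answer.\nAnswer: 68.4 million",
]
FAKE_INDEX = {
    "Eiffel Tower location": "Paris, France",
    "France population 2024": "68.4 million",
}


def scripted_model(transcript):
    # Pick the next scripted step based on how many observations exist so far.
    return SCRIPT[transcript.count("Observation:")]


tools = {"search": lambda query: FAKE_INDEX.get(query, "no results")}
answer, transcript = react_loop(scripted_model, tools, "Population of the Eiffel Tower's country?")
print(transcript)
```

The `max_steps` cap matters in practice: without it, a model that never emits an answer would loop, and pay for tool calls, indefinitely.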
Structured Prompting: XML Tags, Delimiters, Headers
As prompts get complex, structure prevents the model from confusing sections. Three approaches:
XML tags (works best with Claude, solid everywhere):
<context>
You are reviewing a pull request.
The codebase uses TypeScript and React.
</context>
<task>
Review the following diff for bugs, security issues, and style violations.
</task>
<diff>
{diff_content}
</diff>
<output_format>
List each issue with: file, line, severity (critical/warning/info), description.
</output_format>
Markdown headers (universal):
## Role
Senior security engineer at a fintech company.
## Task
Analyze this API endpoint for vulnerabilities.
## Input
{api_code}
## Rules
- Focus on OWASP Top 10
- Rate each finding: critical, high, medium, low
- Include remediation steps
Delimiters (minimal but effective):
---INPUT---
{user_text}
---END INPUT---
---INSTRUCTIONS---
Summarize the above in 3 bullet points.
---END INSTRUCTIONS---
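Assembling tagged sections by hand gets error-prone once inputs are dynamic. Here is a small sketch of a helper for the XML-tag style. The name `tagged_prompt` and the closing-tag guard are assumptions, not an established API.

```python
def tagged_prompt(sections):
    """Wrap each section body in a matching XML-style open/close tag pair."""
    blocks = []
    for tag, body in sections.items():
        # Guard: input that contains the closing tag would end the section early.
        if f"</{tag}>" in body:
            raise ValueError(f"section '{tag}' contains its own closing tag")
        blocks.append(f"<{tag}>\n{body.strip()}\n</{tag}>")
    return "\n".join(blocks)


review_prompt = tagged_prompt({
    "context": "You are reviewing a pull request.\nThe codebase uses TypeScript and React.",
    "task": "Review the following diff for bugs, security issues, and style violations.",
    "diff": "- const x = 1\n+ const x = 2",  # placeholder diff content
    "output_format": "List each issue with: file, line, severity (critical/warning/info), description.",
})
print(review_prompt)
```

The guard illustrates why delimiters need care with untrusted input: text that contains the delimiter itself can blur the boundary between data and instructions.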
Prompt Chaining: Sequential Decomposition
Some tasks are too complex for a single prompt. Prompt chaining breaks them into steps, where the output of one prompt becomes the input of the next.
graph LR
I["Raw Input"] --> P1["Prompt 1:\nExtract\nkey facts"]
P1 --> O1["Facts"]
O1 --> P2["Prompt 2:\nAnalyze\nfacts"]
P2 --> O2["Analysis"]
O2 --> P3["Prompt 3:\nGenerate\nrecommendation"]
P3 --> F["Final Output"]
style I fill:#1a1a2e,stroke:#808080,color:#fff
style P1 fill:#1a1a2e,stroke:#e94560,color:#fff
style O1 fill:#1a1a2e,stroke:#ffa500,color:#fff
style P2 fill:#1a1a2e,stroke:#e94560,color:#fff
style O2 fill:#1a1a2e,stroke:#ffa500,color:#fff
style P3 fill:#1a1a2e,stroke:#e94560,color:#fff
style F fill:#1a1a2e,stroke:#51cf66,color:#fff
Chaining beats single-prompt for three reasons:
- Each step is simpler: the model handles one focused task instead of juggling everything
- Intermediate outputs are inspectable: you can validate and correct between steps
- Different steps can use different models: use a cheap model for extraction, an expensive one for reasoning
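A chain can be sketched as a loop over (name, template, validator) steps. The `run_chain` helper, the step templates, and the `fake_model` function are hypothetical stand-ins so the example runs without an API.

```python
def run_chain(steps, raw_input, call_model):
    """Feed each step's output into the next prompt, validating between steps."""
    current = raw_input
    trace = []
    for name, template, validate in steps:
        prompt = template.format(input=current)
        current = call_model(prompt)
        if not validate(current):  # inspect the intermediate output before continuing
            raise ValueError(f"step '{name}' produced invalid output: {current!r}")
        trace.append((name, current))
    return current, trace


def fake_model(prompt):
    """Deterministic stand-in for an LLM call, keyed on the prompt's first word."""
    if prompt.startswith("Extract"):
        return "- revenue fell 12%\n- churn rose 3 points"
    if prompt.startswith("Analyze"):
        return "Churn is the likely driver of the revenue decline."
    return "Recommendation: prioritize retention work this quarter."


steps = [
    ("extract", "Extract the key facts from this report:\n{input}", lambda out: out.startswith("- ")),
    ("analyze", "Analyze these facts:\n{input}", lambda out: len(out) > 0),
    ("recommend", "Generate a recommendation from this analysis:\n{input}", lambda out: "Recommendation" in out),
]

final, trace = run_chain(steps, "Q3 report text goes here", fake_model)
print(final)
```

Because `call_model` is passed in, swapping a cheap model in for extraction and an expensive one for reasoning only requires giving each step its own callable. That extension is left out here for brevity.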
Performance Comparison
| Technique | Best For | GSM8K Accuracy (GPT-5) | API Calls | Token Overhead | Complexity |
|---|---|---|---|---|---|
| Zero-Shot | Simple tasks | 94% | 1 | None | Trivial |
| Few-Shot | Format matching | 96% | 1 | 200-500 tokens | Low |
| Zero-Shot CoT | Quick reasoning boost | 97% | 1 | 50-200 tokens | Trivial |
| Few-Shot CoT | Maximum single-call accuracy | 98% | 1 | 300-600 tokens | Low |
| Self-Consistency (N=5) | High-stakes reasoning | 98.5% | 5 | 5x token cost | Medium |
| Reasoning model (o4-mini) | Drop-in CoT replacement | 97% | 1 | hidden (2-10x internal) | Trivial |
| Tree-of-Thought | Search/planning problems | N/A (74% on Game of 24) | 10-40+ | 10-40x token cost | High |
| ReAct | Knowledge-grounded reasoning | N/A (35.1% on HotpotQA) | 3-10+ | Variable | High |
| Prompt Chaining | Complex multi-step tasks | 96% (pipeline) | 2-5 | 2-5x token cost | Medium |
The right technique depends on three factors: accuracy requirement, latency budget, and cost tolerance. For most production systems, few-shot CoT with a 3-sample self-consistency fallback covers 90% of use cases.
Build It
We will build a math problem solver that combines few-shot prompting, chain-of-thought reasoning, and self-consistency voting into a single pipeline. Then we will add tree-of-thought for hard problems.
The full implementation is in code/advanced_prompting.py. Here are the key components.
Step 1: Few-Shot Example Store
The first component manages few-shot examples and selects the most relevant ones for a given problem.
GSM8K_EXAMPLES = [
{
"question": "Janet's ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells every egg at the farmers' market for