Most prompt engineering guides list techniques without evidence. This one cites the original papers, benchmark numbers, and real performance data behind every method.
Twelve prompting techniques now have peer-reviewed research backing them. Each solves a different problem, and choosing the wrong one wastes tokens and degrades results.
This guide covers every major technique from zero-shot to constitutional AI prompting. You'll get the research, the benchmarks, and copy-paste templates for each one.
What Are Prompt Engineering Techniques?
Prompt engineering techniques are structured methods for writing AI inputs that improve output quality. They range from adding examples to orchestrating multi-step reasoning chains.
The field began with Brown et al.'s 2020 GPT-3 paper at NeurIPS. That research proved large language models could learn tasks from examples in the prompt itself.
Since then, Google, Anthropic, Princeton, and others have published techniques that push accuracy 10–40% higher on reasoning benchmarks.
Not every technique works for every task. The Wharton School's 2025 Prompting Science Report found that chain of thought prompting adds negligible benefit on reasoning models that already think step-by-step. Matching technique to task matters more than memorizing every method.
Zero-Shot Prompting: The Baseline
Zero-shot prompting gives the model a task with no examples. You describe what you want. The model figures out the rest.
Brown et al. demonstrated this in their 2020 GPT-3 paper. GPT-3 achieved 81.5 F1 on CoQA reading comprehension with zero examples.
That number climbed to 85.0 F1 with few-shot examples — a modest but meaningful gain.
Modern models handle zero-shot far better than GPT-3 did. Claude, GPT-4, and Gemini are instruction-tuned, which means they follow directions without needing examples.
Use zero-shot when: the task is straightforward, the model is instruction-tuned, or you need to conserve tokens.
Classify the following customer email as one of:
Billing, Technical Support, Sales, or General Inquiry.
Respond with only the category name.
Email: "I can't log into my account after resetting my password."
Tip
Zero-shot works best for classification, summarization, and translation. Add examples only when zero-shot accuracy falls short.
Few-Shot Prompting: Teaching by Example
Few-shot prompting provides examples of correct input-output pairs inside the prompt. The model learns the pattern and applies it to new inputs.
Brown et al.'s GPT-3 paper proved this approach at NeurIPS 2020. GPT-3 achieved 71.2% accuracy on TriviaQA in the few-shot setting — up from 64.3% in zero-shot.
The jump was even larger on SuperGLUE. Eight examples performed comparably to fine-tuned BERT models trained on 630,000 examples.
The key insight: larger models benefit more from examples. Brown et al. found that the gap between zero-shot and few-shot performance grows with model size.
How many examples do you need?
Three to five examples hit the sweet spot for most tasks. More examples eat context window space without proportional accuracy gains.
Classify the sentiment of each product review.
Review: "This laptop is incredibly fast and lightweight."
Sentiment: Positive
Review: "Battery died after two months. Terrible quality."
Sentiment: Negative
Review: "It's okay for the price, nothing special."
Sentiment: Neutral
Review: "The camera quality blew me away on this phone."
Sentiment:
For a deeper comparison of when to use each, see our zero-shot vs. few-shot guide.
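Assembled programmatically, a few-shot prompt is just the task description, the labeled pairs, and the new input joined together. A minimal sketch of that assembly; the helper name and structure are illustrative, not from any library:

```python
# Labeled (input, output) pairs that demonstrate the pattern.
EXAMPLES = [
    ("This laptop is incredibly fast and lightweight.", "Positive"),
    ("Battery died after two months. Terrible quality.", "Negative"),
    ("It's okay for the price, nothing special.", "Neutral"),
]

def build_few_shot_prompt(task: str, examples: list[tuple[str, str]], query: str) -> str:
    """Concatenate the task description, labeled examples, and the new input."""
    blocks = [task]
    for review, label in examples:
        blocks.append(f'Review: "{review}"\nSentiment: {label}')
    # The trailing "Sentiment:" cues the model to complete the pattern.
    blocks.append(f'Review: "{query}"\nSentiment:')
    return "\n\n".join(blocks)

prompt = build_few_shot_prompt(
    "Classify the sentiment of each product review.",
    EXAMPLES,
    "The camera quality blew me away on this phone.",
)
print(prompt)
```

Keeping examples in a list like this also makes it easy to test how accuracy changes as you add or remove demonstrations.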
Chain of Thought Prompting: Step-by-Step Reasoning
Chain of thought (CoT) prompting tells the model to show its reasoning before giving an answer. This single change unlocked complex reasoning in large language models.
Wei et al. published the foundational CoT paper at NeurIPS 2022 through Google Research. Their headline result: prompting PaLM 540B with eight chain-of-thought examples achieved state-of-the-art accuracy on the GSM8K math benchmark.
It surpassed even fine-tuned GPT-3 with a verifier. CoT improved performance across arithmetic, commonsense, and symbolic reasoning tasks.
Zero-shot CoT: The "Think Step by Step" Trick
Kojima et al. (2022) discovered something surprising. Adding "Let's think step by step" to a prompt — with no examples — improved reasoning performance.
This zero-shot variant works because large models already have latent reasoning abilities. The phrase activates them.
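Kojima et al. actually used the trigger in a two-stage pipeline: a first prompt elicits free-form reasoning, and a second prompt feeds that reasoning back with an answer-extraction phrase. A minimal sketch; the function names are ours, the template phrases follow the paper:

```python
def stage_one(question: str) -> str:
    # Stage 1: elicit free-form reasoning with the trigger phrase.
    return f"Q: {question}\nA: Let's think step by step."

def stage_two(question: str, reasoning: str) -> str:
    # Stage 2: feed the reasoning back and cue a short final answer.
    return f"{stage_one(question)} {reasoning}\nTherefore, the answer is"

question = "A jacket costs $150 after a 25% discount. What was the original price?"
prompt1 = stage_one(question)
# The reasoning string below stands in for the model's stage-1 output.
prompt2 = stage_two(question, "The sale price is 75% of the original, so divide 150 by 0.75.")
```

In practice the second stage matters: without it, the model's numeric answer is buried somewhere inside the reasoning text.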
The nuance: Meincke, Mollick, et al.'s Wharton Prompting Science Report (2025) found that CoT benefits vary by model type. For non-reasoning models, CoT improves average performance.
For dedicated reasoning models like o1 and DeepSeek-R1, explicit CoT prompting adds negligible benefit. The reasoning is already built in.
Solve this step by step:
A store offers 25% off all items. An additional 10% loyalty
discount applies after the first discount. If a jacket
originally costs $200, what is the final price?
Think through each discount step before giving the answer.
For a complete breakdown of this technique, read our chain of thought prompting guide.
Tree of Thoughts: Exploring Multiple Paths
Tree of Thoughts (ToT) extends chain of thought by exploring multiple reasoning paths simultaneously. Instead of following one chain, the model generates several, evaluates them, and backtracks when needed.
Yao et al. introduced ToT at NeurIPS 2023 through Princeton and Google DeepMind. The framework uses search algorithms like breadth-first and depth-first search to navigate a tree of reasoning steps.
The performance gap is dramatic. On the Game of 24 benchmark, CoT prompting solved only 4% of problems.
ToT solved 74%. The difference comes from ToT's ability to try multiple approaches and abandon dead ends.
The tradeoff: ToT uses significantly more tokens and API calls. Each step generates multiple candidates, and each candidate gets evaluated. For simple tasks, this overhead isn't worth it.
Use ToT when: the problem has multiple valid solution paths, requires strategic planning, or involves constraint satisfaction like puzzles and scheduling.
Three experts will solve this problem independently.
Each expert shares their reasoning step by step.
If any expert realizes their approach won't work,
they backtrack and try a different path.
After all experts present their solutions,
they vote on the best answer.
Problem: Using the numbers 2, 3, 5, and 12 with basic
arithmetic operations (+, -, *, /), make the number 24.
Each number must be used exactly once.
Warning
ToT can cost 10-50x more tokens than standard prompting. Reserve it for high-stakes problems where accuracy matters more than cost.
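In the paper, an LLM both proposes candidate thoughts and scores them. To keep a sketch self-contained and runnable, the version below replaces both roles with exhaustive arithmetic over a Game of 24 state space, but the breadth-first skeleton (expand states, discard dead ends, stop at a goal) is the same shape ToT searches:

```python
from fractions import Fraction
from itertools import combinations

def solve_24(numbers):
    # Each state is (remaining values, steps so far). BFS expands a state
    # by combining two values with one arithmetic operation, so each
    # level of the tree applies one more operation.
    frontier = [(tuple(Fraction(n) for n in numbers), ())]
    while frontier:
        nxt = []
        for values, steps in frontier:
            if len(values) == 1:
                if values[0] == 24:
                    return list(steps)  # a complete derivation of 24
                continue  # dead end: abandon this branch
            for i, j in combinations(range(len(values)), 2):
                a, b = values[i], values[j]
                rest = tuple(v for k, v in enumerate(values) if k not in (i, j))
                ops = [(a + b, f"{a} + {b}"), (a * b, f"{a} * {b}"),
                       (a - b, f"{a} - {b}"), (b - a, f"{b} - {a}")]
                if b != 0:
                    ops.append((a / b, f"{a} / {b}"))
                if a != 0:
                    ops.append((b / a, f"{b} / {a}"))
                for value, desc in ops:
                    nxt.append((rest + (value,), steps + (f"{desc} = {value}",)))
        frontier = nxt
    return None  # no derivation reaches 24

print(solve_24([2, 3, 5, 12]))
```

Exact `Fraction` arithmetic matters here: a valid derivation for these numbers goes through 5/2, which floating point would only approximate.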
Self-Consistency: Majority Vote Reasoning
Self-consistency generates multiple reasoning paths for the same question, then picks the answer that appears most often. Think of it as a reliability layer on top of chain of thought.
Wang et al. published this technique through Google Research in 2022. Their paper reported striking improvements: +17.9% on GSM8K, +11.0% on SVAMP, and +12.2% on AQuA.
Additional gains appeared on StrategyQA (+6.4%) and ARC-challenge (+3.9%).
The intuition is elegant. A complex problem usually has multiple valid reasoning paths that lead to the same correct answer. By sampling diverse paths and taking the majority vote, you filter out one-off reasoning errors.
Cost consideration: Self-consistency requires generating 5–10 responses per question. Wang et al. found diminishing returns beyond 10 samples.
I will solve this problem 5 different ways, then compare
the answers to find the most reliable one.
Problem: A train travels 120 km at 60 km/h, then 80 km
at 40 km/h. What is the average speed for the entire trip?
Approach 1: [solve using total distance / total time]
Approach 2: [solve by calculating each segment separately]
Approach 3: [solve using the harmonic mean formula]
...
Final answer: [most common answer across all approaches]
Tip
Self-consistency shines on math, logic, and multi-step reasoning. It's less useful for creative or open-ended tasks where multiple valid answers exist.
ReAct: Reasoning Plus Acting
ReAct combines chain-of-thought reasoning with the ability to take actions — like searching the web, querying databases, or calling APIs. The model alternates between thinking and acting.
Yao et al. (2022) introduced ReAct through Princeton University. The framework interleaves reasoning traces with task-specific actions. The model thinks about what it knows, decides what information it needs, takes an action to get it, then reasons about the result.
On the HotPotQA benchmark, ReAct outperformed pure acting (no reasoning) on both question-answering and fact-verification tasks. The authors found that combining ReAct with CoT and self-consistency outperformed all individual methods.
ReAct's real power is grounding. Standard prompting relies entirely on the model's training data, which can be outdated or incomplete. ReAct lets the model fetch current information during reasoning.
ReAct is the foundation of modern AI agents. Tools like LangChain, AutoGPT, and Claude's tool-use all implement variants of the Thought → Action → Observation loop that ReAct pioneered.
Answer the following question by reasoning step by step
and searching for information when needed.
Question: What was the GDP growth rate of India in 2025?
Thought 1: I need current economic data for India's 2025
GDP growth. My training data may be outdated.
Action 1: Search "India GDP growth rate 2025 official data"
Observation 1: [search results would appear here]
Thought 2: Based on the search results, I can now answer.
Answer: [final answer with source citation]
Info
ReAct requires tool integration to reach its full potential. In a standard chat interface, you can simulate the pattern — but real ReAct needs the model to call external APIs.
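The control flow is easy to see in code. Below, a scripted function plays the model and a stub plays the search tool; in production both would be real, and the `Action: tool["argument"]` format and parser are illustrative, not a library API:

```python
TOOLS = {
    # Stub tool; a real agent would call a live search API here.
    "search": lambda query: f"Stub search result for: {query}",
}

def scripted_model(transcript: str) -> str:
    # Stand-in for an LLM call: request a search first, then answer.
    if "Observation:" not in transcript:
        return ('Thought: My training data may be stale.\n'
                'Action: search["India GDP growth rate 2025"]')
    return ("Thought: The observation gives me what I need.\n"
            "Final Answer: answer grounded in the observation")

def run_agent(question: str, max_steps: int = 5) -> str:
    transcript = f"Question: {question}"
    for _ in range(max_steps):
        step = scripted_model(transcript)
        transcript += "\n" + step
        if "Final Answer:" in step:
            return step.split("Final Answer:", 1)[1].strip()
        # Parse `Action: tool["argument"]`, run the tool, append the result.
        tool, arg = step.split("Action: ", 1)[1].split("[", 1)
        observation = TOOLS[tool.strip()](arg.strip('"]'))
        transcript += f"\nObservation: {observation}"
    return "no answer within step budget"

print(run_agent("What was the GDP growth rate of India in 2025?"))
```

The step budget is the important safety valve: without it, a confused model can loop on the same action indefinitely.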
Meta-Prompting: Prompts That Write Prompts
Meta-prompting asks the AI to generate or improve prompts rather than performing the task directly. You instruct the model to write the best possible prompt for a given goal.
This technique leverages the model's understanding of what makes instructions effective. Zhou et al.'s 2022 paper "Large Language Models Are Human-Level Prompt Engineers" showed that AI-generated prompts can match or exceed human-written ones on benchmark tasks.
Meta-prompting works in two directions. Forward meta-prompting asks the model to create a prompt for a task. Reverse meta-prompting gives the model an output and asks it to infer what prompt would produce it.
You are a prompt engineering expert. Write the most
effective prompt for the following task:
Task: Get an AI to write a detailed product comparison
between two SaaS tools, including pricing, features,
pros/cons, and a recommendation.
Requirements for the prompt you write:
- Specify the output format clearly
- Include role assignment
- Request specific data points
- Set the appropriate tone and length
Write only the prompt, nothing else.
Compare Notion and Coda for me.
[Meta-prompt generates a detailed, structured prompt with role, format, criteria, and tone specifications]
SurePrompts' AI prompt generator automates meta-prompting. You describe what you need in plain English, and it builds a structured prompt with role, context, and format specifications.
Role and Persona Prompting: Setting the Expert
Role prompting assigns the model a specific identity, expertise level, and perspective before giving it a task. "You are a senior tax accountant" produces different output than "Answer this tax question."
The technique works because language models adjust their vocabulary, depth, and reasoning patterns based on the role they're given. A prompt assigning the "experienced pediatrician" role will use medical terminology appropriately and consider age-specific factors.
Persona prompting goes deeper than role assignment. It includes communication style, priorities, and constraints. A "startup CTO evaluating vendors" persona produces different analysis than a "Fortune 500 procurement officer" persona — even when asked the same question.
You are a senior cybersecurity analyst with 15 years of
experience in penetration testing and incident response.
You specialize in cloud infrastructure security for
financial services companies.
Analyze the following AWS architecture diagram for
security vulnerabilities. Prioritize findings by risk
level (Critical, High, Medium, Low). For each finding,
include: the vulnerability, potential impact, and
specific remediation steps.
[Architecture description here]
Tip
Stack roles with expertise levels for better results. "Senior data scientist specializing in NLP" outperforms "data scientist" on technical NLP tasks.
Prompt Chaining: Breaking Complex Tasks Apart
Prompt chaining splits a complex task into sequential steps, where each prompt's output feeds into the next one as input. Instead of asking one prompt to do everything, you build a pipeline.
The approach mirrors how humans handle complex work. A researcher doesn't write a paper in one sitting — they outline, draft sections, revise, and edit. Prompt chaining brings that same workflow to AI.
Prompt 1 — Research and gather key facts on the topic
Prompt 2 — Create an outline using the research output
Prompt 3 — Write each section based on the outline
Prompt 4 — Edit for clarity, accuracy, and tone
Prompt 5 — Generate a summary and headline options
Each step can use a different technique. Step 1 might use ReAct for research, Step 3 might use role prompting for voice, and Step 4 might use self-consistency for quality checking.
Chaining also reduces hallucination. When one prompt handles everything, errors compound invisibly. With chains, you can verify each step's output before passing it forward.
# Step 1: Extract key data points
Extract all numerical claims, statistics, and dates
from the following article. Output as a numbered list.
[Article text]
# Step 2: Verify claims (separate prompt)
For each data point below, assess whether it is
plausible and consistent with publicly available data.
Flag any that seem incorrect or unverifiable.
[Output from Step 1]
# Step 3: Write summary (separate prompt)
Using only the verified data points below, write a
3-paragraph summary of the article's key findings.
[Verified output from Step 2]
For detailed implementation patterns, see our prompt chaining guide.
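The three-step chain above reduces to ordinary function composition once each prompt is wrapped in a call. A sketch in which `call_model` is a hypothetical placeholder for a real LLM API call:

```python
def call_model(prompt: str) -> str:
    # Placeholder: a real chain would send `prompt` to an LLM API here.
    return f"<model output for: {prompt.splitlines()[0]}>"

def summarize_article(article: str) -> str:
    # Step 1: extract data points from the raw article.
    extracted = call_model(
        "Extract all numerical claims, statistics, and dates from the "
        f"following article as a numbered list.\n\n{article}"
    )
    # Step 2: verify, using only Step 1's output as input.
    verified = call_model(
        "Flag any data point below that seems incorrect or unverifiable.\n\n"
        f"{extracted}"
    )
    # Step 3: summarize, using only the verified list from Step 2.
    return call_model(
        "Using only the verified data points below, write a 3-paragraph "
        f"summary of the article's key findings.\n\n{verified}"
    )

print(summarize_article("Revenue grew 40% in 2024, up from 12% in 2023."))
```

Because each intermediate value is a plain string, you can log, inspect, or unit-test every stage before its output feeds the next prompt.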
Constitutional AI Prompting: Built-In Guardrails
Constitutional AI (CAI) prompting gives the model a set of principles to self-evaluate and revise its own outputs. Instead of relying on human reviewers to catch problems, the model critiques itself.
Bai et al. introduced constitutional AI through Anthropic in December 2022. The core idea: give the model a "constitution" — a set of written rules — and have it critique, then revise, its own responses against those rules. The approach uses self-critique and revision without human-labeled harmful content.
The key benefit is scalability. Human review doesn't scale when models generate millions of responses daily. CAI lets the model enforce principles like helpfulness, harmlessness, and honesty autonomously.
As a prompting technique, you can apply constitutional principles to any model. Define your rules. Ask the model to generate, critique, and revise.
Generate a response to the user question below. Then
critique your response against these principles:
Principles:
1. Be helpful and directly answer the question
2. Acknowledge uncertainty — don't present guesses as facts
3. Avoid harmful, biased, or misleading content
4. Cite sources when making factual claims
5. Be concise — no unnecessary padding
User question: "What supplements should I take for anxiety?"
Step 1: Write your initial response.
Step 2: Critique the response against each principle.
Step 3: Write a revised response addressing the critique.
Warning
Constitutional prompting adds latency and tokens. Use it for high-stakes outputs — medical advice, legal content, financial recommendations — where self-checking prevents harm.
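The generate, critique, revise loop from the template above can also be driven programmatically against any model. A control-flow sketch in which a logging stub stands in for real LLM calls; the principle list and function names are illustrative:

```python
PRINCIPLES = [
    "Be helpful and directly answer the question",
    "Acknowledge uncertainty rather than presenting guesses as facts",
    "Avoid harmful, biased, or misleading content",
]

def constitutional_pass(question: str, model) -> str:
    # `model` is any callable prompt -> text (an LLM client in practice).
    draft = model(f"Answer the user question: {question}")
    critique = model(
        "Critique the draft below against each principle:\n- "
        + "\n- ".join(PRINCIPLES)
        + f"\n\nDraft: {draft}"
    )
    return model("Revise the draft to address the critique.\n"
                 f"Draft: {draft}\nCritique: {critique}")

# Stub that records each call so the three-stage flow is visible.
calls = []
def stub(prompt: str) -> str:
    calls.append(prompt)
    return f"stage-{len(calls)} output"

final = constitutional_pass("What supplements should I take for anxiety?", stub)
```

Each stage is a separate model call, which is exactly where the extra latency and token cost mentioned above comes from.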
Structured Output Prompting: Controlling the Format
Structured output prompting constrains the model's response to a specific format — JSON, XML, Markdown tables, YAML, or custom schemas. This is essential for any application where AI output feeds into downstream code.
Without structure, parsing AI output becomes fragile string manipulation. With it, you get reliable, machine-readable data.
Modern models support structured outputs natively. OpenAI's API offers JSON mode and function calling, Claude supports tool use with defined schemas, and Gemini has structured output parameters.
Extract the following information from this job posting
and return it as valid JSON. Use null for any field not
found in the text.
{
"job_title": "string",
"company": "string",
"location": "string",
"salary_min": "number or null",
"salary_max": "number or null",
"experience_years": "number or null",
"remote_policy": "remote | hybrid | onsite | null",
"required_skills": ["string"],
"nice_to_have_skills": ["string"]
}
Job posting:
[paste job posting here]
Tip
Always provide an example of the exact output format you want. Models follow demonstrated structure more reliably than described structure.
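On the consuming side, parse defensively: models occasionally wrap JSON in prose, so locate the object before decoding and validate required keys. A sketch; the key set and helper name are illustrative:

```python
import json

REQUIRED_KEYS = {"job_title", "company", "location", "required_skills"}

def parse_job_posting(raw: str) -> dict:
    """Extract the first JSON object from a model response and check keys."""
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end == -1:
        raise ValueError("no JSON object found in response")
    data = json.loads(raw[start : end + 1])
    missing = REQUIRED_KEYS - data.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    return data

# A typical model reply: valid JSON wrapped in a sentence of prose.
reply = (
    "Here is the extraction:\n"
    '{"job_title": "ML Engineer", "company": "Acme", "location": "Berlin",'
    ' "salary_min": null, "required_skills": ["Python"]}'
)
job = parse_job_posting(reply)
```

Native JSON modes and function calling reduce how often this fallback triggers, but validation still catches missing fields.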
System Prompts and Custom Instructions
System prompts set persistent instructions that govern every response in a conversation. They define the model's role, constraints, output format, and behavioral boundaries before the user says anything.
System prompts differ from regular prompts in scope. A regular prompt is a single instruction. A system prompt is an ongoing context that shapes every subsequent response.
Every major AI provider supports them. OpenAI uses the "system" role in its API, and Anthropic uses a dedicated system parameter.
Custom GPTs and Claude Projects both let non-technical users set persistent instructions.
What belongs in a system prompt?
Effective system prompts cover identity, constraints, and format. They answer: Who are you? What should you never do? How should you format responses?
You are a senior technical writer for a developer
documentation platform. Your audience is experienced
software engineers.
Rules:
- Use precise technical language
- Include code examples in every explanation
- Use Python for examples unless asked otherwise
- Maximum 3 sentences per paragraph
- Never say "simply" or "just" — respect complexity
- When uncertain, say so rather than guessing
- Format all responses in Markdown
When asked about API endpoints, always include:
method, URL path, request body schema, and
response body schema with example values.
Info
Models are trained to weight system-level instructions heavily. Place your most critical instructions there, not in the user message.
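In API terms, the system prompt is simply the first, privileged message. OpenAI-style chat APIs take it as a message with role "system", while Anthropic's API accepts the same text through a separate system parameter; the prompt text below is an abbreviated version of the template above:

```python
SYSTEM_PROMPT = (
    "You are a senior technical writer for a developer documentation "
    "platform. Use precise technical language and format responses in Markdown."
)

# OpenAI-style message list: the system message persists for the whole
# session, and every user/assistant turn is appended after it.
messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": "Document the pagination parameters of our list endpoints."},
]

# e.g. client.chat.completions.create(model="gpt-4o", messages=messages)
```

Because the system message travels with every request, keep it tight: it is paid for in tokens on every turn of the conversation.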
Choosing the Right Technique: A Decision Framework
No single technique wins everywhere. The right choice depends on task complexity, accuracy requirements, and budget.
| Technique | Best For | Token Cost | Accuracy Gain |
|---|---|---|---|
| Zero-shot | Simple, clear tasks | Low | Baseline |
| Few-shot | Pattern-matching tasks | Medium | +5-10% |
| Chain of Thought | Multi-step reasoning | Medium | +10-18% |
| Tree of Thoughts | Strategic planning, puzzles | Highest | +20-70% |
| Self-Consistency | Math, logic problems | High (5-10x) | +12-18% |
| ReAct | Tasks needing current data | Medium-High | Varies |
| Meta-Prompting | Prompt optimization | Medium | Indirect |
| Role Prompting | Domain-specific tasks | Low | +5-15% |
| Prompt Chaining | Complex multi-step workflows | High | +10-30% |
| Constitutional AI | Safety-critical outputs | High | Safety-focused |
| Structured Output | Code/data integration | Low | Format reliability |
| System Prompts | Consistent behavior | Low | Consistency |
Quick Decision Tree
Is the task simple and well-defined? Start with zero-shot. Add few-shot examples if accuracy is insufficient.
Does the task require reasoning? Use chain of thought. If the stakes are high, add self-consistency.
Does the task need exploration or planning? Use Tree of Thoughts.
Does the model need external information? Use ReAct or prompt chaining with tool access.
Is the output going into code? Use structured output prompting.
Does the task need safety guardrails? Layer constitutional AI principles on top.
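The decision tree above can be encoded directly as a routing function. A sketch; the boolean task properties are illustrative names, not a standard taxonomy:

```python
def pick_techniques(task: dict) -> list[str]:
    """Map the decision-tree questions onto boolean task properties."""
    chosen = []
    if task.get("needs_external_data"):
        chosen.append("ReAct")
    if task.get("needs_planning"):
        chosen.append("Tree of Thoughts")
    elif task.get("needs_reasoning"):
        chosen.append("Chain of Thought")
        if task.get("high_stakes"):
            chosen.append("Self-Consistency")
    if task.get("machine_readable_output"):
        chosen.append("Structured Output")
    if task.get("safety_critical"):
        chosen.append("Constitutional AI")
    # Simple, well-defined tasks fall through to the zero-shot baseline.
    return chosen or ["Zero-shot"]

print(pick_techniques({"needs_reasoning": True, "high_stakes": True}))
```

A routing function like this is also a useful place to log which techniques your pipeline actually selects in production.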
Combining Techniques for Maximum Impact
The most effective prompt engineers combine techniques. Research consistently shows that hybrid approaches outperform any single method.
Yao et al.'s ReAct paper found that combining ReAct with CoT and self-consistency outperformed all individual prompting methods on knowledge-intensive tasks. Wang et al. showed that self-consistency layered on top of CoT boosted GSM8K performance by 17.9% over CoT alone.
A Real-World Stack
Here's how a production system might combine techniques for a complex research task:
# System prompt (persistent context)
You are a senior market research analyst at a
Fortune 500 consulting firm.
# Role prompting + Chain of thought + Structured output
Analyze the competitive landscape for [product category].
Think through your analysis step by step:
1. Identify the top 5 competitors
2. Evaluate each on pricing, features, and market share
3. Identify gaps and opportunities
Output your analysis as a JSON object with this schema:
{
"competitors": [...],
"market_gaps": [...],
"recommendation": "string"
}
You can build prompts that combine any of these techniques using the SurePrompts prompt builder. It handles role assignment, format specification, and context framing automatically.
The Research Behind These Techniques
Every technique in this guide traces back to published research. Here are the foundational papers:
| Technique | Paper | Authors | Year |
|---|---|---|---|
| Few-shot | Language Models are Few-Shot Learners | Brown et al. | 2020 |
| Chain of Thought | CoT Prompting Elicits Reasoning in LLMs | Wei et al. | 2022 |
| Self-Consistency | Self-Consistency Improves CoT Reasoning | Wang et al. | 2022 |
| ReAct | ReAct: Synergizing Reasoning and Acting | Yao et al. | 2022 |
| Tree of Thoughts | Tree of Thoughts: Deliberate Problem Solving | Yao et al. | 2023 |
| Constitutional AI | Constitutional AI: Harmlessness from AI Feedback | Bai et al. | 2022 |
The field moves fast. Meincke and Mollick's 2025 Wharton report found that CoT's value has decreased for reasoning-native models. Techniques that were breakthrough in 2022 may be built into model architectures by 2026.
Stay current. What works today may be redundant tomorrow as models evolve.
Frequently Asked Questions
What is the most effective prompt engineering technique?
Chain of thought combined with self-consistency produces the highest accuracy on reasoning tasks. Wang et al.'s 2022 research showed +17.9% improvement on GSM8K when combining these two techniques. For non-reasoning tasks, few-shot prompting often suffices.
Do I need to use advanced techniques with modern models?
Not always. Meincke and Mollick's 2025 Wharton study found that reasoning models like o1 gain negligible benefit from explicit CoT prompting. The reasoning is already built into the model. Test zero-shot first — add complexity only when results fall short.
How many few-shot examples should I include?
Three to five examples work for most tasks. Brown et al.'s GPT-3 research showed that performance improves with each example, but gains diminish after 5-8. More examples consume context window space without proportional accuracy improvement.
What's the difference between chain of thought and tree of thoughts?
Chain of thought follows a single reasoning path. Tree of thoughts explores multiple paths and can backtrack.
CoT is linear; ToT is branching. ToT excels when problems have multiple valid solution strategies.
When should I use prompt chaining vs. a single prompt?
Use prompt chaining when the task has distinct phases (research → outline → draft → edit). Use a single prompt when the task is cohesive and doesn't exceed the model's context window. Chaining reduces hallucination by letting you verify intermediate outputs.
Can I combine multiple prompting techniques?
Yes — and you should for complex tasks. Layer role prompting with chain of thought for domain-specific reasoning. Add self-consistency for reliability.
Use structured output for machine-readable results. The most effective production systems combine 2-3 techniques.
How does prompt engineering change with reasoning models?
Reasoning models like o1 and DeepSeek-R1 internalize step-by-step thinking. Explicit CoT prompts can hurt performance by conflicting with built-in reasoning. Focus on clear task specification, structured output, and role context.
What's the cheapest way to improve prompt performance?
Start with role prompting, which adds only a handful of tokens. Then try few-shot examples (3-5).
These two low-cost techniques solve most quality issues. Reserve self-consistency and ToT for problems requiring high accuracy.