Prompt Engineering Glossary

Essential terms and concepts every prompt engineer should know. Browse 211 key definitions with examples and practical tips.

Chain of Thought Prompting
Chain of thought prompting is a technique that encourages an AI model to break down complex reasoning into sequential, intermediate steps before arriving at a final answer.
Context Window
A context window is the maximum amount of text (measured in tokens) that an AI model can process in a single interaction, including both the input prompt and the generated output.
Few-Shot Prompting
Few-shot prompting is a technique where you provide the AI model with a small number of examples (typically 2-5) within the prompt to demonstrate the desired format, style, or reasoning pattern.
Fine-Tuning
Fine-tuning is the process of further training a pre-trained AI model on a specific dataset to specialize its behavior for particular tasks or domains.
Grounding
Grounding is the practice of anchoring AI responses to specific, verifiable sources of information such as documents, databases, or real-time data.
Hallucination
A hallucination occurs when an AI model generates information that sounds plausible but is factually incorrect, fabricated, or unsupported by its training data.
In-Context Learning
In-context learning is the ability of a large language model to learn and adapt its behavior based on examples or instructions provided directly within the prompt, without any changes to the model's underlying weights.
Instruction Tuning
Instruction tuning is a training technique where a pre-trained language model is further trained on a curated dataset of instruction-response pairs to improve its ability to follow natural language instructions.
Letta
Letta is an open-source stateful agent framework where the agent itself manages its memory via tool calls.
Large Language Model (LLM)
A large language model (LLM) is an AI system trained on massive amounts of text data that can understand, generate, and reason about natural language.
Multi-Agent System
A multi-agent system is a system in which two or more LLM-driven agents collaborate on a task, typically by exchanging messages, handing off ownership, or being orchestrated by a higher-level coordinator.
Multi-Modal AI
Multi-modal AI refers to artificial intelligence systems that can process and generate content across multiple types of data — such as text, images, audio, and video — within a single model.
Negative Prompting
Negative prompting is a technique where you explicitly tell the AI model what to avoid, exclude, or not do in its response.
Persona Prompting
Persona prompting is a technique where you ask the AI to adopt a specific identity, personality, or character to shape the tone, vocabulary, and perspective of its responses.
Prompt Chaining
Prompt chaining is a strategy where you break a complex task into a sequence of simpler prompts, feeding the output of one step as input to the next.
Prompt Engineering
Prompt engineering is the practice of designing, refining, and optimizing the text inputs (prompts) given to AI models to elicit the most useful, accurate, and relevant outputs.
Prompt Injection
Prompt injection is a security vulnerability where a malicious user crafts input that overrides or manipulates the AI model's original instructions, causing it to ignore its guidelines or perform unintended actions.
Prompt Template
A prompt template is a reusable, pre-structured prompt with placeholder variables that can be filled in with specific details for each use.
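As a rough illustration, a prompt template can be sketched with Python's standard `string.Template`. The template wording, the placeholder names (`doc_type`, `max_sentences`, `text`), and the `render_prompt` helper are hypothetical and chosen only for this example.

```python
from string import Template

# Hypothetical template; the placeholder names are illustrative only.
SUMMARY_TEMPLATE = Template(
    "Summarize the following $doc_type in $max_sentences sentences:\n\n$text"
)


def render_prompt(doc_type: str, max_sentences: int, text: str) -> str:
    """Fill every placeholder in the template with a concrete value."""
    # Template.substitute raises KeyError if a placeholder is left unfilled.
    return SUMMARY_TEMPLATE.substitute(
        doc_type=doc_type, max_sentences=max_sentences, text=text
    )


prompt = render_prompt("email", 2, "Hi team, the launch moved to Friday.")
```

The same template can then be reused with different values for each request.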
Retrieval-Augmented Generation (RAG)
Retrieval-augmented generation (RAG) is an architecture that enhances AI model responses by first retrieving relevant information from an external knowledge base and then including that information in the prompt for the model to reference.
Role Prompting
Role prompting is a technique where you assign the AI model a specific professional role or area of expertise to shape the depth, vocabulary, and perspective of its responses.
Self-Consistency
Self-consistency is a prompting strategy where you generate multiple responses to the same question using chain-of-thought reasoning, then select the most common answer among them.
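The selection step of self-consistency can be sketched as a majority vote. This assumes the final answers have already been extracted from each sampled chain-of-thought response; the `self_consistent_answer` helper and the sample data are hypothetical.

```python
from collections import Counter


def self_consistent_answer(final_answers: list[str]) -> str:
    """Return the most common final answer among sampled reasoning paths."""
    if not final_answers:
        raise ValueError("need at least one sampled answer")
    # Light normalization so " 42 " and "42" count as the same answer.
    normalized = [answer.strip().lower() for answer in final_answers]
    return Counter(normalized).most_common(1)[0][0]


# Hypothetical final answers taken from five sampled responses.
sampled = ["42", "42", "41", " 42 ", "40"]
best = self_consistent_answer(sampled)
```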
System Prompt
A system prompt is a special set of instructions provided to an AI model before the user's message that defines the model's behavior, personality, constraints, and response format for the entire conversation.
Temperature
Temperature is a parameter that controls the randomness of an AI model's output: lower values make responses more predictable, while higher values make them more varied and creative.
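One common way temperature is applied is to divide the model's raw scores (logits) by the temperature before converting them to probabilities with softmax. The sketch below assumes that formulation; the function name and the example logits are made up for illustration.

```python
import math


def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Convert logits to probabilities after scaling by 1 / temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # made-up scores for three candidate tokens
sharp = softmax_with_temperature(logits, 0.5)  # low temperature: peaked
flat = softmax_with_temperature(logits, 2.0)   # high temperature: flatter
```

With the low temperature, most of the probability mass lands on the top-scoring token; with the high temperature, the mass is spread more evenly.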
Token
A token is the basic unit of text that AI models use to process and generate language.
Top-P (Nucleus Sampling)
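Top-P, also known as nucleus sampling, is a parameter that controls which tokens the model considers when generating each token: sampling is restricted to the smallest set of most-probable tokens whose cumulative probability reaches P.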
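A minimal sketch of the nucleus-filtering step, assuming the next-token probabilities are already available as a dictionary. The `top_p_filter` name and the toy distribution are hypothetical.

```python
def top_p_filter(probs: dict[str, float], p: float) -> dict[str, float]:
    """Keep the smallest set of top tokens whose cumulative mass reaches p."""
    if not 0 < p <= 1:
        raise ValueError("p must be in (0, 1]")
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept: list[tuple[str, float]] = []
    cumulative = 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        cumulative += prob
        if cumulative >= p:
            break
    # Renormalize so the surviving probabilities sum to 1.
    total = sum(prob for _, prob in kept)
    return {token: prob / total for token, prob in kept}


# Made-up next-token distribution.
probs = {"cat": 0.5, "dog": 0.3, "fish": 0.15, "rock": 0.05}
nucleus = top_p_filter(probs, 0.75)
```

Here only "cat" and "dog" survive; the next token would then be sampled from that renormalized subset.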
Tree of Thought Prompting
Tree of thought prompting is an advanced reasoning technique where the AI model explores multiple branching solution paths simultaneously, evaluates each branch, and backtracks from dead ends before selecting the best path to the answer.
Zero-Shot Prompting
Zero-shot prompting is the simplest prompting approach where you give the AI model a task instruction without providing any examples.
Agent Graph
An agent graph is a representation of an agentic LLM application as a directed graph of nodes (work units, often LLM calls or tools) connected by edges (transitions, often conditional on state).
Agent Handoff
An agent handoff is a pattern in multi-agent systems where one agent transfers control of the conversation or task to another agent — passing along context but ceding ownership of the loop.
Agent Orchestration
Agent orchestration is the practice of designing how agents, tools, and state interact across a multi-step task.
Agent Tool Loop
An agent tool loop is the canonical agentic execution pattern: the model receives a goal, optionally calls a tool, observes the result, and decides whether to call another tool or finish.
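The loop can be sketched with a scripted stand-in for the model. Everything here is hypothetical: `scripted_model` replaces a real LLM call, the `add` tool and the decision dictionary shape are invented for the example, and real frameworks use their own message formats.

```python
from typing import Callable


def add(a: float, b: float) -> float:
    """Toy tool: add two numbers."""
    return a + b


TOOLS: dict[str, Callable[..., float]] = {"add": add}


def scripted_model(goal: str, observations: list[float]) -> dict:
    """Stand-in for an LLM: call `add` once, then finish with the result."""
    if not observations:
        return {"type": "tool_call", "name": "add", "args": {"a": 2, "b": 3}}
    return {"type": "finish", "answer": f"The sum is {observations[-1]}"}


def run_agent(goal: str, max_steps: int = 5) -> str:
    """Ask the model, run any requested tool, feed the result back, repeat."""
    observations: list[float] = []
    for _ in range(max_steps):
        decision = scripted_model(goal, observations)
        if decision["type"] == "finish":
            return decision["answer"]
        result = TOOLS[decision["name"]](**decision["args"])
        observations.append(result)  # the model sees this on the next turn
    raise RuntimeError("agent did not finish within max_steps")


answer = run_agent("What is 2 + 3?")
```

The `max_steps` cap is a design choice in this sketch to keep a misbehaving model from looping forever.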
Agentic AI
Agentic AI refers to AI systems that can autonomously plan, execute, and iterate on multi-step tasks with minimal human intervention.
Tool Use (Function Calling)
Tool use, also called function calling, is the ability of an AI model to invoke external tools, APIs, or functions during a conversation to perform actions beyond text generation.
Mastra
Mastra is an open-source TypeScript framework for building AI agents and workflows.
Model Context Protocol (MCP)
The Model Context Protocol (MCP) is an open standard developed by Anthropic that provides a universal way to connect AI models to external data sources, tools, and services.
Reasoning Model
A reasoning model is an AI system specifically trained to perform extended, step-by-step thinking before producing a final answer.
AI Guardrails
AI guardrails are safety mechanisms, rules, and constraints built into AI systems to prevent harmful, biased, or undesired outputs.
Structured Output
Structured output refers to AI model responses that follow a specific, machine-readable format such as JSON, XML, CSV, or a defined schema.
Context Caching
Context caching is an optimization technique where AI providers store and reuse previously processed prompt prefixes across multiple API calls.
Embedding
An embedding is a numerical vector representation of text that captures its semantic meaning in a high-dimensional space.
Vector Database
A vector database is a specialized database designed to store, index, and efficiently query high-dimensional embedding vectors.
Prompt Optimization
Prompt optimization is the systematic process of iteratively refining prompts to improve the quality, accuracy, and consistency of AI model outputs.
AI Alignment
AI alignment is the field of research and practice focused on ensuring that AI systems behave in accordance with human values, intentions, and goals.
Knowledge Cutoff
A knowledge cutoff is the date beyond which an AI model has no training data, meaning it cannot answer questions about events, discoveries, or changes that occurred after that point.
Inference
Inference is the process of using a trained AI model to generate predictions or outputs from new inputs.
Beam Search
Beam search is a decoding strategy that explores multiple candidate output sequences simultaneously during text generation, keeping only the k most probable sequences at each step, where k is the "beam width".
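A small sketch of beam search over a hypothetical toy "model", here just a lookup table of next-token probabilities keyed by the previous token. The table, token names, and `beam_search` helper are invented for illustration; a real decoder would query a language model instead.

```python
import math

# Hypothetical toy "model": next-token probabilities given the last token.
NEXT = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.5, "dog": 0.5},
    "a": {"cat": 0.9, "dog": 0.1},
    "cat": {"</s>": 1.0},
    "dog": {"</s>": 1.0},
}


def beam_search(beam_width: int, max_steps: int = 5) -> list[tuple[list[str], float]]:
    """Return the surviving beams as (tokens, log-probability), best first."""
    beams = [(["<s>"], 0.0)]
    for _ in range(max_steps):
        candidates = []
        for tokens, score in beams:
            if tokens[-1] == "</s>":
                candidates.append((tokens, score))  # finished: carry forward
                continue
            for token, prob in NEXT[tokens[-1]].items():
                candidates.append((tokens + [token], score + math.log(prob)))
        # Keep only the `beam_width` highest-scoring sequences.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
        if all(tokens[-1] == "</s>" for tokens, _ in beams):
            break
    return beams


best_tokens, best_logp = beam_search(beam_width=2)[0]
greedy_tokens, greedy_logp = beam_search(beam_width=1)[0]
```

In this toy table, a beam width of 1 (greedy decoding) commits to "the" first and ends with a lower-probability sequence than a beam width of 2, which keeps "a" alive long enough to find the better continuation.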
Tokenizer
A tokenizer is the component that converts raw text into a sequence of tokens (numerical IDs) that an AI model can process, and converts model output tokens back into readable text.
Attention Mechanism
An attention mechanism is a neural network component that allows a model to dynamically weigh the importance of different parts of the input when generating each part of the output.
Transformer
A transformer is the neural network architecture that powers virtually all modern large language models, including GPT, Claude, Gemini, and LLaMA.
Prompt Caching
Prompt caching is a performance optimization where the model's computed internal representations (key-value attention states) of a static prompt prefix are stored and reused across multiple requests.
Constitutional AI
Constitutional AI (CAI) is a training methodology developed by Anthropic where an AI model is guided by a set of written principles (a "constitution") to self-critique and revise its own outputs during training.
Reinforcement Learning from Human Feedback (RLHF)
Reinforcement learning from human feedback (RLHF) is a training method where human evaluators rank or score multiple AI outputs, and those preferences are used to train a reward model that further fine-tunes the language model.
Chain of Verification
Chain of verification (CoVe) is a prompting technique where the AI model first generates an initial response, then creates specific verification questions about its own claims, answers those questions independently, and finally revises the original response based on the verification results.
mem0
mem0 is an open-source memory layer that adds persistent memory to any LLM application via four primitives: add, search, update, delete.
Memory Block
A memory block is a labeled, persistent chunk of agent memory directly editable by the agent via tool calls.
Memory Recall
Memory recall is the retrieval step in agent memory: surfacing relevant past memory into the current context window so the model can use it.
Meta-Prompting
Meta-prompting is the practice of using an AI model to generate, refine, or optimize prompts for other AI tasks.
AI Agent
An AI agent is a software system that uses a large language model as its reasoning core to autonomously plan, execute, and adapt multi-step workflows using external tools and data sources.
Few-Shot Chain of Thought
Few-shot chain of thought is a prompting technique that combines few-shot examples with explicit step-by-step reasoning demonstrations.
Prompt Leaking
Prompt leaking is an attack technique where a user crafts inputs designed to trick an AI model into revealing its hidden system prompt or confidential instructions.
AI Hallucination Detection
AI hallucination detection encompasses the methods, tools, and techniques used to identify when an AI model generates false, fabricated, or unsupported information.
Model Distillation
Model distillation is a technique for creating a smaller, more efficient "student" model that approximates the behavior of a larger "teacher" model.
Swarm
Swarm is an experimental cookbook framework released by OpenAI in 2024 that demonstrated lightweight multi-agent patterns — primarily handoffs and shared context — without the production hardening of a full SDK.
SWE-Bench
SWE-Bench is an evaluation benchmark from Princeton and the University of Chicago that measures an AI agent's ability to resolve real GitHub issues by pro...
Tau-bench
Tau-bench is an agent evaluation benchmark that tests tool-use accuracy across multi-turn customer-service-style tasks.
OSWorld
OSWorld is an agent evaluation benchmark for desktop and browser computer-use tasks.
Function-Calling Accuracy
Function-calling accuracy is how often a model correctly picks the right tool, passes valid arguments, and respects schema constraints when given a function-calling interface.
Synthetic Data
Synthetic data is artificially generated data created by AI models or algorithmic processes rather than collected from real-world events.
Data Poisoning
Data poisoning is an adversarial attack that corrupts an AI model's training data to manipulate its behavior in targeted ways.
Jailbreaking
Jailbreaking refers to techniques used to bypass an AI model's built-in safety restrictions, content policies, and behavioral guidelines to produce outputs the model was trained to refuse.
Semantic Search
Semantic search is an information retrieval approach that finds results based on the meaning of a query rather than exact keyword matches.
Prompt Versioning
Prompt versioning is the practice of tracking changes to prompts over time using version control principles — assigning version identifiers, recording modifications, and maintaining a history of prompt iterations.
OpenAI Agents SDK
The OpenAI Agents SDK is OpenAI's official Python framework for building production-grade agents.
Output Parsing
Output parsing is the process of extracting structured, machine-readable data from an AI model's free-form text responses.
LangGraph
LangGraph is an open-source Python library from the LangChain team for building stateful, multi-actor LLM applications as graphs.
Latent Space
Latent space is the high-dimensional internal representation space where AI models encode the meaning, relationships, and features of input data as numerical vectors.
Zero-Shot Chain of Thought
Zero-shot chain of thought is a prompting technique where you append a simple phrase like "Let's think step by step" to a question without providing any reasoning examples.
Prompt Compression
Prompt compression encompasses techniques for reducing the length of a prompt while preserving its essential meaning and effectiveness.
AI Safety
AI safety is the interdisciplinary field focused on ensuring that AI systems behave as intended, remain under human control, and do not cause unintended harm.
Red Teaming
Red teaming in AI is the practice of systematically probing an AI system for vulnerabilities, failure modes, and harmful behaviors through adversarial testing.
Benchmark
A benchmark in AI is a standardized test suite with predefined tasks, datasets, and evaluation metrics used to measure and compare model performance.
Perplexity
Perplexity is a standard metric for evaluating how well a language model predicts a sequence of text.
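Perplexity is commonly computed as the exponential of the average negative log-probability the model assigned to each actual token. The sketch below assumes those per-token probabilities are already available; the function name and the sample values are illustrative.

```python
import math


def perplexity(token_probs: list[float]) -> float:
    """exp of the mean negative log-probability of the observed tokens."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)


# A model that assigns probability 0.25 to every observed token is as
# uncertain as a uniform choice among four options.
uniform = perplexity([0.25, 0.25, 0.25, 0.25])
```

Lower values mean the model found the text less surprising; a perplexity of 1 would mean it predicted every token with certainty.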
Logits
Logits are the raw, unnormalized numerical scores that a language model assigns to each token in its vocabulary as the potential next token.
Sampling
Sampling is the process of selecting the next token from the probability distribution a language model produces at each generation step.
Stop Sequence
A stop sequence is a predefined token, string, or pattern that signals the AI model to immediately stop generating text when encountered in the output.
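Providers normally apply stop sequences during generation. The sketch below only imitates the effect as a post-processing step on already generated text, and it assumes the stop sequence itself is excluded from the result. The `truncate_at_stop` helper is hypothetical.

```python
def truncate_at_stop(text: str, stop_sequences: list[str]) -> str:
    """Cut the text at the earliest occurrence of any stop sequence."""
    cut = len(text)
    for stop in stop_sequences:
        if not stop:
            continue  # ignore empty strings, which would match at index 0
        index = text.find(stop)
        if index != -1:
            cut = min(cut, index)
    return text[:cut]


generated = "Answer: 42\nUser: next question"
trimmed = truncate_at_stop(generated, ["\nUser:", "###"])
```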
JSON Mode
JSON mode is a model configuration setting that constrains the AI's output to be valid, parseable JSON.
Vector Memory
Vector memory is agent memory stored as embedding vectors in a vector database, retrieved by semantic similarity.
Vibe Coding
Vibe coding is a term popularized by Andrej Karpathy in early 2025 for a mode of working with AI coding agents in which the developer iterates by describing ...
Vision-Language Model (VLM)
A vision-language model (VLM) is an AI system that can process, understand, and reason about both visual inputs (images, screenshots, diagrams) and text simultaneously within a single model architecture.
Function Calling
Function calling is an AI model capability where the model analyzes a user's prompt and generates structured JSON specifying which external function to invoke and what arguments to pass.
Terminal-Bench
Terminal-Bench is an evaluation benchmark for AI agents that measures their ability to complete long-horizon, multi-step shell tasks — git operations, build ...
Test-Time Compute
Test-time compute is the practice of allocating additional computational resources during inference — when the model generates a response — rather than during training.
AI IDE
An AI IDE is a development environment in which an AI agent is the primary or co-equal interface for writing and editing code, rather than an autocomplete sidecar layered on top of a traditional editor.
AI Overview
An AI Overview is an AI-generated summary box that appears at the top of Google search results, synthesizing information from multiple web sources to answer a user's query directly.
Generative Engine Optimization (GEO)
Generative engine optimization (GEO) is the practice of structuring and enhancing content so that AI-powered platforms — like ChatGPT, Perplexity, and Google AI Overviews — cite, reference, or recommend it when generating responses.
Aider Polyglot
Aider Polyglot is a multi-language coding benchmark, originated by the Aider open-source project, that evaluates an AI agent's ability to satisfy hidden test...
Answer Engine Optimization (AEO)
Answer engine optimization (AEO) is a content strategy focused on structuring web content to appear as direct answers in featured snippets, People Also Ask boxes, voice search results, and AI-generated summaries.
Thinking Model
A thinking model is an AI system that uses extended inference-time computation to reason through problems before producing a final answer.
Prompt Routing
Prompt routing is the practice of automatically directing each user prompt to the most suitable AI model based on task type, complexity, and cost constraints.
Multimodal Prompting
Multimodal prompting is the practice of combining multiple input types — such as text, images, audio, or video — within a single prompt to give an AI model richer context for its response.
Prompt Tuning
Prompt tuning is a parameter-efficient technique that adapts a large language model to specific tasks by training small learnable vectors called "soft prompts" that are prepended to the input.
Instruction Following
Instruction following is an AI model's ability to accurately understand and execute explicit directions given in a prompt — including format requirements, length constraints, tone specifications, and multi-step procedures.
Cline
Cline is an open-source autonomous coding agent that runs as a Visual Studio Code extension; the project was originally released as Claude Dev before adopting its current name.
Code Interpreter
A code interpreter is an AI capability that allows a model to write and execute code — typically Python — in a sandboxed environment to solve analytical, mathematical, or data processing tasks.
Deep Research
Deep research is an AI capability where the model autonomously conducts multi-step web research to produce comprehensive, sourced reports on complex topics.
Mixture of Experts (MoE)
Mixture of experts (MoE) is a neural network architecture that divides a model into many specialized sub-networks called "experts" and uses a routing mechanism to activate only a small subset of them for each input.
Knowledge Graph
A knowledge graph is a structured database that represents real-world entities (people, places, concepts) and the relationships between them as an interconnected network of nodes and edges.
Quantization
Quantization is a technique that reduces an AI model's numerical precision — for example, converting 16-bit floating-point weights to 4-bit integers — to shrink the model's memory footprint and speed up inference.
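A minimal sketch of one simple scheme, symmetric per-tensor quantization to the signed 4-bit range [-8, 7]. This is an assumption made for illustration; production libraries typically use more elaborate methods such as per-group scales. Python floats stand in for the 16-bit weights, and the helper names are hypothetical.

```python
def quantize_int4(weights: list[float]) -> tuple[list[int], float]:
    """Map floats to integers in [-8, 7] using a single shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 7 if max_abs > 0 else 1.0
    quantized = [max(-8, min(7, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized: list[int], scale: float) -> list[float]:
    """Recover approximate float weights from the integers and the scale."""
    return [q * scale for q in quantized]


weights = [0.7, -0.3, 0.12, 0.0]  # made-up weights
q, scale = quantize_int4(weights)
restored = dequantize(q, scale)
```

Each stored value now needs only 4 bits plus one shared scale, at the cost of a small rounding error (here 0.12 is restored as roughly 0.1).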
LoRA (Low-Rank Adaptation)
LoRA (Low-Rank Adaptation) is a parameter-efficient fine-tuning technique that adapts a pre-trained AI model to new tasks by injecting small trainable matrices into the model's layers while keeping the original weights frozen.
Prompt Ensembling
Prompt ensembling is a technique that runs multiple variations of a prompt for the same task and combines their outputs to produce a more accurate and robust final result.
Self-Reflection
Self-reflection is a prompting technique where an AI model evaluates, critiques, and improves its own output in one or more follow-up steps.
Direct Preference Optimization (DPO)
Direct preference optimization (DPO) is a training technique that aligns AI models with human preferences by learning directly from pairs of preferred and rejected outputs — without needing a separate reward model.
KV-Cache
A KV-cache (key-value cache) stores the computed attention key and value matrices from previously processed tokens so the model does not need to recalculate them when generating each new token.
AI Watermarking
AI watermarking is the practice of embedding hidden, machine-detectable patterns into AI-generated content — text, images, audio, or video — so that the content can later be identified as AI-produced.
Prompt Injection Defense
Prompt injection defense refers to the techniques and strategies used to protect AI systems from prompt injection attacks, where malicious inputs attempt to override the model's original instructions.
Context Stuffing
Context stuffing is the technique of loading relevant information — documents, data, or examples — directly into an AI model's prompt to give it the knowledge needed to answer accurately.
Model Collapse
Model collapse is a phenomenon where AI models progressively degrade when trained on data generated by other AI models rather than human-created content.
Autonomous Agent
An autonomous agent is an AI system that can independently plan, decide, and execute multi-step tasks to achieve a goal with minimal human oversight.
Benchmark Contamination
Benchmark contamination occurs when an AI model's training data accidentally or deliberately includes questions and answers from the benchmark tests used to evaluate it.
Emergent Behavior
Emergent behavior in AI refers to capabilities that appear unexpectedly in large language models as they scale up in size, without being explicitly programmed or trained for those tasks.
Catastrophic Forgetting
Catastrophic forgetting is a phenomenon where a neural network rapidly loses previously learned knowledge when it is trained on new data or tasks.
Few-Shot Learning
Few-shot learning is a machine learning approach where a model learns to perform a new task from only a handful of training examples — sometimes as few as one to five.
Transfer Learning
Transfer learning is a machine learning technique where a model trained on one task or dataset is reused as the starting point for a different but related task.
Semantic Similarity
Semantic similarity is a measure of how close two pieces of text are in meaning, regardless of whether they share the same words.
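One common way to compute semantic similarity is cosine similarity between embedding vectors. The sketch below assumes that approach; the three-dimensional vectors are made up for illustration, whereas real embeddings come from an embedding model and have far more dimensions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


# Made-up embeddings: "car" and "automobile" share no letters beyond chance
# but point in a similar direction; "banana" points elsewhere.
car = [0.9, 0.1, 0.0]
automobile = [0.8, 0.2, 0.1]
banana = [0.0, 0.2, 0.9]

close = cosine_similarity(car, automobile)
far = cosine_similarity(car, banana)
```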
Grok
Grok is the family of conversational AI models built by xAI, distinguished from other major assistants by its real-time access to posts on X (formerly Twitte...
Working Memory
Working memory is short-term active memory that holds the current task context.
xAI
xAI is the artificial intelligence research company founded by Elon Musk in 2023.
Context Engineering
Context engineering is the discipline of deliberately assembling everything an AI model sees at inference time — system prompt, retrieved documents, conversa...
ReAct Prompting
ReAct prompting is a technique that interleaves Reasoning and Acting: the model writes a short reasoning trace about what to do next, takes an action (typica...
SurePrompts Quality Rubric
The SurePrompts Quality Rubric is a 7-dimension scoring framework for evaluating prompt quality: role clarity, context sufficiency, instruction specificity, format structure, example quality, constraint tightness, and output validation.
RCAF Prompt Structure
RCAF is a 4-part prompt skeleton — Role, Context, Action, Format — for drafting maintainable AI prompts.
Context Engineering Maturity Model
The Context Engineering Maturity Model is a 5-level framework for describing how sophisticated a team's context assembly practice is.
Agentic Coding
Agentic coding is the umbrella term for autonomous, multi-step coding workflows in which an LLM-driven agent plans, executes (file edits, shell commands, tes...
Agentic Prompt Stack
The Agentic Prompt Stack is a 6-layer model for designing prompts that run AI agents: Goals, Tool permissions, Planning scaffold, Memory access, Output validation, and Error recovery.
Extended Thinking
Extended thinking is a Claude feature that lets the model allocate additional reasoning tokens before producing its final answer, with a user-controllable thinking budget set per request.
Computer Use
Computer use is an Anthropic capability in which Claude controls a virtual computer via screenshots and keyboard/mouse actions.
Voice Prompting
Voice prompting is the practice of writing prompts for realtime voice and audio AI interfaces — speech-to-speech systems, voice agents, and realtime APIs — w...
Semantic Caching
Semantic caching is a pattern for caching LLM responses keyed by meaning similarity rather than exact prompt match.
LLM-as-Judge
LLM-as-judge is an evaluation pattern in which an LLM scores the outputs of another model against a rubric.
Plan-and-Execute Prompting
Plan-and-execute prompting is a two-phase agent pattern.
Self-Refine Prompting
Self-refine prompting is an iterative pattern in which the model generates an output, critiques its own output against specified criteria, then produces a revised version.
Reflexion Prompting
Reflexion is an agent prompting pattern in which, after a failed attempt, the agent generates a short verbal reflection on what went wrong and uses that reflection as additional context for its next attempt.
Skeleton of Thought
Skeleton of thought is a reasoning pattern in which the model first produces a compact skeleton of the answer — a list of points or an outline — and then expands each skeleton point, often as independent sub-prompts running in parallel.
Tool Choice
Tool choice is an API parameter on modern tool-calling models that controls whether and how the model selects a tool.
Chain of Density
Chain of density is a summarization technique in which the model iteratively rewrites a summary, each pass adding more salient entities while keeping total length constant.
Procedural Memory
Procedural memory is memory of how to do something — implicit knowledge tied to learned routines and skills.
Program of Thoughts
Program of thoughts is a reasoning technique in which the model generates code — typically Python — to solve a numerical or logical problem, then executes the code to obtain the answer.
Self-Ask Prompting
Self-ask prompting is a reasoning pattern in which the model explicitly asks itself follow-up questions before answering a composite question.
ReWOO (Reasoning WithOut Observation)
ReWOO is an agent architecture that separates planning from execution.
RAFT (Retrieval-Augmented Fine-Tuning)
RAFT is a training technique that combines retrieval-augmented generation with fine-tuning.
Episodic Memory
Episodic memory is memory of specific events tied to time and context — "what happened when, where, and with whom." The term comes from cognitive science (Tu...
Eval Harness
An eval harness is infrastructure that runs a prompt or model against a fixed test set and computes aggregate scores per metric.
Golden Set
A golden set is a curated collection of input-output pairs that represent the correct behavior for a given task.
Indirect Prompt Injection
Indirect prompt injection is a security vulnerability in which malicious instructions are embedded in content the model retrieves — a web page, email, PDF, or database row — rather than typed by the end user.
DSPy
DSPy is a programming framework, originally from Stanford, that treats prompts as functions with typed signatures rather than strings.
Prompt Observability
Prompt observability is the operational practice of logging, tracing, and monitoring prompt inputs, outputs, and model behavior in production.
Least-to-Most Prompting
Least-to-most prompting is a reasoning pattern in which the model first decomposes a complex problem into an ordered sequence of easier sub-problems, then so...
Auto-CoT (Automatic Chain of Thought)
Auto-CoT is a method for generating chain-of-thought demonstrations automatically rather than hand-writing them.
Step-Back Prompting
Step-back prompting is a technique in which the model first generates a higher-level abstraction, principle, or generalization — a "step back" from the speci...
Active Prompting
Active prompting is an adaptive approach to few-shot example selection that borrows from active learning.
Chain of Code
Chain of Code is a hybrid reasoning pattern in which the model produces a trace that interleaves executable code with natural-language "pseudocode" comments.
Self-Debug Prompting
Self-debug prompting is a pattern in which the model generates code, an interpreter executes it, and the model receives the execution result — error messages, failed test output, or unexpected values — as additional context for a revised attempt.
Model Cascade
A model cascade is a routing pattern in which each request is first attempted by a cheaper, smaller model and only escalated to a stronger, more expensive mo...
Model Routing
Model routing is the practice of dispatching different requests to different language models based on task classification, cost target, or expected reasoning depth.
Cost Per Task
Cost per task is the total cost — including input tokens, output tokens, tool-call overhead, and retry rate — to complete one unit of useful work with a language model.
Prefix-Tuning
Prefix-tuning is a parameter-efficient fine-tuning method in which a small set of continuous, trainable vectors (the "prefix") is prepended to the input at every transformer layer while the underlying model weights stay frozen.
RAGAS
RAGAS is an open-source evaluation framework for retrieval-augmented generation systems.
Structured Decoding
Structured decoding is an inference-time technique that constrains the model's output to conform to a grammar, regular expression, or JSON schema by masking invalid tokens at each generation step.
Chunking
Chunking is the process of splitting source documents into smaller pieces before they are embedded and indexed for retrieval.
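A minimal sketch of one simple strategy, fixed-size character chunks with overlap. This is an assumption chosen for brevity; real pipelines often split on tokens, sentences, or document structure instead. The `chunk_text` helper is hypothetical.

```python
def chunk_text(text: str, chunk_size: int, overlap: int = 0) -> list[str]:
    """Split text into fixed-size character chunks that share `overlap` chars."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap
    chunks: list[str] = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this chunk already reaches the end of the text
        start += step
    return chunks


chunks = chunk_text("abcdefghij", chunk_size=4, overlap=2)
```

The overlap means text that straddles a chunk boundary still appears intact in at least one chunk, at the cost of indexing some characters twice.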
Reranking
Reranking is a secondary scoring pass over an initial set of retrieval candidates to improve their ordering before they are handed to the generator.
HyDE (Hypothetical Document Embeddings)
HyDE is a retrieval technique in which the language model first generates a hypothetical answer to the user's query, and then that hypothetical answer — not ...
Embedding Model
An embedding model is a machine-learning model that maps text (or images, audio, code) to a fixed-dimensional vector such that semantically similar inputs la...
Hybrid Search
Hybrid search is a retrieval technique that combines keyword-based search — typically BM25 over an inverted index — with vector-based semantic search, and fu...
Query Rewriting
Query rewriting is a retrieval preprocessing step that transforms the user's question before it is sent to the retriever.
RLAIF (Reinforcement Learning from AI Feedback)
RLAIF is a training technique that uses AI-generated preferences — typically from a strong LLM acting as a judge — to guide reinforcement-learning fine-tunin...
Mixture of Prompts
Mixture of prompts is an ensembling pattern where the same input is run through several different prompts and the resulting outputs are combined — by majorit...
Self-Critique Prompting
Self-critique prompting is a pattern where the model is asked to evaluate its own output against specific criteria, surface weaknesses, and suggest improvements — but deliver the critique as an output, not a rewrite.
GraphRAG
GraphRAG is a retrieval-augmented-generation variant that builds a knowledge graph from the source corpus — extracting entities, relationships, and community clusters — and uses the graph structure as retrieval context alongside or in place of raw document chunks.
Agentic RAG
Agentic RAG is a pattern where retrieval is treated as a tool call inside an agent loop rather than as a fixed first step in a linear pipeline.
Conversation Memory
Conversation memory is memory scoped to a single conversation or session — the running context of the current dialogue.
Corrective RAG (CRAG)
Corrective RAG is a 2024 retrieval pattern that adds a relevance-grading step between retrieval and generation: every retrieved document is scored by a lightweight evaluator for how well it answers the query, and the pipeline branches on the aggregate confidence.
Self-RAG
Self-RAG is a pattern in which the language model emits special reflection tokens that control its own retrieval and generation decisions.
Contextual Compression
Contextual compression is a preprocessing step that sits between retrieval and generation in a RAG pipeline.
Parent-Document Retrieval
Parent-document retrieval is a chunking-and-retrieval pattern that separates the unit used for matching from the unit used for generation.
Semantic Memory
Semantic memory is memory of general facts independent of when or how they were learned.
Semantic Router
A semantic router is an embedding-based routing layer that classifies an incoming query to one of several downstream prompts, agents, tools, or models by com...
Multimodal RAG
Multimodal RAG is a retrieval-augmented-generation variant in which the indexed corpus and the retrieval step span multiple modalities — text, images, tables, figures, audio, or video — not just plain text.
Long-Term Memory (Agent Memory)
Long-term memory is a persistent store that gives an agent access to information across sessions — user preferences, prior decisions, past tool results worth remembering, or accumulated background about a project.
Context Rot
Context rot is the degradation of model performance as a context window fills up with more content.
Document AI (Layout-Aware Parsing)
Document AI refers to techniques and services for extracting structured content from complex documents — layout, reading order, tables, figures, forms, handw...
BM25
BM25 is the dominant sparse-retrieval algorithm and the default scoring function in Lucene, Elasticsearch, and OpenSearch; Postgres's built-in full-text search ranks with `ts_rank` rather than BM25 unless an extension is added.
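To make the scoring concrete, here is a small self-contained sketch of the Okapi BM25 formula over pre-tokenized documents. The function name `bm25_scores` is hypothetical, the IDF term uses the smoothed `log(1 + ...)` variant (an assumption; other variants exist), and `k1=1.2`, `b=0.75` are commonly used values rather than anything this glossary specifies.

```python
import math
from collections import Counter

def bm25_scores(query: list[str], docs: list[list[str]],
                k1: float = 1.2, b: float = 0.75) -> list[float]:
    """Return one BM25 score per document for a tokenized query."""
    if not docs:
        return []
    n_docs = len(docs)
    avgdl = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if tf[term] == 0:
                continue  # terms absent from the document contribute nothing
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term-frequency saturation (k1) and length normalization (b).
            norm = tf[term] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores
```

Rarer query terms get a larger IDF weight, repeated occurrences give diminishing returns, and longer-than-average documents are penalized.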
Reciprocal Rank Fusion (RRF)
Reciprocal Rank Fusion is a technique for merging several ranked result lists — produced by different retrievers over the same corpus — into a single unified ranking.
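A minimal sketch of the fusion step: each document's fused score is the sum of `1 / (k + rank)` over every list in which it appears. The function name `reciprocal_rank_fusion` is illustrative, `k = 60` is the constant commonly used in practice (assumed here, not stated in this glossary), and ties are broken alphabetically only to keep the output deterministic.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists of document ids into a single ranking."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):  # ranks are 1-based
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; alphabetical order breaks exact ties.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))
```

Because only ranks are used, the lists can come from retrievers whose raw scores are not comparable, such as a keyword scorer and a vector-similarity search.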
Contextual Retrieval
Contextual Retrieval is a technique introduced by Anthropic in 2024 that prepends a short chunk-specific context summary to each chunk before it is embedded ...
CrewAI
CrewAI is an open-source Python framework for building multi-agent systems based on the role/goal/backstory metaphor.
Cross-Encoder
A cross-encoder is a transformer architecture that takes a query and a candidate document as a single joint input — typically concatenated with a separator t...
Bi-Encoder
A bi-encoder is a dual-tower transformer architecture in which the query and the document are encoded independently by the same (or twin) encoder into separa...
CodeAct
CodeAct is a pattern, formalized in a 2024 paper by Wang et al.
Coding Agent
A coding agent is an LLM system specialized for software-engineering tasks — reading code, editing files, running tests, executing shell commands, and iterating on results until a task is complete.
ColBERT (Late Interaction Retrieval)
ColBERT is a retrieval architecture that sits between bi-encoders and cross-encoders.
Many-Shot Jailbreaking
Many-shot jailbreaking is a long-context attack pattern identified by Anthropic researchers in 2024.
Multi-Query Retrieval
Multi-query retrieval is a RAG pattern that hedges against the single-query-phrasing failure mode of standard retrieval.
Needle in a Haystack
Needle in a haystack is a long-context evaluation pattern that measures whether a model can retrieve a specific fact (the needle) planted at an arbitrary pos...
Lost in the Middle
Lost in the middle is the finding from Liu et al.
RULER (Long-Context Benchmark)
RULER is a long-context evaluation that goes beyond simple needle-in-a-haystack retrieval.
Text to Speech (TTS)
Text to speech, or TTS, is the synthesis of spoken audio from written text.
Spec-Driven Development
Spec-driven development is a workflow in which a written specification — acceptance criteria, edge cases, interfaces, validation rules, and explicit non-goal...
Speech to Text (STT)
Speech to text, or STT — also called automatic speech recognition (ASR) — is the transcription of spoken audio into written text.
Voice Cloning
Voice cloning is the synthesis of a target speaker's voice from a short reference audio sample, allowing a TTS system to produce new speech in that speaker's...
Speaker Diarization
Speaker diarization is the "who spoke when" task: segmenting a multi-speaker audio recording by speaker identity and attaching speaker labels to each transcript segment.
Prosody
Prosody is the rhythm, stress, intonation, and pacing of speech — the suprasegmental layer above individual phonemes that carries emotion, emphasis, question vs.
Realtime Voice API
A realtime voice API is a speech-to-speech architecture that accepts streaming audio input and returns streaming audio output directly, without the classical STT-then-LLM-then-TTS pipeline.