Prompt Engineering Glossary

Essential terms and concepts every prompt engineer should know. Browse 175 key definitions with examples and practical tips.

Chain of Thought Prompting
Chain of thought prompting is a technique that encourages an AI model to break down complex reasoning into sequential, intermediate steps before arriving at a final answer.
Context Window
A context window is the maximum amount of text (measured in tokens) that an AI model can process in a single interaction, including both the input prompt and the generated output.
Few-Shot Prompting
Few-shot prompting is a technique where you provide the AI model with a small number of examples (typically 2-5) within the prompt to demonstrate the desired format, style, or reasoning pattern.
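A few-shot prompt can be assembled mechanically from example pairs. A minimal sketch, where the task, examples, and labels are all hypothetical:

```python
# Assemble a few-shot prompt: instruction, 2-5 demonstrations, then the query.
def build_few_shot_prompt(instruction, examples, query):
    parts = [instruction, ""]
    for example_input, example_output in examples:
        parts.append(f"Input: {example_input}")
        parts.append(f"Output: {example_output}")
        parts.append("")
    # The real query uses the same format, with the output left blank.
    parts.append(f"Input: {query}")
    parts.append("Output:")
    return "\n".join(parts)

prompt = build_few_shot_prompt(
    "Classify the sentiment of each review as positive or negative.",
    [("Loved it!", "positive"), ("Total waste of money.", "negative")],
    "Exceeded my expectations.",
)
```

Keeping the demonstration format identical to the final query is what lets the model infer the expected output pattern.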
Fine-Tuning
Fine-tuning is the process of further training a pre-trained AI model on a specific dataset to specialize its behavior for particular tasks or domains.
Grounding
Grounding is the practice of anchoring AI responses to specific, verifiable sources of information such as documents, databases, or real-time data.
Hallucination
A hallucination occurs when an AI model generates information that sounds plausible but is factually incorrect, fabricated, or unsupported by its training data.
In-Context Learning
In-context learning is the ability of a large language model to learn and adapt its behavior based on examples or instructions provided directly within the prompt, without any changes to the model's underlying weights.
Instruction Tuning
Instruction tuning is a training technique where a pre-trained language model is further trained on a curated dataset of instruction-response pairs to improve its ability to follow natural language instructions.
Large Language Model (LLM)
A large language model (LLM) is an AI system trained on massive amounts of text data that can understand, generate, and reason about natural language.
Multi-Modal AI
Multi-modal AI refers to artificial intelligence systems that can process and generate content across multiple types of data — such as text, images, audio, and video — within a single model.
Negative Prompting
Negative prompting is a technique where you explicitly tell the AI model what to avoid, exclude, or not do in its response.
Persona Prompting
Persona prompting is a technique where you ask the AI to adopt a specific identity, personality, or character to shape the tone, vocabulary, and perspective of its responses.
Prompt Chaining
Prompt chaining is a strategy where you break a complex task into a sequence of simpler prompts, feeding the output of one step as input to the next.
Prompt Engineering
Prompt engineering is the practice of designing, refining, and optimizing the text inputs (prompts) given to AI models to elicit the most useful, accurate, and relevant outputs.
Prompt Injection
Prompt injection is a security vulnerability where a malicious user crafts input that overrides or manipulates the AI model's original instructions, causing it to ignore its guidelines or perform unintended actions.
Prompt Template
A prompt template is a reusable, pre-structured prompt with placeholder variables that can be filled in with specific details for each use.
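In code, a prompt template is just a string with named placeholders. A minimal sketch using Python's standard library (the role, length, and text values are illustrative):

```python
from string import Template

# A reusable prompt skeleton; $role, $length, and $text are the placeholders.
SUMMARY_TEMPLATE = Template(
    "You are a $role. Summarize the following text in $length sentences:\n\n$text"
)

# Fill the placeholders for one specific use.
prompt = SUMMARY_TEMPLATE.substitute(
    role="technical editor",
    length=2,
    text="Transformers process all tokens in parallel using attention.",
)
```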
Retrieval-Augmented Generation (RAG)
Retrieval-augmented generation (RAG) is an architecture that enhances AI model responses by first retrieving relevant information from an external knowledge base and then including that information in the prompt for the model to reference.
Role Prompting
Role prompting is a technique where you assign the AI model a specific professional role or area of expertise to shape the depth, vocabulary, and perspective of its responses.
Self-Consistency
Self-consistency is a prompting strategy where you generate multiple responses to the same question using chain-of-thought reasoning, then select the most common answer among them.
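The selection step reduces to a majority vote over the sampled final answers. A sketch, where the sampled answers stand in for real chain-of-thought completions:

```python
from collections import Counter

# Pick the most common final answer among several sampled reasoning runs.
def self_consistent_answer(sampled_answers):
    counts = Counter(sampled_answers)
    answer, _ = counts.most_common(1)[0]
    return answer

# Five hypothetical samples of the same question at nonzero temperature:
samples = ["42", "42", "41", "42", "40"]
result = self_consistent_answer(samples)
```

The vote is taken over the extracted final answers only, not the reasoning traces, so differently worded chains that reach the same conclusion count together.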
System Prompt
A system prompt is a special set of instructions provided to an AI model before the user's message that defines the model's behavior, personality, constraints, and response format for the entire conversation.
Temperature
Temperature is a parameter that controls the randomness and creativity of an AI model's output.
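Mechanically, temperature divides the model's raw logits before the softmax: low values sharpen the distribution toward the top token, high values flatten it. A self-contained sketch with toy logits:

```python
import math

# Convert raw logits into a probability distribution at a given temperature.
def softmax_with_temperature(logits, temperature):
    scaled = [logit / temperature for logit in logits]
    peak = max(scaled)                       # subtract max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]
cold = softmax_with_temperature(logits, 0.2)  # near-greedy: top token dominates
hot = softmax_with_temperature(logits, 2.0)   # flatter: more randomness
```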
Token
A token is the basic unit of text that AI models use to process and generate language.
Top-P (Nucleus Sampling)
Top-P, also known as nucleus sampling, is a parameter that controls which tokens the model considers when generating each word.
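The filtering step can be sketched directly: sort tokens by probability, keep the smallest prefix whose cumulative mass reaches p, and renormalize. The toy vocabulary below is illustrative:

```python
# Keep the smallest set of tokens whose cumulative probability reaches p.
def top_p_filter(probs, p):
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        cumulative += prob
        if cumulative >= p:
            break
    # Renormalize the surviving tokens so they sum to 1 before sampling.
    total = sum(prob for _, prob in kept)
    return {token: prob / total for token, prob in kept}

probs = {"the": 0.5, "a": 0.3, "an": 0.15, "xyzzy": 0.05}
nucleus = top_p_filter(probs, 0.8)
# "the" and "a" already cover 0.8, so the rarer tokens are dropped.
```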
Tree of Thought Prompting
Tree of thought prompting is an advanced reasoning technique where the AI model explores multiple branching solution paths simultaneously, evaluates each branch, and backtracks from dead ends before selecting the best path to the answer.
Zero-Shot Prompting
Zero-shot prompting is the simplest prompting approach where you give the AI model a task instruction without providing any examples.
Agentic AI
Agentic AI refers to AI systems that can autonomously plan, execute, and iterate on multi-step tasks with minimal human intervention.
Tool Use (Function Calling)
Tool use, also called function calling, is the ability of an AI model to invoke external tools, APIs, or functions during a conversation to perform actions beyond text generation.
Model Context Protocol (MCP)
The Model Context Protocol (MCP) is an open standard developed by Anthropic that provides a universal way to connect AI models to external data sources, tools, and services.
Reasoning Model
A reasoning model is an AI system specifically trained to perform extended, step-by-step thinking before producing a final answer.
AI Guardrails
AI guardrails are safety mechanisms, rules, and constraints built into AI systems to prevent harmful, biased, or undesired outputs.
Structured Output
Structured output refers to AI model responses that follow a specific, machine-readable format such as JSON, XML, CSV, or a defined schema.
Context Caching
Context caching is an optimization technique where AI providers store and reuse previously processed prompt prefixes across multiple API calls.
Embedding
An embedding is a numerical vector representation of text that captures its semantic meaning in a high-dimensional space.
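Similarity between embeddings is usually measured as cosine similarity. A sketch with toy 3-dimensional vectors standing in for real 768-plus-dimensional embeddings:

```python
import math

# Cosine similarity: the angle between two vectors, ignoring their length.
def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings: related words should land near each other.
cat = [0.9, 0.1, 0.2]
kitten = [0.85, 0.15, 0.25]
car = [0.1, 0.9, 0.3]

related = cosine_similarity(cat, kitten)
unrelated = cosine_similarity(cat, car)
```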
Vector Database
A vector database is a specialized database designed to store, index, and efficiently query high-dimensional embedding vectors.
Prompt Optimization
Prompt optimization is the systematic process of iteratively refining prompts to improve the quality, accuracy, and consistency of AI model outputs.
AI Alignment
AI alignment is the field of research and practice focused on ensuring that AI systems behave in accordance with human values, intentions, and goals.
Knowledge Cutoff
A knowledge cutoff is the date beyond which an AI model has no training data, meaning it cannot answer questions about events, discoveries, or changes that occurred after that point.
Inference
Inference is the process of using a trained AI model to generate predictions or outputs from new inputs.
Beam Search
Beam search is a decoding strategy that explores multiple candidate output sequences simultaneously during text generation, keeping the top-k most probable sequences (the "beam width") at each step.
Tokenizer
A tokenizer is the component that converts raw text into a sequence of tokens (numerical IDs) that an AI model can process, and converts model output tokens back into readable text.
Attention Mechanism
An attention mechanism is a neural network component that allows a model to dynamically weigh the importance of different parts of the input when generating each part of the output.
Transformer
A transformer is the neural network architecture that powers virtually all modern large language models, including GPT, Claude, Gemini, and LLaMA.
Prompt Caching
Prompt caching is a performance optimization where the model's computed internal representations (key-value attention states) of a static prompt prefix are stored and reused across multiple requests.
Constitutional AI
Constitutional AI (CAI) is a training methodology developed by Anthropic where an AI model is guided by a set of written principles (a "constitution") to self-critique and revise its own outputs during training.
Reinforcement Learning from Human Feedback (RLHF)
Reinforcement learning from human feedback (RLHF) is a training method where human evaluators rank or score multiple AI outputs, and those preferences are used to train a reward model that further fine-tunes the language model.
Chain of Verification
Chain of verification (CoVe) is a prompting technique where the AI model first generates an initial response, then creates specific verification questions about its own claims, answers those questions independently, and finally revises the original response based on the verification results.
Meta-Prompting
Meta-prompting is the practice of using an AI model to generate, refine, or optimize prompts for other AI tasks.
AI Agent
An AI agent is a software system that uses a large language model as its reasoning core to autonomously plan, execute, and adapt multi-step workflows using external tools and data sources.
Few-Shot Chain of Thought
Few-shot chain of thought is a prompting technique that combines few-shot examples with explicit step-by-step reasoning demonstrations.
Prompt Leaking
Prompt leaking is an attack technique where a user crafts inputs designed to trick an AI model into revealing its hidden system prompt or confidential instructions.
AI Hallucination Detection
AI hallucination detection encompasses the methods, tools, and techniques used to identify when an AI model generates false, fabricated, or unsupported information.
Model Distillation
Model distillation is a technique for creating a smaller, more efficient "student" model that approximates the behavior of a larger "teacher" model.
Synthetic Data
Synthetic data is artificially generated data created by AI models or algorithmic processes rather than collected from real-world events.
Data Poisoning
Data poisoning is an adversarial attack that corrupts an AI model's training data to manipulate its behavior in targeted ways.
Jailbreaking
Jailbreaking refers to techniques used to bypass an AI model's built-in safety restrictions, content policies, and behavioral guidelines to produce outputs the model was trained to refuse.
Semantic Search
Semantic search is an information retrieval approach that finds results based on the meaning of a query rather than exact keyword matches.
Prompt Versioning
Prompt versioning is the practice of tracking changes to prompts over time using version control principles — assigning version identifiers, recording modifications, and maintaining a history of prompt iterations.
Output Parsing
Output parsing is the process of extracting structured, machine-readable data from an AI model's free-form text responses.
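A common case is pulling a JSON object out of a response that also contains conversational filler. A minimal sketch (the response text is a fabricated example):

```python
import json
import re

# Find the first {...} span in free-form model output and parse it.
def extract_json(response_text):
    match = re.search(r"\{.*\}", response_text, re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in response")
    return json.loads(match.group(0))

response = 'Sure! Here is the result:\n{"sentiment": "positive", "score": 0.93}'
data = extract_json(response)
```

Production parsers typically add retry-on-failure logic, since even well-prompted models occasionally emit malformed JSON.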
Latent Space
Latent space is the high-dimensional internal representation space where AI models encode the meaning, relationships, and features of input data as numerical vectors.
Zero-Shot Chain of Thought
Zero-shot chain of thought is a prompting technique where you append a simple phrase like "Let's think step by step" to a question without providing any reasoning examples.
Prompt Compression
Prompt compression encompasses techniques for reducing the length of a prompt while preserving its essential meaning and effectiveness.
AI Safety
AI safety is the interdisciplinary field focused on ensuring that AI systems behave as intended, remain under human control, and do not cause unintended harm.
Red Teaming
Red teaming in AI is the practice of systematically probing an AI system for vulnerabilities, failure modes, and harmful behaviors through adversarial testing.
Benchmark
A benchmark in AI is a standardized test suite with predefined tasks, datasets, and evaluation metrics used to measure and compare model performance.
Perplexity
Perplexity is a standard metric for evaluating how well a language model predicts a sequence of text.
Logits
Logits are the raw, unnormalized numerical scores that a language model assigns to each token in its vocabulary as the potential next token.
Sampling
Sampling is the process of selecting the next token from the probability distribution a language model produces at each generation step.
Stop Sequence
A stop sequence is a predefined token, string, or pattern that signals the AI model to immediately stop generating text when encountered in the output.
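Client-side, the same idea is a simple truncation at the first occurrence of any stop string. A sketch with a hypothetical chat-style transcript:

```python
# Cut generated text at the earliest occurrence of any stop sequence.
def apply_stop_sequences(text, stop_sequences):
    cut = len(text)
    for stop in stop_sequences:
        idx = text.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut]

raw = "Answer: 42\n\nHuman: what about..."
clean = apply_stop_sequences(raw, ["\n\nHuman:"])
```

APIs apply this during generation, so tokens after the stop sequence are never produced or billed.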
JSON Mode
JSON mode is a model configuration setting that constrains the AI's output to be valid, parseable JSON.
Vision-Language Model (VLM)
A vision-language model (VLM) is an AI system that can process, understand, and reason about both visual inputs (images, screenshots, diagrams) and text simultaneously within a single model architecture.
Function Calling
Function calling is an AI model capability where the model analyzes a user's prompt and generates structured JSON specifying which external function to invoke and what arguments to pass.
Test-Time Compute
Test-time compute is the practice of allocating additional computational resources during inference — when the model generates a response — rather than during training.
AI Overview
An AI Overview is an AI-generated summary box that appears at the top of Google search results, synthesizing information from multiple web sources to answer a user's query directly.
Generative Engine Optimization (GEO)
Generative engine optimization (GEO) is the practice of structuring and enhancing content so that AI-powered platforms — like ChatGPT, Perplexity, and Google AI Overviews — cite, reference, or recommend it when generating responses.
Answer Engine Optimization (AEO)
Answer engine optimization (AEO) is a content strategy focused on structuring web content to appear as direct answers in featured snippets, People Also Ask boxes, voice search results, and AI-generated summaries.
Thinking Model
A thinking model is an AI system that uses extended inference-time computation to reason through problems before producing a final answer.
Prompt Routing
Prompt routing is the practice of automatically directing each user prompt to the most suitable AI model based on task type, complexity, and cost constraints.
Multimodal Prompting
Multimodal prompting is the practice of combining multiple input types — such as text, images, audio, or video — within a single prompt to give an AI model richer context for its response.
Prompt Tuning
Prompt tuning is a parameter-efficient technique that adapts a large language model to specific tasks by training small learnable vectors called "soft prompts" that are prepended to the input.
Instruction Following
Instruction following is an AI model's ability to accurately understand and execute explicit directions given in a prompt — including format requirements, length constraints, tone specifications, and multi-step procedures.
Code Interpreter
A code interpreter is an AI capability that allows a model to write and execute code — typically Python — in a sandboxed environment to solve analytical, mathematical, or data processing tasks.
Deep Research
Deep research is an AI capability where the model autonomously conducts multi-step web research to produce comprehensive, sourced reports on complex topics.
Mixture of Experts (MoE)
Mixture of experts (MoE) is a neural network architecture that divides a model into many specialized sub-networks called "experts" and uses a routing mechanism to activate only a small subset of them for each input.
Knowledge Graph
A knowledge graph is a structured database that represents real-world entities (people, places, concepts) and the relationships between them as an interconnected network of nodes and edges.
Quantization
Quantization is a technique that reduces an AI model's numerical precision — for example, converting 16-bit floating-point weights to 4-bit integers — to shrink the model's memory footprint and speed up inference.
LoRA (Low-Rank Adaptation)
LoRA (Low-Rank Adaptation) is a parameter-efficient fine-tuning technique that adapts a pre-trained AI model to new tasks by injecting small trainable matrices into the model's layers while keeping the original weights frozen.
Prompt Ensembling
Prompt ensembling is a technique that runs multiple variations of a prompt for the same task and combines their outputs to produce a more accurate and robust final result.
Self-Reflection
Self-reflection is a prompting technique where an AI model evaluates, critiques, and improves its own output in one or more follow-up steps.
Direct Preference Optimization (DPO)
Direct preference optimization (DPO) is a training technique that aligns AI models with human preferences by learning directly from pairs of preferred and rejected outputs — without needing a separate reward model.
KV-Cache
A KV-cache (key-value cache) stores the computed attention key and value matrices from previously processed tokens so the model does not need to recalculate them when generating each new token.
AI Watermarking
AI watermarking is the practice of embedding hidden, machine-detectable patterns into AI-generated content — text, images, audio, or video — so that the content can later be identified as AI-produced.
Prompt Injection Defense
Prompt injection defense refers to the techniques and strategies used to protect AI systems from prompt injection attacks, where malicious inputs attempt to override the model's original instructions.
Context Stuffing
Context stuffing is the technique of loading relevant information — documents, data, or examples — directly into an AI model's prompt to give it the knowledge needed to answer accurately.
Model Collapse
Model collapse is a phenomenon where AI models progressively degrade when trained on data generated by other AI models rather than human-created content.
Autonomous Agent
An autonomous agent is an AI system that can independently plan, decide, and execute multi-step tasks to achieve a goal with minimal human oversight.
Benchmark Contamination
Benchmark contamination occurs when an AI model's training data accidentally or deliberately includes questions and answers from the benchmark tests used to evaluate it.
Emergent Behavior
Emergent behavior in AI refers to capabilities that appear unexpectedly in large language models as they scale up in size, without being explicitly programmed or trained for those tasks.
Catastrophic Forgetting
Catastrophic forgetting is a phenomenon where a neural network rapidly loses previously learned knowledge when it is trained on new data or tasks.
Few-Shot Learning
Few-shot learning is a machine learning approach where a model learns to perform a new task from only a handful of training examples — sometimes as few as one to five.
Transfer Learning
Transfer learning is a machine learning technique where a model trained on one task or dataset is reused as the starting point for a different but related task.
Semantic Similarity
Semantic similarity is a measure of how close two pieces of text are in meaning, regardless of whether they share the same words.
Grok
Grok is the family of conversational AI models built by xAI, distinguished from other major assistants by its real-time access to posts on X (formerly Twitter) and a less filtered response style.
xAI
xAI is the artificial intelligence research company founded by Elon Musk in 2023.
Context Engineering
Context engineering is the discipline of deliberately assembling everything an AI model sees at inference time — system prompt, retrieved documents, conversation memory, tool outputs, few-shot examples, and formatting scaffolding — so the model has exactly the information it needs to produce a high-quality response.
ReAct Prompting
ReAct prompting is a technique that interleaves Reasoning and Acting: the model writes a short reasoning trace about what to do next, takes an action (typically a tool call such as a search or calculation), observes the result, and then reasons again before the next step.
SurePrompts Quality Rubric
The SurePrompts Quality Rubric is a 7-dimension scoring framework for evaluating prompt quality: role clarity, context sufficiency, instruction specificity, format structure, example quality, constraint tightness, and output validation.
RCAF Prompt Structure
RCAF is a 4-part prompt skeleton — Role, Context, Action, Format — for drafting maintainable AI prompts.
Context Engineering Maturity Model
The Context Engineering Maturity Model is a 5-level framework for describing how sophisticated a team's context assembly practice is.
Agentic Prompt Stack
The Agentic Prompt Stack is a 6-layer model for designing prompts that run AI agents: Goals, Tool permissions, Planning scaffold, Memory access, Output validation, and Error recovery.
Extended Thinking
Extended thinking is a Claude feature that lets the model allocate additional reasoning tokens before producing its final answer, with a user-controllable thinking budget set per request.
Computer Use
Computer use is an Anthropic capability in which Claude controls a virtual computer via screenshots and keyboard/mouse actions.
Voice Prompting
Voice prompting is the practice of writing prompts for realtime voice and audio AI interfaces — speech-to-speech systems, voice agents, and realtime APIs — where the output will be spoken aloud rather than read.
Semantic Caching
Semantic caching is a pattern for caching LLM responses keyed by meaning similarity rather than exact prompt match.
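The core mechanic is a nearest-neighbor lookup over query embeddings with a similarity threshold. A toy sketch, where the 2-dimensional vectors stand in for real embedding-model output:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# A minimal semantic cache: return a stored response when a new query's
# embedding is close enough to a previously answered one.
class SemanticCache:
    def __init__(self, threshold=0.95):
        self.entries = []          # list of (embedding, response) pairs
        self.threshold = threshold

    def get(self, query_embedding):
        best, best_score = None, 0.0
        for embedding, response in self.entries:
            score = cosine(embedding, query_embedding)
            if score > best_score:
                best, best_score = response, score
        return best if best_score >= self.threshold else None

    def put(self, query_embedding, response):
        self.entries.append((query_embedding, response))

cache = SemanticCache()
cache.put([1.0, 0.0], "Paris")
hit = cache.get([0.99, 0.05])   # near-duplicate phrasing: cache hit
miss = cache.get([0.0, 1.0])    # unrelated query: cache miss
```

Real systems back this with a vector index instead of a linear scan, and tune the threshold to trade hit rate against the risk of serving a stale or mismatched answer.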
LLM-as-Judge
LLM-as-judge is an evaluation pattern in which an LLM scores the outputs of another model against a rubric.
Plan-and-Execute Prompting
Plan-and-execute prompting is a two-phase agent pattern: a planner first drafts the complete sequence of steps needed to reach the goal, then an executor carries out each step in order. Separating the phases keeps planning calls cheap and makes the agent's intentions auditable before any action runs.
Self-Refine Prompting
Self-refine prompting is an iterative pattern in which the model generates an output, critiques its own output against specified criteria, then produces a revised version.
Reflexion Prompting
Reflexion is an agent prompting pattern in which, after a failed attempt, the agent generates a short verbal reflection on what went wrong and uses that reflection as additional context for its next attempt.
Skeleton of Thought
Skeleton of thought is a reasoning pattern in which the model first produces a compact skeleton of the answer — a list of points or an outline — and then expands each skeleton point, often as independent sub-prompts running in parallel.
Tool Choice
Tool choice is an API parameter on modern tool-calling models that controls whether and how the model selects a tool.
Chain of Density
Chain of density is a summarization technique in which the model iteratively rewrites a summary, each pass adding more salient entities while keeping total length constant.
Program of Thoughts
Program of thoughts is a reasoning technique in which the model generates code — typically Python — to solve a numerical or logical problem, then executes the code to obtain the answer.
Self-Ask Prompting
Self-ask prompting is a reasoning pattern in which the model explicitly asks itself follow-up questions before answering a composite question.
ReWOO (Reasoning WithOut Observation)
ReWOO is an agent architecture that separates planning from execution: a planner writes the full chain of tool calls up front, using placeholders for results it has not yet seen; a worker executes the calls; and a solver combines the collected evidence into a final answer. Because intermediate observations are never fed back into the planner, it uses far fewer tokens than observation-driven loops like ReAct.
RAFT (Retrieval-Augmented Fine-Tuning)
RAFT is a training technique that combines retrieval-augmented generation with fine-tuning: the model is trained on examples whose context mixes relevant documents with distractors, teaching it to quote the right sources and ignore irrelevant retrieved text.
Eval Harness
An eval harness is infrastructure that runs a prompt or model against a fixed test set and computes aggregate scores per metric.
Golden Set
A golden set is a curated collection of input-output pairs that represent the correct behavior for a given task.
Indirect Prompt Injection
Indirect prompt injection is a security vulnerability in which malicious instructions are embedded in content the model retrieves — a web page, email, PDF, or database row — rather than typed by the end user.
DSPy
DSPy is a programming framework, originally from Stanford, that treats prompts as functions with typed signatures rather than strings.
Prompt Observability
Prompt observability is the operational practice of logging, tracing, and monitoring prompt inputs, outputs, and model behavior in production.
Least-to-Most Prompting
Least-to-most prompting is a reasoning pattern in which the model first decomposes a complex problem into an ordered sequence of easier sub-problems, then solves each sub-problem in turn, feeding earlier answers into later ones.
Auto-CoT (Automatic Chain of Thought)
Auto-CoT is a method for generating chain-of-thought demonstrations automatically rather than hand-writing them: questions are clustered by similarity, a representative is sampled from each cluster, and zero-shot chain of thought ("Let's think step by step") produces a reasoning chain for each, yielding a diverse set of few-shot demonstrations.
Step-Back Prompting
Step-back prompting is a technique in which the model first generates a higher-level abstraction, principle, or generalization — a "step back" from the specific question — before answering.
Active Prompting
Active prompting is an adaptive approach to few-shot example selection that borrows from active learning.
Chain of Code
Chain of Code is a hybrid reasoning pattern in which the model produces a trace that interleaves executable code with natural-language "pseudocode" comments.
Self-Debug Prompting
Self-debug prompting is a pattern in which the model generates code, an interpreter executes it, and the model receives the execution result — error messages, failed test output, or unexpected values — as additional context for a revised attempt.
Model Cascade
A model cascade is a routing pattern in which each request is first attempted by a cheaper, smaller model and only escalated to a stronger, more expensive model when the small model's confidence is low or its output fails a validation check.
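The routing logic itself is small. A sketch in which both model functions are stand-ins (a real cascade would call two different APIs and derive confidence from log-probabilities or a validator):

```python
# Hypothetical cheap model: pretends short prompts are easy, long ones hard.
def cheap_model(prompt):
    confidence = 0.9 if len(prompt) < 50 else 0.3
    return f"cheap answer to: {prompt}", confidence

# Hypothetical strong model: slower and more expensive, assumed reliable.
def strong_model(prompt):
    return f"strong answer to: {prompt}"

# Try the cheap model first; escalate only when its confidence is low.
def cascade(prompt, threshold=0.8):
    answer, confidence = cheap_model(prompt)
    if confidence >= threshold:
        return answer, "cheap"
    return strong_model(prompt), "strong"

_, tier = cascade("2 + 2?")
```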
Prefix-Tuning
Prefix-tuning is a parameter-efficient fine-tuning method in which a small set of continuous, trainable vectors — the "prefix" — is prepended to the input at every transformer layer and the underlying model weights are frozen.
RAGAS
RAGAS is an open-source evaluation framework for retrieval-augmented generation systems.
Structured Decoding
Structured decoding is an inference-time technique that constrains the model's output to conform to a grammar, regular expression, or JSON schema by masking invalid tokens at each generation step.
Chunking
Chunking is the process of splitting source documents into smaller pieces before they are embedded and indexed for retrieval.
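The simplest strategy is fixed-size chunks with overlap, so that a sentence cut at a boundary still appears intact in the neighboring chunk. A sketch measured in characters for simplicity (production systems usually chunk by tokens or by semantic boundaries):

```python
# Split text into fixed-size chunks, each overlapping the previous one.
def chunk_text(text, chunk_size=100, overlap=20):
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap   # advance, keeping `overlap` chars shared
    return chunks

doc = "".join(str(i % 10) for i in range(250))
chunks = chunk_text(doc, chunk_size=100, overlap=20)
```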
Reranking
Reranking is a secondary scoring pass over an initial set of retrieval candidates to improve their ordering before they are handed to the generator.
HyDE (Hypothetical Document Embeddings)
HyDE is a retrieval technique in which the language model first generates a hypothetical answer to the user's query, and then that hypothetical answer — not the original query — is embedded and used to retrieve real documents by vector similarity.
Embedding Model
An embedding model is a machine-learning model that maps text (or images, audio, code) to a fixed-dimensional vector such that semantically similar inputs land near each other in vector space.
Hybrid Search
Hybrid search is a retrieval technique that combines keyword-based search — typically BM25 over an inverted index — with vector-based semantic search, and fuses the two rankings into a single result list.
Query Rewriting
Query rewriting is a retrieval preprocessing step that transforms the user's question before it is sent to the retriever.
RLAIF (Reinforcement Learning from AI Feedback)
RLAIF is a training technique that uses AI-generated preferences — typically from a strong LLM acting as a judge — to guide reinforcement-learning fine-tuning, in place of the human labelers used in RLHF.
Mixture of Prompts
Mixture of prompts is an ensembling pattern where the same input is run through several different prompts and the resulting outputs are combined — by majority vote, averaging, or a meta-model that reads all of them.
Self-Critique Prompting
Self-critique prompting is a pattern where the model is asked to evaluate its own output against specific criteria, surface weaknesses, and suggest improvements — but deliver the critique as an output, not a rewrite.
GraphRAG
GraphRAG is a retrieval-augmented-generation variant that builds a knowledge graph from the source corpus — extracting entities, relationships, and community clusters — and uses the graph structure as retrieval context alongside or in place of raw document chunks.
Agentic RAG
Agentic RAG is a pattern where retrieval is treated as a tool call inside an agent loop rather than as a fixed first step in a linear pipeline.
Corrective RAG (CRAG)
Corrective RAG is a 2024 retrieval pattern that adds a relevance-grading step between retrieval and generation: every retrieved document is scored by a lightweight evaluator for how well it answers the query, and the pipeline branches on the aggregate confidence.
Self-RAG
Self-RAG is a pattern in which the language model emits special reflection tokens that control its own retrieval and generation decisions.
Contextual Compression
Contextual compression is a preprocessing step that sits between retrieval and generation in a RAG pipeline: retrieved chunks are filtered or condensed so that only the passages actually relevant to the query reach the generator's context window.
Parent-Document Retrieval
Parent-document retrieval is a chunking-and-retrieval pattern that separates the unit used for matching from the unit used for generation.
Semantic Router
A semantic router is an embedding-based routing layer that classifies an incoming query to one of several downstream prompts, agents, tools, or models by computing similarity between the query embedding and a set of labeled reference utterances.
Multimodal RAG
Multimodal RAG is a retrieval-augmented-generation variant in which the indexed corpus and the retrieval step span multiple modalities — text, images, tables, figures, audio, or video — not just plain text.
Long-Term Memory (Agent Memory)
Long-term memory is a persistent store that gives an agent access to information across sessions — user preferences, prior decisions, past tool results worth remembering, or accumulated background about a project.
Context Rot
Context rot is the degradation of model performance as a context window fills up with more content.
Document AI (Layout-Aware Parsing)
Document AI refers to techniques and services for extracting structured content from complex documents — layout, reading order, tables, figures, forms, handwriting — before the output is fed to an LLM or embedding model.
BM25
BM25 is the dominant sparse-retrieval ranking function and the default scorer in Lucene-based engines such as Elasticsearch, OpenSearch, and Solr. It combines saturating term frequency with inverse document frequency, normalized for document length.
Reciprocal Rank Fusion (RRF)
Reciprocal Rank Fusion is a technique for merging several ranked result lists — produced by different retrievers over the same corpus — into a single unified ranking.
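The fused score of a document is the sum of 1 / (k + rank) over every list it appears in, with k conventionally set to 60 and ranks 1-based. A self-contained sketch with hypothetical document IDs:

```python
# Merge several ranked lists into one by Reciprocal Rank Fusion.
def reciprocal_rank_fusion(ranked_lists, k=60):
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)

bm25_results = ["doc_a", "doc_b", "doc_c"]     # keyword retriever's ranking
vector_results = ["doc_b", "doc_d", "doc_a"]   # semantic retriever's ranking
fused = reciprocal_rank_fusion([bm25_results, vector_results])
```

Because only ranks matter, RRF needs no score normalization across retrievers, which is why it is the usual fusion step in hybrid search.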
Contextual Retrieval
Contextual Retrieval is a technique introduced by Anthropic in 2024 that prepends a short chunk-specific context summary to each chunk before the chunk is embedded and before it is indexed for BM25, so that chunks which would otherwise lack standalone meaning remain retrievable.
Cross-Encoder
A cross-encoder is a transformer architecture that takes a query and a candidate document as a single joint input — typically concatenated with a separator token — and outputs one scalar relevance score.
Bi-Encoder
A bi-encoder is a dual-tower transformer architecture in which the query and the document are encoded independently by the same (or twin) encoder into separate fixed-size vectors, and relevance is computed as cosine or dot-product similarity between those vectors.
ColBERT (Late Interaction Retrieval)
ColBERT is a retrieval architecture that sits between bi-encoders and cross-encoders: query and document are encoded into per-token embeddings independently (so documents can be indexed offline), and relevance is scored with a cheap "late interaction" step that takes, for each query token, its maximum similarity to any document token and sums the results.
Many-Shot Jailbreaking
Many-shot jailbreaking is a long-context attack pattern identified by Anthropic researchers in 2024: the attacker packs the prompt with hundreds of fabricated dialogue turns in which an assistant complies with harmful requests, exploiting in-context learning to override the model's safety training.
Multi-Query Retrieval
Multi-query retrieval is a RAG pattern that hedges against the single-query-phrasing failure mode of standard retrieval: the model generates several rephrasings of the user's question, each variant is sent to the retriever, and the result sets are merged before generation.
Needle in a Haystack
Needle in a haystack is a long-context evaluation pattern that measures whether a model can retrieve a specific fact (the needle) planted at an arbitrary position inside a long irrelevant passage (the haystack).
Lost in the Middle
Lost in the middle is the finding from Liu et al. (2023) that language models use information at the beginning and end of a long context far more reliably than information placed in the middle, where accuracy drops sharply.
Text to Speech (TTS)
Text to speech, or TTS, is the synthesis of spoken audio from written text.
Speech to Text (STT)
Speech to text, or STT — also called automatic speech recognition (ASR) — is the transcription of spoken audio into written text.
Voice Cloning
Voice cloning is the synthesis of a target speaker's voice from a short reference audio sample, allowing a TTS system to produce new speech in that speaker's timbre, accent, and (to a lesser extent) speaking style.
Speaker Diarization
Speaker diarization is the "who spoke when" task: segmenting a multi-speaker audio recording by speaker identity and attaching speaker labels to each transcript segment.
Prosody
Prosody is the rhythm, stress, intonation, and pacing of speech — the suprasegmental layer above individual phonemes that carries emotion, emphasis, and the difference between a question and a statement.
Realtime Voice API
A realtime voice API is a speech-to-speech architecture that accepts streaming audio input and returns streaming audio output directly, without the classical STT-then-LLM-then-TTS pipeline.