
HyDE (Hypothetical Document Embeddings)

HyDE is a retrieval technique in which a language model first generates a hypothetical answer to the user's query. That hypothetical answer, rather than the original query, is then embedded and used to retrieve real documents by vector similarity.

The idea, introduced by Gao et al. in 2022, is that in embedding space a plausible-looking answer is often closer to the real supporting documents than a short, under-specified question. The hypothetical answer does not need to be factually correct; it only needs to resemble the documents that should match. HyDE helps most when queries are terse and the target documents are long-form prose, so that the vocabulary and sentence shape of query and document differ substantially. It is applied at query time and requires no change to training, so it composes cleanly with other retrieval upgrades.
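The flow described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: `embed` is a toy bag-of-words stand-in for a real embedding model, `generate_hypothetical_answer` is a hypothetical stub that returns a canned paragraph where a real system would call a language model, and the three documents are invented for the demo.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding (assumption): word-count vector standing in for a real encoder."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def generate_hypothetical_answer(query):
    """Hypothetical stub for the LLM call; a real system would prompt a model here."""
    return ("Vitamin D supplementation may support muscle protein synthesis "
            "and shorten recovery time after strenuous exercise.")


def hyde_retrieve(query, documents, k=2):
    """Rank documents by similarity to the hypothetical answer, not the raw query."""
    hypothetical = generate_hypothetical_answer(query)
    search_vec = embed(hypothetical)  # the key HyDE step
    ranked = sorted(documents,
                    key=lambda doc: cosine(search_vec, embed(doc)),
                    reverse=True)
    return ranked[:k]


# Invented corpus for illustration only.
DOCS = [
    "Athletes given vitamin D showed faster muscle recovery and higher "
    "protein synthesis after exercise.",
    "Quarterly earnings rose as the company expanded its retail footprint.",
    "Sleep duration affects reaction time in shift workers.",
]

top = hyde_retrieve("effects of vitamin D on muscle recovery?", DOCS, k=2)
```

Only the search vector changes relative to a direct-query pipeline; the document index and the similarity function stay the same, which is why the technique slots into an existing retriever without retraining.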

Example

A research assistant is asked "effects of vitamin D on muscle recovery?", which is a short noun phrase rather than a full question. Directly embedding the query returns mixed results. With HyDE, the model first drafts a paragraph-length hypothetical answer about vitamin D, muscle-protein synthesis, and recovery timelines; that paragraph is then embedded. Vector search against the hypothetical retrieves seven more on-topic studies in the top ten than the direct-query baseline did.
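A comparison like "seven more on-topic studies in the top ten" is a difference in hits at k = 10 between two rankings. The sketch below shows one way to compute it. The document ids, the relevance set, and both rankings are made up purely to illustrate the arithmetic; they are not measurements from any real system.

```python
def on_topic_at_k(ranked_ids, relevant_ids, k=10):
    """Count how many of the first k retrieved ids are in the relevant set."""
    return sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)


# Illustrative data only: "s*" ids are on-topic studies, "x*" ids are off-topic.
relevant = {"s1", "s2", "s3", "s4", "s5", "s6", "s7", "s8", "s9"}
direct_ranking = ["s1", "x1", "x2", "s2", "x3", "x4", "x5", "x6", "x7", "x8"]
hyde_ranking = ["s1", "s2", "s3", "x1", "s4", "s5", "s6", "s7", "s8", "s9"]

direct_hits = on_topic_at_k(direct_ranking, relevant)
hyde_hits = on_topic_at_k(hyde_ranking, relevant)
improvement = hyde_hits - direct_hits
```

Running the same count over a labeled query set, rather than a single query, gives a fairer picture of whether HyDE helps on a given corpus.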
