Tokenizer

A tokenizer is the component that converts raw text into a sequence of tokens, each represented by a numerical ID that an AI model can process, and converts the model's output tokens back into readable text. Different models use different tokenization schemes, such as Byte Pair Encoding (BPE) or the unigram model implemented by the SentencePiece library. The choice of scheme affects how text is split, how many tokens a given text consumes, and how the model handles different languages.
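To make the text-to-IDs round trip concrete, here is a minimal sketch of a byte-level BPE tokenizer. It is a toy written for illustration, not the implementation used by any particular model: the function names (train_bpe, apply_merge, build_vocab, encode, decode), the tiny training corpus, and the number of merges are all assumptions chosen for this example.

```python
from collections import Counter


def apply_merge(symbols: list[bytes], pair: tuple[bytes, bytes]) -> list[bytes]:
    """Replace every adjacent occurrence of `pair` with one fused symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out


def train_bpe(text: str, num_merges: int) -> list[tuple[bytes, bytes]]:
    """Learn merge rules by repeatedly fusing the most frequent adjacent pair."""
    symbols = [bytes([b]) for b in text.encode("utf-8")]
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(symbols, symbols[1:]))
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        symbols = apply_merge(symbols, best)
    return merges


def build_vocab(merges: list[tuple[bytes, bytes]]) -> dict[bytes, int]:
    """IDs 0-255 are the raw bytes; each learned merge gets the next free ID."""
    vocab = {bytes([i]): i for i in range(256)}
    for a, b in merges:
        vocab.setdefault(a + b, len(vocab))
    return vocab


def encode(text: str, merges: list[tuple[bytes, bytes]], vocab: dict[bytes, int]) -> list[int]:
    """Text -> token IDs: start from UTF-8 bytes, apply merges in learned order."""
    symbols = [bytes([b]) for b in text.encode("utf-8")]
    for pair in merges:
        symbols = apply_merge(symbols, pair)
    return [vocab[s] for s in symbols]


def decode(ids: list[int], vocab: dict[bytes, int]) -> str:
    """Token IDs -> text: look up each ID's bytes and join them."""
    inverse = {i: s for s, i in vocab.items()}
    return b"".join(inverse[i] for i in ids).decode("utf-8")


# Hypothetical toy corpus; real tokenizers are trained on far larger text.
corpus = "low lower lowest low low"
merges = train_bpe(corpus, num_merges=10)
vocab = build_vocab(merges)

ids = encode("lowest", merges, vocab)
print(ids)                 # a short list of integer IDs
print(decode(ids, vocab))  # the original text, recovered from the IDs
```

Because the base vocabulary covers all 256 byte values, any UTF-8 text can be encoded even if it never appeared in the training corpus; unseen text simply falls back to more, smaller tokens.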

Example

With OpenAI's tiktoken library, the word "embeddings" may be split into two tokens, such as ["embed", "dings"], and the word "café" might become ["caf", "é"]; the exact split depends on which encoding is used. Japanese text typically requires more tokens per character than English, which is why a 4,000-token limit holds less Japanese text than English text.
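One contributing reason for the gap between languages can be sketched with the standard library alone. Byte-level tokenizers such as tiktoken's encodings operate on UTF-8 bytes, so the byte count of a string is the number of units the tokenizer starts from before any merges, and an upper bound on its token count (special tokens aside). The helper name utf8_profile and the sample strings are assumptions for this illustration; the sketch does not report real tiktoken token counts.

```python
def utf8_profile(text: str) -> dict[str, float]:
    """Count characters and UTF-8 bytes; bytes are the starting units for byte-level BPE."""
    data = text.encode("utf-8")
    return {
        "chars": len(text),
        "bytes": len(data),
        "bytes_per_char": len(data) / len(text),
    }


samples = [
    "embeddings",          # plain ASCII: 1 byte per character
    "caf\u00e9",           # "café": the accented letter takes 2 bytes
    "\u65e5\u672c\u8a9e",  # "日本語" (Japanese): 3 bytes per character
]

for sample in samples:
    print(sample, utf8_profile(sample))
```

A trained tokenizer then merges frequent byte sequences into single tokens. If its training data was mostly English, English words tend to compress well while Japanese characters are more likely to remain as one or more tokens each.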
