Tokenizer
A tokenizer is the component that converts raw text into a sequence of tokens (numerical IDs) that an AI model can process, and converts model output tokens back into readable text. Different models use different tokenization schemes — for example, Byte Pair Encoding (BPE) or SentencePiece — which affects how text is split, how many tokens a given text consumes, and how the model handles different languages.
Example
Using OpenAI's tiktoken tokenizer, the word "embeddings" becomes two tokens: ["embed", "dings"]. The phrase "café" might become ["caf", "é"]. Japanese text typically requires more tokens per character than English, which is why a 4,000-token limit fits less Japanese text than English text.
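The splitting behavior above comes from BPE's merge loop: start from individual characters and repeatedly merge the highest-priority adjacent pair listed in a learned merge table. Here is a minimal sketch of that loop in plain Python; the merge table is hand-picked for illustration (a real tokenizer learns thousands of merges from a training corpus), so the ranks and pairs below are assumptions, not tiktoken's actual table.

```python
def bpe_tokenize(word, merges):
    # Start with one symbol per character, then repeatedly merge the
    # adjacent pair with the lowest rank in the merge table.
    symbols = list(word)
    while True:
        best = None  # (rank, position) of the best mergeable pair
        for i in range(len(symbols) - 1):
            rank = merges.get((symbols[i], symbols[i + 1]))
            if rank is not None and (best is None or rank < best[0]):
                best = (rank, i)
        if best is None:
            break  # no more applicable merges
        _, i = best
        symbols = symbols[:i] + [symbols[i] + symbols[i + 1]] + symbols[i + 2:]
    return symbols

# Toy merge table: lower rank = higher priority (hypothetical, not learned)
merges = {
    ("e", "m"): 0, ("em", "b"): 1, ("emb", "e"): 2, ("embe", "d"): 3,
    ("d", "i"): 4, ("di", "n"): 5, ("din", "g"): 6, ("ding", "s"): 7,
}

print(bpe_tokenize("embeddings", merges))  # → ['embed', 'dings']
```

Because "embed" and "dings" each appear as the result of a chain of merges, the word ends up as two tokens; a word the table has never seen simply falls back to smaller pieces, down to single characters if necessary.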