Tokenizer: Get a better understanding of what counts as a token (Grok)


In the context of Large Language Models (LLMs), a token is a unit of text that the model processes. Tokens can represent whole words, subwords, characters, or even punctuation, depending on the tokenization method used.

How Tokens Work in LLMs

  • Tokenization: Before processing text, LLMs convert input text into tokens using a tokenizer. This step breaks the text into manageable pieces based on predefined rules.
  • Vocabulary: LLMs have a fixed vocabulary of tokens they can understand. If a word isn't in the vocabulary, it may be split into multiple subword tokens.
  • Processing: Each token is mapped to an integer ID, which indexes a numerical representation (embedding) that the model processes to generate output (see the sketch after this list).
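
Here is a minimal sketch of that pipeline in Python, using the open-source tiktoken library (the tokenizer for OpenAI's GPT models; Grok's own tokenizer isn't distributed as a Python package, so tiktoken stands in here as an illustration):

```python
# pip install tiktoken
import tiktoken

# Load the fixed BPE vocabulary used by GPT-4 / GPT-3.5 ("cl100k_base").
enc = tiktoken.get_encoding("cl100k_base")

text = "Hello world!"

# Tokenization: text -> list of integer token IDs.
token_ids = enc.encode(text)
print(token_ids)  # e.g. [9906, 1917, 0]

# Each ID corresponds to one entry in the fixed vocabulary.
for tid in token_ids:
    print(tid, enc.decode_single_token_bytes(tid))

# Decoding reverses the mapping: IDs -> original text.
print(enc.decode(token_ids))  # Hello world!
```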

Examples of Tokenization

Word-based: "Hello world!" → ["Hello", "world", "!"]

Subword-based (Byte-Pair Encoding, BPE, used in GPT models; see the sketch after these examples):

  • "unhappiness" → ["un", "happiness"]
  • "running" → ["run", "ning"]

Character-based (used in some models):

  • "Hello" → ["H", "e", "l", "l", "o"]

Why Tokens Matter

  • Cost: LLM providers charge based on token usage (e.g., OpenAI prices GPT-4 per token), so counting tokens ahead of time helps estimate cost (see the sketch after this list).
  • Context Length: Models have a maximum number of tokens they can process in a single request (e.g., GPT-4 Turbo has a 128K token limit).
  • Processing Speed: More tokens mean longer processing times and higher computational costs.
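
Because cost and context length are both measured in tokens, it is common practice to count tokens before sending a request. A minimal sketch (the per-token price below is a hypothetical placeholder, not a real rate):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

prompt = "Explain what a token is in the context of LLMs."
n_tokens = len(enc.encode(prompt))

CONTEXT_LIMIT = 128_000   # e.g. GPT-4 Turbo's context window
PRICE_PER_1K = 0.01       # hypothetical price in $ per 1K input tokens

print(f"Prompt uses {n_tokens} tokens")
print(f"Fits in the context window: {n_tokens <= CONTEXT_LIMIT}")
print(f"Estimated input cost: ${n_tokens / 1000 * PRICE_PER_1K:.6f}")
```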

Grok Tokenizer

You can open the Tokenizer tool in Grok:
(screenshot: the Tokenizer entry in Grok's interface)

It gives you a rough idea of what the tokens look like. IMHO, they are mostly similar to whole words!
(screenshot: sample text split into tokens by Grok's Tokenizer)

Steem to the Moon🚀!
