Tokenizer: Get a better understanding of what counts as a token (Grok)


In the context of Large Language Models (LLMs), a token is a unit of text that the model processes. Tokens can represent whole words, subwords, characters, or even punctuation, depending on the tokenization method used.

How Tokens Work in LLMs

  • Tokenization: Before processing text, LLMs convert input text into tokens using a tokenizer. This step breaks the text into manageable pieces based on predefined rules.
  • Vocabulary: LLMs have a fixed vocabulary of tokens they can understand. If a word isn't in the vocabulary, it may be split into multiple subword tokens.
  • Processing: Each token is mapped to an integer ID, which indexes a numerical representation (embedding) that the model processes to generate output (see the sketch after this list).
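
Here is a minimal sketch of that pipeline in Python, using the open-source tiktoken library (the tokenizer for OpenAI's GPT models; Grok's own tokenizer isn't distributed as a Python package, so tiktoken stands in here as an illustration):

```python
# pip install tiktoken
import tiktoken

# Load the fixed BPE vocabulary used by GPT-4 / GPT-3.5 ("cl100k_base").
enc = tiktoken.get_encoding("cl100k_base")

text = "Hello world!"

# Tokenization: text -> list of integer token IDs.
token_ids = enc.encode(text)
print(token_ids)  # e.g. [9906, 1917, 0]

# Each ID corresponds to one entry in the fixed vocabulary.
for tid in token_ids:
    print(tid, enc.decode_single_token_bytes(tid))

# Decoding reverses the mapping: IDs -> original text.
print(enc.decode(token_ids))  # Hello world!
```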

Examples of Tokenization

Word-based: "Hello world!" → ["Hello", "world", "!"]

Subword-based (Byte-Pair Encoding, BPE, used in GPT models; see the sketch after these examples):

  • "unhappiness" → ["un", "happiness"]
  • "running" → ["run", "ning"]

Character-based (used in some models):

  • "Hello" → ["H", "e", "l", "l", "o"]

Why Tokens Matter

  • Cost: LLM providers charge based on token usage (e.g., OpenAI prices GPT-4 per token), so counting tokens ahead of time helps estimate cost (see the sketch after this list).
  • Context Length: Models have a maximum number of tokens they can process in a single request (e.g., GPT-4 Turbo has a 128K token limit).
  • Processing Speed: More tokens mean longer processing times and higher computational costs.
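
Because cost and context length are both measured in tokens, it is common practice to count tokens before sending a request. A minimal sketch (the per-token price below is a hypothetical placeholder, not a real rate):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

prompt = "Explain what a token is in the context of LLMs."
n_tokens = len(enc.encode(prompt))

CONTEXT_LIMIT = 128_000   # e.g. GPT-4 Turbo's context window
PRICE_PER_1K = 0.01       # hypothetical price in $ per 1K input tokens

print(f"Prompt uses {n_tokens} tokens")
print(f"Fits in the context window: {n_tokens <= CONTEXT_LIMIT}")
print(f"Estimated input cost: ${n_tokens / 1000 * PRICE_PER_1K:.6f}")
```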

Grok Tokenizer

You can open the Tokenizer tool in Grok:
(screenshot: the Tokenizer entry in Grok's interface)

It gives you a rough idea of what the tokens look like. IMHO, they are mostly similar to whole words!
(screenshot: sample text split into tokens by Grok's Tokenizer)

Steem to the Moon🚀!
