Understanding and Implementing Tokenizers: A comprehensive guide with code

Welcome, language enthusiasts and curious coders! 👋 Are you ready to embark on a fascinating journey into the world of Natural Language Processing (NLP)? Buckle up, because we’re about to dive deep into tokenization—a fundamental building block of NLP models!

This blog post was inspired by Andrej Karpathy’s excellent video “Let’s build the GPT Tokenizer”, which I highly recommend if you want a deeper, hands-on exploration of tokenization.

The Quest for Understanding: NLP’s First Frontier

Imagine trying to teach a computer to understand Shakespeare or to interpret the latest tweet from your favorite influencer. Sounds challenging, right? That’s exactly what NLP aims to do—and tokenization is our trusty sidekick on this quest. It’s the crucial first step that bridges the chaotic, nuanced world of human language and the structured, numerical format that machines understand.

In this blog, we’ll guide you through the diverse landscape of tokenization. From the simplest method of splitting text by whitespace to the more advanced techniques like subword tokenization used by models such as GPT and BERT, we’ll not only cover the theory but also provide practical code examples. Together, we’ll implement various tokenizers from scratch!

Your Tokenization Toolkit: A GitHub Repository of Code

This blog post comes with a full set of code implementations to help you follow along. I’ve prepared a GitHub repository with all the tokenizers we’ll discuss.

Repository Name: Tokenizer-Tutorial

URL: https://github.com/verdugo-danieML/Tokenizer-Tutorial

Here’s how to get started:

  1. Clone the repository to your local machine:
git clone https://github.com/verdugo-danieML/Tokenizer-Tutorial.git
  2. Install the required dependencies:
pip install -r requirements.txt
  3. Navigate to the examples directory, where each tokenizer has its own Python script.

You can find detailed usage instructions in the repository’s README.md.

Table of Contents

  1. Introduction to Tokenization
  2. Whitespace Tokenizer
  3. Regex Tokenizer
  4. Byte Pair Encoding (BPE) Tokenizer
  5. Custom Hugging Face Tokenizer
  6. Custom SentencePiece Tokenizer
  7. Conclusion

Let’s dive in and start our exploration of the fascinating world of tokenization!

Introduction: The Art and Science of Tokenization

Imagine you’re a linguist, tasked with deciphering an ancient script. You begin by breaking the continuous stream of symbols into meaningful units—words, phrases, or even individual characters. This is tokenization in a nutshell. In NLP, tokenization serves a similar role: it acts as the bridge between raw, unprocessed text and the structured data that machines can understand.

What is Tokenization? 😳

At its core, tokenization is the process of converting a sequence of characters into meaningful chunks, known as tokens. These tokens can be words, subwords, or even individual characters depending on the method chosen. Tokenization is the foundation upon which all higher-level NLP tasks are built, from sentiment analysis to machine translation.

A Brief History of Tokenization ☝️🤓

Tokenization isn’t a new concept. It dates back to the early days of computer science and information retrieval in the 1950s and 60s. As researchers began developing systems to process text using computers, they quickly realized the need to break text into smaller, manageable units for analysis.

  • ❗ Fun Fact #1: The term “token” in computer science was originally inspired by mechanical calculators, where physical tokens represented numbers or operations.

Early tokenization methods were rudimentary—often just splitting text by spaces or punctuation. However, as NLP techniques advanced, so did tokenization.

Why is Tokenization Important anyway? 😒

  1. Bridge between humans and machines: Tokenization converts human-readable text into a format that computers can process efficiently.

  2. Vocabulary management: It helps create a finite vocabulary from an infinite set of possible word combinations.

  3. Handling out-of-vocabulary (OOV) words: Advanced tokenization methods, such as subword tokenization, can handle words not seen during training.

  4. Language agnosticism: Certain tokenization methods work across multiple languages without needing custom rules.

  5. Improved model performance: Effective tokenization can drastically improve the performance of NLP models.

    • ❗ Fun Fact #2: The choice of tokenization method can have a huge impact on your model’s vocabulary size. For example, character-level tokenization of English text would yield a vocabulary of around 26 letters (plus punctuation), whereas word-level tokenization could result in a vocabulary of hundreds of thousands of unique words!

The Tokenization Spectrum

Tokenization techniques span a spectrum, balancing vocabulary size with the granularity of the tokens:

  1. Character-level tokenization: Splits text into individual characters. It results in a small vocabulary but sacrifices word-level meaning.
  2. Word-level tokenization: Splits text into words, preserving word meaning but potentially resulting in a large vocabulary that struggles with out-of-vocabulary words.
  3. Subword tokenization: Strikes a balance by breaking words into smaller, meaningful subunits. Methods like Byte Pair Encoding (BPE) and WordPiece fall into this category.
  • ❗ Fun Fact #3: The word “tokenization” would be split differently by these three methods:

  • Character-level: ['t', 'o', 'k', 'e', 'n', 'i', 'z', 'a', 't', 'i', 'o', 'n']

  • Word-level: ['tokenization']

  • Subword-level (e.g., BPE): ['token', 'ization']
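
To make the spectrum concrete, here is a tiny Python sketch of the first two levels; the subword split in the comment assumes a trained BPE tokenizer (such as GPT-2’s) and is only illustrative.

text = "tokenization"

# Character-level: every character becomes a token
print(list(text))
# ['t', 'o', 'k', 'e', 'n', 'i', 'z', 'a', 't', 'i', 'o', 'n']

# Word-level: split on whitespace, so the whole word is one token
print("tokenization is fun".split())
# ['tokenization', 'is', 'fun']

# Subword-level needs a trained model; a BPE tokenizer would typically
# produce something like ['token', 'ization'] for this word.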

Tokenization Challenges

While tokenization seems straightforward, it presents some interesting challenges:

  1. Language specificity: What works for English may not work for Chinese, Arabic, or other languages that use different writing systems.
  2. Punctuation and special characters: Should “don’t” be one token or two?
  3. Contractions and compound words: Is “ice cream” one token or two?
  4. Preserving semantic meaning: Splitting “New York” into “New” and “York” could lead to losing the city’s meaning as a single entity.
  5. Handling rare or misspelled words: What do we do with a word that isn’t in the vocabulary or is misspelled?

The Impact of Tokenization on Modern NLP

Tokenization plays a critical role in the performance of modern language models like GPT-3 and BERT. These models rely heavily on subword tokenization strategies to efficiently represent vast amounts of text data while handling rare words and preventing vocabulary explosion.

  • ❗ Fun Fact #4: OpenAI’s GPT-3 uses a tokenizer with a vocabulary of 50,257 tokens, enabling it to represent a broad range of text in a compact, efficient format while keeping the model size manageable.
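
You can inspect this vocabulary yourself with OpenAI’s tiktoken package (assumed to be installed; it is not part of this tutorial’s requirements); a quick sketch:

import tiktoken

# r50k_base is the 50,257-token byte-level BPE encoding used by GPT-3 models
enc = tiktoken.get_encoding("r50k_base")

print(enc.n_vocab)            # 50257
ids = enc.encode("Tokenization bridges text and numbers.")
print(ids)                    # a short list of integer token IDs
print(enc.decode(ids))        # round-trips back to the original string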

As we explore various tokenization methods in this blog post, keep in mind that each technique offers a different solution to the core challenges of tokenization. From the simplicity of splitting on whitespace to the sophistication of learned subword tokenization, each method has a place in the NLP toolkit.

By the end of this journey, you’ll have a comprehensive understanding of tokenization and the practical skills to implement and apply these methods in your own NLP projects. Let’s get started!

Whitespace Tokenizer: The Simplest Tokenization Method

The Whitespace Tokenizer is the most straightforward method for breaking down text. It simply splits the text based on spaces. While it’s easy to implement and often sufficient for quick prototyping, it has some limitations that make it less ideal for more complex NLP tasks.

How Does Whitespace Tokenization Work? 🤔

Imagine we have a simple sentence:

"The quick brown fox jumps over the lazy dog!"

Using whitespace tokenization, the sentence would be split into tokens at each space:

tokens = ["The", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog!"]

However, notice how the punctuation ("!") remains attached to the last word ("dog!"). This can be problematic because punctuation often needs to be handled separately, especially in tasks like sentiment analysis, machine translation, or question answering. In this case, "dog!" is treated as a single token, potentially leading to suboptimal performance.

Code for Whitespace Tokenizer:

from typing import List

class WhitespaceTokenizer:
    def __init__(self):
        self.vocab = {}
        self.inverse_vocab = {}

    def fit(self, text: str):
        tokens = self.tokenize(text)
        self.vocab = {token: i for i, token in enumerate(set(tokens))}
        self.inverse_vocab = {i: token for token, i in self.vocab.items()}

    def tokenize(self, text: str) -> List[str]:
        return text.split()

    def encode(self, text: str) -> List[int]:
        tokens = self.tokenize(text)
        return [self.vocab.get(token, len(self.vocab)) for token in tokens]

    def decode(self, token_ids: List[int]) -> str:
        tokens = [self.inverse_vocab.get(id, '<unk>') for id in token_ids]
        return self.detokenize(tokens)

    def detokenize(self, tokens: List[str]) -> str:
        return " ".join(tokens)

Pros and Cons of Whitespace Tokenization:

Pros:

  • Simple to implement: Just split on spaces.
  • Fast: This method is computationally efficient, making it useful for quick prototyping or when working with well-structured text.
  • Language-agnostic: Works well for languages with space-separated words (like English, Spanish, etc.).

Cons:

  • Poor punctuation handling: Words like “hello!” will be tokenized as a single token, potentially causing issues in downstream NLP tasks where punctuation plays a role.
  • Larger vocabularies: Since punctuation remains attached to words, this can lead to larger and more complex vocabularies (e.g., “hello” and “hello!” are treated as separate tokens).
  • Not suitable for all languages: Languages like Chinese, Japanese, or Thai that don’t use spaces between words cannot be tokenized using this method.

Example:

text = "The quick brown fox jumps over the lazy dog!"
tokenizer = WhitespaceTokenizer()
tokenizer.fit(text)
tokens = tokenizer.tokenize(text)
encoded = tokenizer.encode(text)
decoded = tokenizer.decode(encoded)

print("Original:", text)
print("Tokenized:", tokens)
print("Encoded:", encoded)
print("Decoded:", decoded)

Output:

Original: The quick brown fox jumps over the lazy dog!
Tokenized: ['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog!']
Encoded: [1, 7, 2, 5, 8, 4, 6, 0, 3]
Decoded: The quick brown fox jumps over the lazy dog!

As you can see, the tokenizer splits on spaces, but “dog!” remains a single token, including punctuation. This might not be a problem for very simple tasks, but for more nuanced NLP applications, separating punctuation can be essential.

When to Use Whitespace Tokenization:

  • Prototyping: If you’re working on a quick proof of concept or a task where punctuation isn’t critical, whitespace tokenization might be enough.
  • Well-Formatted Data: If your text is already clean and punctuation isn’t an issue (e.g., programming code), whitespace tokenization works well.
  • Educational Purposes: It’s a great starting point for understanding more complex tokenization methods.

However, for more complex tasks or languages, you’ll likely need a more advanced tokenizer that can handle punctuation properly.

Regex Tokenizer: Unleashing the Power of Pattern Matching 🕵️‍♂️🔍

While the Whitespace Tokenizer splits text only by spaces, the Regex Tokenizer is much more flexible. By using regular expressions, we can define custom patterns for tokenizing text, allowing us to handle punctuation, contractions, and more complex linguistic structures.

Why Use a Regex Tokenizer?

Regular expressions (regex) provide a powerful way to define specific patterns in text. With a Regex Tokenizer, you can go beyond simply splitting text at spaces and handle more sophisticated cases, such as:

  • Splitting text based on word boundaries, while also separating punctuation.
  • Handling contractions like “don’t” (splitting it into “don” and “’t”).
  • Dealing with symbols, numbers, or special characters based on custom rules.

Imagine you have this sentence:

"Hello, world! It's 2024."

Using a well-crafted regex, we can split this into:

["Hello", ",", "world", "!", "It", "'s", "2024", "."]

Notice how punctuation is now separated into its own tokens, unlike with the Whitespace Tokenizer.

Code for Regex Tokenizer:

Here’s an implementation of a simple Regex Tokenizer:

import regex
from typing import List, Dict
from src.utils import save_vocab, load_vocab

class RegexTokenizer:
    PATTERNS: Dict[str, str] = {
        'basic': r'\b\w+\b|\S',
        'gpt2': r"""'(?:[sdmt]|ll|ve|re)| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+""",
        'gpt4': r"""'(?i:[sdmt]|ll|ve|re)|[^\r\n\p{L}\p{N}]?+\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]++[\r\n]*|\s*[\r\n]|\s+(?!\S)|\s+""",
        'improved': r"""'(?i:[sdmt]|ll|ve|re)|[^\r\n\p{L}\p{N}]?+\p{L}+|\p{N}+(?:[.,]\p{N}+)?| ?[^\s\p{L}\p{N}]++[\r\n]*|\s*[\r\n]|\s+(?!\S)|\s+"""
    }

    def __init__(self, pattern: str = 'basic'):
        if pattern in self.PATTERNS:
            self.pattern = regex.compile(self.PATTERNS[pattern])
        else:
            self.pattern = regex.compile(pattern)
        self.vocab = {}
        self.inverse_vocab = {}

    def fit(self, text: str):
        """Build vocabulary from the given text."""
        tokens = self.tokenize(text)
        # Create a set of unique tokens, preserving leading spaces
        unique_tokens = set(tokens)
        # Create vocabulary, assigning each unique token an index
        self.vocab = {token: i for i, token in enumerate(unique_tokens)}
        self.inverse_vocab = {i: token for token, i in self.vocab.items()}

    def tokenize(self, text: str) -> List[str]:
        """Tokenize the input text using the specified regex pattern."""
        return self.pattern.findall(text)

    # encode, decode, and detokenize methods are similar to WhitespaceTokenizer
   

The default 'basic' pattern, \b\w+\b|\S, splits out words while also ensuring that punctuation and symbols are treated as separate tokens.

How It Works:

  • \b\w+\b: Matches any word (sequence of alphanumeric characters) based on word boundaries.
  • \S: Matches any non-whitespace character, which includes punctuation like commas, periods, and exclamation points.
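
To see the 'basic' pattern on its own, you can run it directly with the regex module (the same package the class uses):

import regex

# Words are matched via word boundaries; everything else is captured one symbol at a time
basic = regex.compile(r'\b\w+\b|\S')
print(basic.findall("The quick brown fox jumps over the lazy dog!"))
# ['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog', '!']

Compare this with the whitespace tokenizer’s output earlier: the exclamation mark is now its own token instead of being glued to “dog”.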

Example of Regex Tokenizer in Action:

from src.regex_tokenizer import RegexTokenizer

tokenizer = RegexTokenizer(pattern='gpt2')
text = "Don't you love tokenizers? They're amazing!"

tokenizer.fit(text)
tokens = tokenizer.tokenize(text)
encoded = tokenizer.encode(text)
decoded = tokenizer.decode(encoded)

print("Original:", text)
print("Tokenized:", tokens)
print("Encoded:", encoded)
print("Decoded:", decoded)

Output:

Original: Don't you love tokenizers? They're amazing!
Tokenized: ['Don', "'t", ' you', ' love', ' tokenizers', '?', ' They', "'re", ' amazing', '!']
Encoded: [9, 4, 5, 0, 8, 2, 3, 1, 7, 6]
Decoded: Don't you love tokenizers? They're amazing!

As you can see, the Regex Tokenizer handles punctuation correctly by separating it into individual tokens. It also splits contractions, turning “Don’t” into “Don” and “'t”, and “They’re” into “ They” and “'re” (note the preserved leading space), which is crucial for many NLP tasks, including machine translation and sentiment analysis.

Pros and Cons of Regex Tokenization:

Pros:

  • Flexible: You can define custom patterns to handle complex text structures, such as separating punctuation, numbers, or contractions.
  • Precise: Fine-tuned regex patterns can accurately capture specific linguistic features, like email addresses, phone numbers, or URLs.
  • Language-agnostic: Can be adapted to different languages with customized patterns.

Cons:

  • Complexity: Writing effective regular expressions can be tricky, especially for handling edge cases or non-standard text formats.
  • Performance: Processing very large datasets with complex regex patterns can be slower than simpler tokenization methods.

When to Use a Regex Tokenizer:

  • Punctuation Matters: If punctuation is critical to your task (such as in sentiment analysis, where punctuation can influence sentiment), regex tokenization is a good choice.
  • Handling Complex Text: Regex allows you to customize tokenization for specific linguistic features, like contractions, symbols, or special characters.
  • Processing Noisy Text: Regex is useful when working with text that includes numbers, dates, email addresses, or social media text, where standard tokenizers may struggle.

Note: Despite its flexibility, regex tokenization can be overkill for simpler tasks. For basic tokenization, you might not need such fine-grained control.

Example Use Case:

Consider the following scenario: you’re working on a task that involves processing product reviews. Customers might use abbreviations, contractions, and punctuation in various ways. A Whitespace Tokenizer would treat “great!” and “great” as two unrelated tokens, while a Regex Tokenizer separates the punctuation from the word, making it much more effective for downstream sentiment analysis tasks.

For example:

"I love this product!!! It's amazing!"

A Regex Tokenizer can split this into:

['I', 'love', 'this', 'product', '!', '!', '!', 'It', "'s", 'amazing', '!']

With punctuation separated from words, sentiment analysis models can weigh the exclamations more heavily, which can improve the accuracy of detecting sentiment.

Byte Pair Encoding (BPE) Tokenizer: A Clever Subword Tokenization Method 🧠🔤

Now that we’ve covered basic tokenization methods like whitespace and regex, it’s time to dive deeper into one of the most powerful and widely used techniques: Byte Pair Encoding (BPE). BPE strikes a balance between character-level and word-level tokenization, making it ideal for handling rare words and out-of-vocabulary (OOV) tokens in modern NLP models.

How Does BPE Work? 🤔

BPE is a subword tokenization technique that starts with a vocabulary of individual characters and iteratively merges the most frequent pairs of characters or subwords to form longer tokens. The goal is to achieve a compact vocabulary that captures both common words and parts of rare or unseen words, making it highly efficient for large text corpora.

Here’s the high-level process:

  1. Initial Vocabulary: Begin with a vocabulary of single characters.
  2. Identify Frequent Pairs: Find the most frequent adjacent character or subword pairs in the text.
  3. Merge Pairs: Replace these pairs with a new, merged token.
  4. Repeat: Continue the process until the desired vocabulary size is reached.

By merging subwords, BPE creates tokens that are large enough to capture common words, but small enough to handle rare or OOV words by breaking them into meaningful subword units.

This is why BPE (or a close variant) is widely used in modern transformer models: the GPT family uses BPE directly, while BERT relies on the closely related WordPiece method. Merging subwords lets a model represent rare words efficiently without blowing up the vocabulary size.
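
Before we look at the full class, here is a minimal, self-contained sketch of the merge loop on a made-up toy corpus (the word frequencies are invented purely for illustration):

from collections import Counter

# Each word is pre-split into characters plus an end-of-word marker, with a toy frequency
word_freqs = {
    "l o w </w>": 5,
    "l o w e r </w>": 2,
    "n e w e s t </w>": 6,
    "w i d e s t </w>": 3,
}

def count_pairs(word_freqs):
    """Count how often each adjacent pair of symbols occurs, weighted by word frequency."""
    pairs = Counter()
    for word, freq in word_freqs.items():
        symbols = word.split()
        for pair in zip(symbols, symbols[1:]):
            pairs[pair] += freq
    return pairs

def merge_pair(word, a, b):
    """Merge every occurrence of the symbol pair (a, b) inside one pre-split word."""
    symbols, out, i = word.split(), [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
            out.append(a + b)
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return " ".join(out)

for step in range(3):  # a few merge steps are enough to see the idea
    (a, b), freq = count_pairs(word_freqs).most_common(1)[0]
    word_freqs = {merge_pair(w, a, b): f for w, f in word_freqs.items()}
    print(f"merge {step + 1}: {(a, b)} -> {a + b!r} (frequency {freq})")

On this toy corpus the first merges typically produce pieces such as 'es', 'est', and 'est</w>', exactly the kind of frequent subwords the full tokenizer below learns from a real corpus.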

Code for BPE Tokenizer

Here’s an implementation of the BPE tokenizer:

from typing import List, Dict, Tuple
from collections import defaultdict
import re
from src.utils import read_corpus, save_vocab, load_vocab
import json

class BPETokenizer:
    def __init__(self, vocab_size: int = 1000):
        self.vocab_size = vocab_size
        self.vocab = {"<unk>": 0, "<s>": 1, "</s>": 2}
        self.inverse_vocab = {0: "<unk>", 1: "<s>", 2: "</s>"}
        self.merges = {}

    def train(self, corpus_dir: str):
        corpus = read_corpus(corpus_dir)
        word_freqs = defaultdict(int)
        for word in corpus.lower().split():  # lowercase so training matches tokenize()
            word = ' '.join(word) + ' </w>'
            word_freqs[word] += 1

        for word in word_freqs:
            for char in word.split():
                if char not in self.vocab:
                    self.vocab[char] = len(self.vocab)
                    self.inverse_vocab[self.vocab[char]] = char

        num_merges = self.vocab_size - len(self.vocab)
        for i in range(num_merges):
            pairs = self._get_stats(word_freqs)
            if not pairs:
                break
            best = max(pairs, key=pairs.get)
            self._merge_vocab(best, word_freqs)
            if len(self.vocab) >= self.vocab_size:
                break

    def _get_stats(self, word_freqs: Dict[str, int]) -> Dict[Tuple[str, str], int]:
        pairs = defaultdict(int)
        for word, freq in word_freqs.items():
            symbols = word.split()
            for i in range(len(symbols) - 1):
                pairs[symbols[i], symbols[i + 1]] += freq
        return pairs

    def _merge_vocab(self, pair: Tuple[str, str], word_freqs: Dict[str, int]):
        bigram = ' '.join(pair)
        replacement = ''.join(pair)
        self.merges[bigram] = replacement
        self.vocab[replacement] = len(self.vocab)
        self.inverse_vocab[self.vocab[replacement]] = replacement
        
        # Replace the pair only where it appears as two whole symbols,
        # not inside a larger symbol (e.g. don't merge 's t' inside 'es t')
        pattern = re.compile(r'(?<!\S)' + re.escape(bigram) + r'(?!\S)')
        new_word_freqs = {}
        for word, freq in word_freqs.items():
            new_word = pattern.sub(replacement, word)
            new_word_freqs[new_word] = freq
        word_freqs.clear()
        word_freqs.update(new_word_freqs)

    def tokenize(self, text: str) -> List[str]:
        words = text.lower().split()
        tokens = []
        for word in words:
            word = ' '.join(word) + ' </w>'
            while True:
                subwords = word.split()
                if len(subwords) == 1:
                    break
                i = 0
                while i < len(subwords) - 1:
                    bigram = ' '.join(subwords[i:i+2])
                    if bigram in self.merges:
                        subwords[i] = self.merges[bigram]
                        del subwords[i+1]
                    else:
                        i += 1
                new_word = ' '.join(subwords)
                if new_word == word:
                    break
                word = new_word
            tokens.extend(word.split())
        return tokens

    # encode, decode methods follow...
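
The actual encode and decode methods live in the repository; here is a minimal sketch of what they could look like, assuming the vocab, inverse_vocab, and <unk> entry defined above (simplified, so the repository version may differ):

    def encode(self, text: str) -> List[int]:
        tokens = self.tokenize(text)
        # Unknown subwords fall back to the <unk> id
        return [self.vocab.get(token, self.vocab["<unk>"]) for token in tokens]

    def decode(self, token_ids: List[int]) -> str:
        tokens = [self.inverse_vocab.get(i, "<unk>") for i in token_ids]
        # Glue subwords back together and turn end-of-word markers into spaces
        return "".join(tokens).replace("</w>", " ").strip()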

Analyzing BPE Tokenization Output

Let’s walk through a practical example to illustrate how BPE tokenization works in action.

Input Text:

"The quick brown fox jumps over the lazy dog"

BPE Tokenization Process:

  1. Initial Vocabulary: At first, each word is split into individual characters, with an end-of-word marker (</w>) appended:

    ['t', 'h', 'e', '</w>', 'q', 'u', 'i', 'c', 'k', '</w>', 'b', 'r', 'o', 'w', 'n', '</w>', 'f', 'o', 'x', '</w>', ...]
    
  2. Merging Frequent Pairs: The most frequent pairs of characters are merged into subwords. For example, frequent combinations like “Th”, “qu”, “ick”, and “fox” may become new subword tokens.

  3. Final Tokens:

    ['the</w>', 'qui', 'ck', '</w>', 'bro', 'w', 'n</w>', 'f', 'o', 'x</w>', 'ju', 'm', 'ps</w>', 'ov', 'er</w>', 'the</w>', 'la', 'z', 'y</w>', 'dog', '</w>']
    

Example Output:

text = "The quick brown fox jumps over the lazy dog"
tokenizer = BPETokenizer(vocab_size=1000)
tokenizer.train("data/corpus/")  # train() expects a directory of text files, not a raw string
tokens = tokenizer.tokenize(text)

print("Tokenized:", tokens)

Output:

Tokenized: ['the</w>', 'qui', 'ck', '</w>', 'bro', 'w', 'n</w>', 'f', 'o', 'x</w>', 'ju', 'm', 'ps</w>', 'ov', 'er</w>', 'the</w>', 'la', 'z', 'y</w>', 'dog', '</w>']

Discussion of Output:

In this example, you can see how BPE gradually merges frequent character pairs to form subwords like “qui”, “ck”, and “the”. The special </w> token marks the end of a word, allowing BPE to preserve word boundaries. This is particularly useful for handling rare or unseen words, as BPE can break them down into smaller, meaningful subword units.

For instance, if the word “lazydog” appeared later in the text but wasn’t part of the original training data, BPE could tokenize it as:

['la', 'z', 'y', 'dog', '</w>']

This capability allows BPE to handle words it hasn’t seen before without resorting to an unknown token (<unk>), which is crucial for dealing with rare words in NLP tasks like machine translation and language modeling.

Why BPE is Important for Modern NLP Models

BPE is highly effective because it provides a flexible, scalable way to handle vast amounts of text data. It’s the method of choice in many state-of-the-art models such as GPT-2, GPT-3, and RoBERTa (BERT uses the closely related WordPiece algorithm). By breaking down words into subwords, BPE helps these models:

  1. Handle Rare Words: Words that were not seen during training can still be broken into subwords, allowing the model to generalize better.
  2. Reduce Vocabulary Size: Instead of having a vocabulary with hundreds of thousands of unique words, BPE allows models to represent language with a much smaller, more efficient vocabulary of subword units.
  3. Improve Performance: The smaller, compressed vocabulary size improves both memory usage and model training time, while still preserving the ability to capture the meaning of words.

Pros and Cons of BPE:

Pros:

  • Efficient Vocabulary: BPE reduces the vocabulary size while still being able to represent rare or unseen words.
  • Generalization: By breaking words into subword units, BPE allows the model to generalize to new words that weren’t seen during training.
  • Widely Used: BPE is a standard in most large-scale transformer models such as the GPT family, making it a reliable choice for NLP applications.

Cons:

  • Training Overhead: BPE requires training on a corpus, which can be time-consuming.
  • Complex Tokenization: BPE can result in tokens that are not intuitive to read or understand (e.g., breaking words into subwords like “qu” and “ick”).

By understanding how BPE works and why it’s so widely used, you can better appreciate its role in modern NLP. Whether you’re dealing with rare words, out-of-vocabulary terms, or just aiming for a more efficient model, BPE is a powerful tool in your NLP toolkit.

Custom Hugging Face Tokenizer: A Glimpse into Modern NLP Libraries 🤗

As we move further into modern NLP, we encounter advanced tokenizers like the Hugging Face Tokenizer. Hugging Face’s tokenizers library provides a high-performance, flexible, and easy-to-use framework for handling tokenization in large language models like BERT, GPT-2, RoBERTa, and more.

What makes Hugging Face’s tokenizer unique is its ability to efficiently handle subword tokenization, vocabulary management, and encoding, while also being highly optimized for speed. By using Rust at its core, Hugging Face’s tokenizers library offers fast and memory-efficient tokenization compared to standard Python implementations. It’s designed for both training new tokenizers and using pre-trained ones, making it a go-to solution for modern NLP tasks.

Why Hugging Face Tokenizers?

While we’ve already explored methods like whitespace tokenization and BPE, Hugging Face tokenizers come with a range of pre-trained tokenizers that have been optimized for different models and use cases. Here are some reasons why Hugging Face tokenizers are so powerful:

  1. Subword Tokenization: Like BPE and WordPiece, Hugging Face tokenizers split words into smaller subword units. This makes them highly effective for handling rare words and unknown tokens.
  2. Fast and Memory Efficient: The tokenizers are implemented in Rust, allowing for faster execution and lower memory consumption.
  3. Pre-trained Tokenizers: Hugging Face provides pre-trained tokenizers for widely used models like BERT, GPT-2, RoBERTa, and more, saving time when you need a ready-to-use solution.
  4. Customizable: You can build and train your own tokenizers from scratch, making it adaptable for specific datasets or languages.

Let’s explore how to build a custom Hugging Face tokenizer using Byte-Level BPE, one of the popular subword tokenization methods.

Code for Custom Hugging Face Tokenizer

Here’s a simplified version of how you can create and train a custom Hugging Face tokenizer:

from tokenizers import Tokenizer, models, pre_tokenizers, processors, decoders, trainers
from src.utils import read_corpus

class CustomHFTokenizer:
    def __init__(self, vocab_size=25000):
        self.tokenizer = Tokenizer(models.BPE())
        self.tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
        self.trainer = trainers.BpeTrainer(
            vocab_size=vocab_size,
            special_tokens=["<|endoftext|>"],
            show_progress=True
        )
        self.tokenizer.post_processor = processors.ByteLevel(trim_offsets=False)
        self.tokenizer.decoder = decoders.ByteLevel()

    def train(self, corpus_dir):
        corpus = read_corpus(corpus_dir)
        self.tokenizer.train_from_iterator([corpus], trainer=self.trainer)

    def save(self, path):
        self.tokenizer.save(path)

    def load(self, path):
        self.tokenizer = Tokenizer.from_file(path)

    def encode(self, text):
        return self.tokenizer.encode(text).ids

    def decode(self, ids):
        return self.tokenizer.decode(ids)

    def tokenize(self, text):
        return self.tokenizer.encode(text).tokens

How It Works

  • models.BPE(): We use BPE as the core model. Combined with the byte-level components below, it operates on raw bytes, meaning it can tokenize any input string, including special characters and punctuation.
  • Pre-tokenization: The ByteLevel pre-tokenizer maps every byte to a printable symbol and marks leading spaces with the Ġ character, so the tokenizer can reconstruct the exact structure of the original text (a short sketch after this list shows it in isolation).
  • Trainer: The BpeTrainer manages the training process. We can set parameters like vocabulary size and special tokens (e.g., <|endoftext|>).
  • Post-processing: After encoding, the ByteLevel processor ensures that offsets and boundaries of tokens are preserved.
  • Decoder: The ByteLevel decoder ensures that when we decode token IDs back into text, we get the original string back, including spaces and punctuation.
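
To see the byte-level behaviour on its own, you can call the pre-tokenizer directly; a small sketch (offsets returned alongside each piece are dropped here):

from tokenizers import pre_tokenizers

pre = pre_tokenizers.ByteLevel(add_prefix_space=False)
pieces = pre.pre_tokenize_str("Hello, world! It's 2024.")
print([piece for piece, offsets in pieces])
# Something like ['Hello', ',', 'Ġworld', '!', 'ĠIt', "'s", 'Ġ2024', '.'], where Ġ marks a leading space

This is the same Ġ marker you will see in the tokenized outputs below.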

Example: Training and Tokenizing with Hugging Face Tokenizer

Let’s see how the Hugging Face tokenizer works in action.

from transformers import BertTokenizer, GPT2Tokenizer, RobertaTokenizer, PreTrainedTokenizerFast
from src.custom_hf_tokenizer import CustomHFTokenizer
from src.utils import read_corpus

def demonstrate_tokenizer(tokenizer, text):
    print(f"\nDemonstrating {tokenizer.__class__.__name__} functionality:")
    print("Input text:", text)

    tokens = tokenizer.tokenize(text)
    ids = tokenizer.encode(text)

    print("Tokenized:", tokens)
    print("Encoded:", ids)

    decoded = tokenizer.decode(ids)
    print("Decoded:", decoded)

def main():
    corpus_dir = "data/corpus/"

    # Example text
    example_text = "This is an example of using custom tokenizers. It can handle punctuation, numbers like 3246, and contractions (e.g., don't, we'll)."

    # Custom HF Tokenizer
    custom_tokenizer = CustomHFTokenizer(vocab_size=25000)
    custom_tokenizer.train(corpus_dir)
    custom_tokenizer.save("custom_tokenizer.json")
    demonstrate_tokenizer(custom_tokenizer, example_text)

    # BERT Tokenizer
    bert_tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
    demonstrate_tokenizer(bert_tokenizer, example_text)

    # GPT-2 Tokenizer
    gpt2_tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
    demonstrate_tokenizer(gpt2_tokenizer, example_text)

    # RoBERTa Tokenizer
    roberta_tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
    demonstrate_tokenizer(roberta_tokenizer, example_text)

if __name__ == "__main__":
    main()

Example Output:

Demonstrating CustomHFTokenizer functionality:
Input text: This is an example of using custom tokenizers. It can handle punctuation, numbers like 3246, and contractions (e.g., don't, we'll).
Tokenized: ['This', 'Ġis', 'Ġan', 'Ġexample', 'Ġof', 'Ġusing', 'Ġcust', 'om', 'Ġtoken', 'iz', 'ers', '.', 'ĠIt', 'Ġcan', 'Ġhandle', 'Ġpunctuation', ',', 'Ġnumbers', 'Ġlike', 'Ġ3', '24', '6', ',', 'Ġand', 'Ġcontractions', 'Ġ(', 'e', '.', 'g', '.,', 'Ġdon', "'t", ',', 'Ġwe', "'ll", ').']
Encoded: [1074, 328, 198, 15353, 133, 2496, 14917, 146, 10075, 723, 373, 11, 427, 483, 6877, 16933, 9, 4476, 390, 2115, 14426, 19, 9, 130, 16504, 1140, 59, 11, 61, 14420, 478, 14417, 9, 195, 14418, 14419]
Decoded: This is an example of using custom tokenizers. It can handle punctuation, numbers like 3246, and contractions (e.g., don't, we'll).

Demonstrating BertTokenizer functionality:
Input text: This is an example of using custom tokenizers. It can handle punctuation, numbers like 3246, and contractions (e.g., don't, we'll).
Tokenized: ['this', 'is', 'an', 'example', 'of', 'using', 'custom', 'token', '##izer', '##s', '.', 'it', 'can', 'handle', 'pun', '##ct', '##uation', ',', 'numbers', 'like', '324', 
'##6', ',', 'and', 'contraction', '##s', '(', 'e', '.', 'g', '.', ',', 'don', "'", 't', ',', 'we', "'", 'll', ')', '.']
Encoded: [101, 2023, 2003, 2019, 2742, 1997, 2478, 7661, 19204, 17629, 2015, 1012, 2009, 2064, 5047, 26136, 6593, 14505, 1010, 3616, 2066, 27234, 2575, 1010, 1998, 21963, 2015, 1006, 1041, 1012, 1043, 1012, 1010, 2123, 1005, 1056, 1010, 2057, 1005, 2222, 1007, 1012, 102]
Decoded: [CLS] this is an example of using custom tokenizers. it can handle punctuation, numbers like 3246, and contractions ( e. g., don't, we'll ). [SEP]

Demonstrating GPT2Tokenizer functionality:
Input text: This is an example of using custom tokenizers. It can handle punctuation, numbers like 3246, and contractions (e.g., don't, we'll).
Tokenized: ['This', 'Ġis', 'Ġan', 'Ġexample', 'Ġof', 'Ġusing', 'Ġcustom', 'Ġtoken', 'izers', '.', 'ĠIt', 'Ġcan', 'Ġhandle', 'Ġpunct', 'uation', ',', 'Ġnumbers', 'Ġlike', 'Ġ3', '246', ',', 'Ġand', 'Ġcontract', 'ions', 'Ġ(', 'e', '.', 'g', '.,', 'Ġdon', "'t", ',', 'Ġwe', "'ll", ').']
Encoded: [1212, 318, 281, 1672, 286, 1262, 2183, 11241, 11341, 13, 632, 460, 5412, 21025, 2288, 11, 3146, 588, 513, 26912, 11, 290, 2775, 507, 357, 68, 13, 70, 1539, 836, 470, 11, 
356, 1183, 737]
Decoded: This is an example of using custom tokenizers. It can handle punctuation, numbers like 3246, and contractions (e.g., don't, we'll).

Demonstrating RobertaTokenizer functionality:
Input text: This is an example of using custom tokenizers. It can handle punctuation, numbers like 3246, and contractions (e.g., don't, we'll).
Tokenized: ['This', 'Ġis', 'Ġan', 'Ġexample', 'Ġof', 'Ġusing', 'Ġcustom', 'Ġtoken', 'izers', '.', 'ĠIt', 'Ġcan', 'Ġhandle', 'Ġpunct', 'uation', ',', 'Ġnumbers', 'Ġlike', 'Ġ3', '246', ',', 'Ġand', 'Ġcontract', 'ions', 'Ġ(', 'e', '.', 'g', '.,', 'Ġdon', "'t", ',', 'Ġwe', "'ll", ').']
Encoded: [0, 713, 16, 41, 1246, 9, 634, 6777, 19233, 11574, 4, 85, 64, 3679, 15760, 9762, 6, 1530, 101, 155, 30676, 6, 8, 1355, 2485, 36, 242, 4, 571, 482, 218, 75, 6, 52, 581, 322, 2]
Decoded: <s>This is an example of using custom tokenizers. It can handle punctuation, numbers like 3246, and contractions (e.g., don't, we'll).</s>

Discussion of Output:

  1. Subword Splitting: Notice that common words like “This”, “is”, and “example” stay whole, while rarer words are broken into subwords: for the custom tokenizer, “custom” becomes “Ġcust” + “om” and “tokenizers” becomes “Ġtoken” + “iz” + “ers”, and punctuation such as commas and periods stays in separate tokens. This lets the tokenizer represent words that are rare or were never seen during training.
  2. Handling Spaces: The Ġ symbol marks a leading space, which lets the tokenizer distinguish a token that starts a new whitespace-separated word from a subword that continues the previous one. For instance, “This” at the start of the sentence has no marker, while “Ġis” is prefixed with Ġ, indicating that it follows a space.
  3. Efficient Encoding: The encode() method converts tokens into unique IDs that the model can process. These IDs are then decoded back into the original text using the decode() method. This ensures that tokenization and detokenization are both fast and accurate, maintaining the text’s original structure.
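
If you want to drop the custom tokenizer into the wider transformers ecosystem, you can wrap the saved JSON file with PreTrainedTokenizerFast (already imported in the script above); a minimal sketch, assuming custom_tokenizer.json was saved as shown earlier:

from transformers import PreTrainedTokenizerFast

# Wrap the tokenizers-library JSON file so it behaves like any pre-trained HF tokenizer
hf_tokenizer = PreTrainedTokenizerFast(
    tokenizer_file="custom_tokenizer.json",
    eos_token="<|endoftext|>",   # reuse the special token defined during training
)

ids = hf_tokenizer.encode("This is an example of using custom tokenizers.")
print(ids)
print(hf_tokenizer.decode(ids))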

Advantages of Hugging Face Tokenizers in NLP:

  1. Speed: Hugging Face tokenizers are implemented in Rust, making them significantly faster than Python-based tokenizers. This is especially crucial when working with large datasets or deploying models in production.
  2. Consistency: The ability to tokenize and detokenize text consistently is vital for tasks like translation, summarization, or any other generation task. The example shows how spaces, punctuation, and even rare words are handled consistently.
  3. Pre-trained Tokenizers: Hugging Face provides many pre-trained tokenizers (e.g., for BERT, GPT-2, RoBERTa), meaning you don’t always have to start from scratch. These tokenizers are ready to use out of the box for various tasks, saving time and computational resources.
  4. Multilingual Support: Hugging Face tokenizers can handle a wide range of languages, making them versatile for multilingual projects. Many tokenizers can process text from multiple languages with minimal configuration.

Pros and Cons of Hugging Face Tokenizers:

Pros:

  • Fast and Memory Efficient: Built using Rust, Hugging Face tokenizers offer high-speed performance and low memory usage.
  • Highly Customizable: You can easily build and train custom tokenizers tailored to specific tasks or datasets.
  • Pre-trained Options: Hugging Face offers pre-trained tokenizers for a wide variety of models, making it easy to integrate with common NLP frameworks.
  • Consistent Tokenization: Ensures that text is tokenized and detokenized consistently, preserving the integrity of the original text.

Cons:

  • Learning Curve: While Hugging Face tokenizers are powerful, there can be a learning curve for newcomers, especially when working with advanced features like post-processing and custom training.
  • Overhead for Simple Tasks: For simpler tasks, using Hugging Face tokenizers might be overkill. Basic tokenization methods like whitespace or regex tokenization could suffice in such cases.

Comparing Hugging Face Tokenizers to Other Methods

Compared to the Byte Pair Encoding (BPE) method we discussed earlier, Hugging Face tokenizers take subword tokenization to the next level with optimizations for both speed and memory. They allow you to handle large datasets and complex NLP tasks with ease, while also offering the flexibility to train new tokenizers from scratch or use pre-trained ones.

Whereas BPE alone focuses on vocabulary compression, Hugging Face tokenizers integrate additional features like pre-tokenization and post-processing, which make them ideal for complex NLP tasks such as machine translation, text generation, and summarization.

Wrapping Up: The Power of Hugging Face Tokenizers 🎁

Hugging Face tokenizers offer a powerful, flexible, and efficient way to handle text preprocessing in NLP tasks. Whether you’re building a custom tokenizer from scratch or using a pre-trained model, they provide the tools to handle tokenization at scale with precision and speed. This flexibility is essential for deploying real-world models that need to process massive amounts of text data quickly and accurately.

If you’re working on large language models, machine translation, or any application where efficient text processing is key, Hugging Face tokenizers should be a go-to tool in your NLP toolbox.

SentencePiece Tokenizer: The Polyglot’s Dream 🌍🗣️

As we continue exploring advanced tokenization methods, we arrive at SentencePiece—a highly versatile and language-agnostic tokenizer developed by Google. Unlike traditional tokenizers that rely on whitespace or predefined word boundaries, SentencePiece treats everything, including whitespace, as a token. This makes it incredibly effective for languages that don’t use spaces between words, such as Japanese or Chinese.

What sets SentencePiece apart from other tokenizers is its ability to work directly from raw, unsegmented text. It doesn’t require pre-tokenized input, which simplifies the preprocessing pipeline for many NLP tasks. Moreover, it offers two powerful subword tokenization algorithms: Byte Pair Encoding (BPE) and the Unigram Language Model (ULM).

How SentencePiece Handles Whitespace and Languages Without Spaces

In languages like English, spaces naturally separate words, making tokenization relatively straightforward. However, in languages like Japanese and Chinese, words are often written without spaces, presenting a unique challenge for tokenizers. SentencePiece addresses this by treating whitespace as a token rather than a delimiter. This approach allows it to handle text without predefined word boundaries, ensuring that even languages without spaces are tokenized effectively.

For example, in Japanese, the sentence:

こんにちは世界

will be tokenized in a way that correctly separates the words “こんにちは” (hello) and “世界” (world) without relying on spaces. This capability is one of the reasons SentencePiece has become so popular in multilingual models like Google’s mT5.

SentencePiece’s Two Tokenization Methods: BPE vs. Unigram Language Model (ULM)

SentencePiece offers two primary tokenization methods: Byte Pair Encoding (BPE) and the Unigram Language Model (ULM). Let’s briefly discuss each of these approaches.

  1. Byte Pair Encoding (BPE): As discussed earlier, BPE iteratively merges the most frequent pairs of characters or subwords to build a vocabulary of increasingly larger tokens. It is highly effective at managing vocabulary size and representing rare words by breaking them into subwords. In SentencePiece, BPE works just as we’ve already covered, compressing the vocabulary by merging the most frequent character pairs.

  2. Unigram Language Model (ULM): The Unigram Language Model is a probabilistic tokenization approach that works differently from BPE. Instead of starting with single characters and merging pairs, ULM starts with a large set of subword candidates and gradually removes less likely subwords based on their probabilities. The model selects the sequence of tokens that maximizes the probability of the input sentence, according to a unigram distribution.

Here’s how ULM works:

  • Initial Subword Set: ULM starts with a large set of possible subwords.
  • Subword Probabilities: Each subword is assigned a probability based on its frequency in the corpus.
  • Tokenization: The model tokenizes text by selecting the most probable subword sequence, rather than merging pairs as in BPE.
  • Vocabulary Pruning: Over time, the model prunes the vocabulary by eliminating subwords with lower probabilities, leading to a more efficient tokenization scheme.

Unigram is often preferred when you want a more flexible tokenization method that can better adapt to the structure of the language. It is particularly effective in situations where the data contains a mix of languages, scripts, or writing systems.
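
A practical perk of the Unigram model is that it can return several alternative segmentations of the same text, which is the basis of subword regularization (see the Kudo paper in Further Reading). A minimal sketch, assuming a unigram model file like the spm_model.model produced by the training code below:

import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.load("spm_model.model")  # a trained unigram model (assumed to exist)

# Deterministic: the single most probable segmentation
print(sp.encode_as_pieces("Hello, world!"))

# Sampled: each call may return a different, lower-probability segmentation
for _ in range(3):
    print(sp.sample_encode_as_pieces("Hello, world!", -1, 0.1))  # nbest_size=-1, alpha=0.1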

Code for SentencePiece Tokenizer

Here’s how you can build and train a SentencePiece tokenizer using the Unigram model. SentencePiece makes it easy to switch between BPE and Unigram by changing a single parameter.

import sentencepiece as spm
from src.utils import read_corpus
import os

class CustomSPTokenizer:
    def __init__(self, vocab_size=8000, model_type='unigram', character_coverage=0.9995, max_sentence_length=4192):
        self.vocab_size = vocab_size
        self.model_type = model_type
        self.character_coverage = character_coverage
        self.max_sentence_length = max_sentence_length
        self.sp = None
        self.model_prefix = 'spm_model'

    def train(self, corpus_dir):
        text = read_corpus(corpus_dir)
        with open('temp_corpus.txt', 'w', encoding='utf-8') as f:
            f.write(text)

        spm.SentencePieceTrainer.train(
            input='temp_corpus.txt',
            model_prefix=self.model_prefix,
            vocab_size=self.vocab_size,
            model_type=self.model_type,
            character_coverage=self.character_coverage,
            max_sentence_length=self.max_sentence_length,
            pad_id=0,
            unk_id=1,
            bos_id=2,
            eos_id=3,
            pad_piece='[PAD]',
            unk_piece='[UNK]',
            bos_piece='[BOS]',
            eos_piece='[EOS]'
        )

        self.sp = spm.SentencePieceProcessor()
        self.sp.load(f'{self.model_prefix}.model')

    def save(self, path):
        if self.sp:
            # Copy the trained model file to the specified path
            import shutil
            shutil.copy(f'{self.model_prefix}.model', path)
            print(f"Model saved to {path}")
        else:
            print("No model to save. Train the tokenizer first.")

    def load(self, path):
        self.sp = spm.SentencePieceProcessor()
        self.sp.load(path)

    def encode(self, text):
        return self.sp.encode_as_ids(text)

    def decode(self, ids):
        return self.sp.decode_ids(ids)

    def tokenize(self, text):
        return self.sp.encode_as_pieces(text)

How It Works

  • model_type='unigram': Here we specify that we want to use the Unigram Language Model (ULM) for tokenization. By changing this to 'bpe', you can switch back to Byte Pair Encoding (BPE).
  • Character Coverage: The character_coverage parameter ensures that rare characters are included in the model. For instance, setting it to 0.9995 ensures that 99.95% of characters in the corpus are covered by the vocabulary.
  • Special Tokens: We can define special tokens like [PAD], [UNK], [BOS] (beginning of sentence), and [EOS] (end of sentence) to handle specific text structures.
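
Once training has run, you can check that the special tokens landed on the IDs configured above; a small sketch, assuming the spm_model.model file produced by train():

import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.load("spm_model.model")

print(sp.get_piece_size())                   # total vocabulary size
print(sp.piece_to_id("[PAD]"))               # 0, as set via pad_id
print(sp.piece_to_id("[UNK]"))               # 1, as set via unk_id
print(sp.id_to_piece(2), sp.id_to_piece(3))  # '[BOS]' '[EOS]'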

Example: Training and Tokenizing with SentencePiece

Let’s train and test the SentencePiece Tokenizer with the Unigram Language Model on a sample text.

from src.custom_sp_tokenizer import CustomSPTokenizer

def demonstrate_tokenizer(tokenizer, text):
    print("\nDemonstrating SentencePiece Tokenizer functionality:")
    print("Input text:", text)

    tokens = tokenizer.tokenize(text)
    print("Tokenized:", tokens)

    encoded = tokenizer.encode(text)
    print("Encoded:", encoded)

    decoded = tokenizer.decode(encoded)
    print("Decoded:", decoded)

def main():
    corpus_dir = "data/corpus/sp/"
    model_path = "sp_vocab_en_jp.model"

    # Initialize and train the tokenizer
    tokenizer = CustomSPTokenizer(
        vocab_size=300,
        model_type='unigram',  # You can also try 'bpe'
        character_coverage=0.9995,
        max_sentence_length=4192
    )
    print("Training SentencePiece Tokenizer...")
    tokenizer.train(corpus_dir)
    print("SentencePiece Tokenizer trained.")

    # Save the trained model
    tokenizer.save(model_path)

    # Load the trained model
    new_tokenizer = CustomSPTokenizer()
    new_tokenizer.load(model_path)
    print(f"Model loaded from {model_path}")

    # Example text
    example_text = "Hello, world! こんにちは世界!"
    demonstrate_tokenizer(new_tokenizer, example_text)

if __name__ == "__main__":
    main()

Example Output:

Demonstrating SentencePiece Tokenizer functionality:
Input text: Hello, world! こんにちは世界!
Tokenized: ['▁H', 'e', 'l', 'l', 'o', ',', '▁world', '!', '▁', 'こ', 'ん', 'に', 'ち', 'は', '世界', '!']
Encoded: [92, 5, 15, 15, 33, 173, 96, 86, 4, 87, 166, 64, 165, 9, 62, 86]
Decoded: Hello, world! こんにちは世界!

Discussion of Output:

  1. Subword Tokenization: Notice how SentencePiece splits the Japanese phrase “こんにちは世界!” into the pieces ['こ', 'ん', 'に', 'ち', 'は', '世界', '!'] without needing any spaces, and even breaks “Hello” into ['▁H', 'e', 'l', 'l', 'o'] because of the small 300-token vocabulary. This allows the model to handle rare or unseen words by breaking them into meaningful subword units.
  2. Whitespace Handling: The underscore (▁) at the start of tokens represents a space. This ensures that the tokenizer respects word boundaries while still treating whitespace as a meaningful part of the text.
  3. Efficient Encoding: The encode() method converts tokens into IDs, which the model processes. These IDs are decoded back into text with the decode() method, preserving the original sentence structure.

When to Use SentencePiece

SentencePiece excels in several scenarios, making it highly versatile for NLP tasks:

  • Multilingual Models: SentencePiece is language-agnostic, making it ideal for multilingual tasks. Its ability to handle languages without spaces (like Japanese or Chinese) ensures that it can be applied to a wide range of NLP problems.
  • Handling Rare Words: SentencePiece is designed to handle rare words efficiently by breaking them down into subword units, reducing the reliance on an unknown token ([UNK]).
  • No Pre-tokenization Needed: Unlike other tokenizers, SentencePiece doesn’t require pre-tokenized input, allowing you to work directly with raw text, saving you an extra preprocessing step.

Pros and Cons of SentencePiece:

Pros:

  • Language-agnostic: Works across languages without any special modifications for different writing systems.
  • Handles Whitespace: By treating whitespace as a token, SentencePiece can handle both space-separated languages and languages without spaces.
  • Flexible: Offers both BPE and Unigram Language Model tokenization, providing flexibility based on your task’s needs.
  • Direct from Raw Text: Trains directly from raw text, eliminating the need for preprocessing steps like word segmentation.

Cons:

  • Training Overhead: Similar to BPE, SentencePiece requires training on a corpus, which can be time-consuming for large datasets.
  • Complexity: While it’s more powerful than basic tokenization methods, it may introduce more complexity than needed for simple tasks.

Wrapping Up: The Versatility of SentencePiece 🎁

SentencePiece is a highly effective tokenization method that handles a wide range of languages, scripts, and writing systems. Whether you’re building a multilingual model or working with text that lacks clear word boundaries, SentencePiece offers a solution that balances flexibility with efficiency. Its support for both BPE and Unigram tokenization makes it versatile enough to handle almost any tokenization need.

For tasks that involve handling rare words, multilingual data, or raw text, SentencePiece is an invaluable tool. It’s no wonder that models like mT5 and XLM-R leverage SentencePiece for their tokenization needs.

Conclusion and Further Reading

We’ve embarked on an exciting journey through the world of tokenization, from the simplicity of whitespace splitting to the complexity of SentencePiece. Each tokenizer we’ve explored represents a step in the evolution of text processing for NLP:

  1. Whitespace Tokenizer: The simplest approach, perfect for quick prototyping.
  2. Regex Tokenizer: Offering flexibility through pattern matching.
  3. Byte Pair Encoding (BPE): Introducing subword tokenization for handling unseen words.
  4. Hugging Face Tokenizer: Showcasing modern, library-based approaches.
  5. SentencePiece: Demonstrating language-agnostic, subword tokenization.

Each method has its strengths and use cases, highlighting the importance of choosing the right tokenizer for your specific task and data.

Remember, while understanding these concepts is crucial, leveraging established libraries like Hugging Face Tokenizers or SentencePiece is often the best approach for real-world applications. These libraries offer optimized implementations and additional features that can significantly improve your NLP projects.

Further Reading 📖

To deepen your understanding of tokenization and stay up-to-date with the latest developments, consider exploring these resources:

  1. BPE Original Paper: “Neural Machine Translation of Rare Words with Subword Units” by Sennrich et al. (2016). Link

    • This paper introduces BPE for NLP, a cornerstone of modern tokenization.
  2. SentencePiece Paper: “SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing” by Kudo and Richardson (2018). Link

    • Dive into the details of SentencePiece, a powerful multilingual tokenizer.
  3. Hugging Face Tokenizers Library: Official Documentation

    • Explore the features and capabilities of this popular tokenization library.
  4. “Subword Regularization” by Kudo (2018): Link

    • Learn about advanced techniques for improving subword tokenization.
  5. “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding” by Devlin et al. (2019): Link

    • While not strictly about tokenization, this paper discusses WordPiece, another important subword tokenization method.

Final Thoughts 💭

As we conclude our tokenization adventure, remember that the field of NLP is constantly evolving. New tokenization methods and improvements to existing ones are always on the horizon. The key is to understand the fundamental concepts we’ve covered here and stay curious about new developments.

Whether you’re building a chatbot, working on machine translation, or diving into sentiment analysis, effective tokenization is your first step towards success in NLP. So go forth, tokenize wisely, and may your models be ever accurate and your training times short!