Welcome, language enthusiasts and curious coders! Are you ready to embark on a fascinating journey into the world of Natural Language Processing (NLP)? Buckle up, because we're about to dive deep into tokenization, a fundamental building block of NLP models!
This blog post was inspired by Andrej Karpathy's excellent video "Let's build the GPT Tokenizer", which I highly recommend if you want a deeper, hands-on exploration of tokenization.
Imagine trying to teach a computer to understand Shakespeare or to interpret the latest tweet from your favorite influencer. Sounds challenging, right? That's exactly what NLP aims to do, and tokenization is our trusty sidekick on this quest. It's the crucial first step that bridges the chaotic, nuanced world of human language and the structured, numerical format that machines understand.
In this blog, we'll guide you through the diverse landscape of tokenization. From the simplest method of splitting text by whitespace to the more advanced techniques like subword tokenization used by models such as GPT and BERT, we'll not only cover the theory but also provide practical code examples. Together, we'll implement various tokenizers from scratch!
This blog post comes with a full set of code implementations to help you follow along. I've prepared a GitHub repository with all the tokenizers we'll discuss.
Repository Name: Tokenizer-Tutorial
URL: https://github.com/verdugo-danieML/Tokenizer-Tutorial
Here's how to get started:
git clone https://github.com/verdugo-danieML/Tokenizer-Tutorial.git
cd Tokenizer-Tutorial
pip install -r requirements.txt
You can find detailed usage instructions in the repository’s README.md.
Let's dive in and start our exploration of the fascinating world of tokenization!
Imagine you're a linguist, tasked with deciphering an ancient script. You begin by breaking the continuous stream of symbols into meaningful units: words, phrases, or even individual characters. This is tokenization in a nutshell. In NLP, tokenization serves a similar role: it acts as the bridge between raw, unprocessed text and the structured data that machines can understand.
At its core, tokenization is the process of converting a sequence of characters into meaningful chunks, known as tokens. These tokens can be words, subwords, or even individual characters depending on the method chosen. Tokenization is the foundation upon which all higher-level NLP tasks are built, from sentiment analysis to machine translation.
Tokenization isn't a new concept. It dates back to the early days of computer science and information retrieval in the 1950s and 60s. As researchers began developing systems to process text using computers, they quickly realized the need to break text into smaller, manageable units for analysis.
Early tokenization methods were rudimentary, often just splitting text by spaces or punctuation. However, as NLP techniques advanced, so did tokenization.
Bridge between humans and machines: Tokenization converts human-readable text into a format that computers can process efficiently.
Vocabulary management: It helps create a finite vocabulary from an infinite set of possible word combinations.
Handling out-of-vocabulary (OOV) words: Advanced tokenization methods, such as subword tokenization, can handle words not seen during training.
Language agnosticism: Certain tokenization methods work across multiple languages without needing custom rules.
Improved model performance: Effective tokenization can drastically improve the performance of NLP models.
Tokenization techniques span a spectrum, balancing vocabulary size with the granularity of the tokens:
Fun Fact #3: The word "tokenization" would be split differently by these three methods (a quick demo follows the list):
Character-level: ['t', 'o', 'k', 'e', 'n', 'i', 'z', 'a', 't', 'i', 'o', 'n']
Word-level: ['tokenization']
Subword-level (e.g., BPE): ['token', 'ization']
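For a quick feel of the first two levels, character- and word-level splits are one-liners in Python; a subword split, by contrast, needs a vocabulary learned from data, which is exactly what the BPE section later in this post builds:

word = "tokenization"
print(list(word))    # character-level: ['t', 'o', 'k', 'e', 'n', 'i', 'z', 'a', 't', 'i', 'o', 'n']
print(word.split())  # word-level: ['tokenization']
# A subword split such as ['token', 'ization'] depends on merges learned from a corpus (see the BPE tokenizer below).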
While tokenization seems straightforward, it presents some interesting challenges:
Tokenization plays a critical role in the performance of modern language models like GPT-3 and BERT. These models rely heavily on subword tokenization strategies to efficiently represent vast amounts of text data while handling rare words and preventing vocabulary explosion.
As we explore various tokenization methods in this blog post, keep in mind that each technique offers a different solution to the core challenges of tokenization. From the simplicity of splitting on whitespace to the sophistication of learned subword tokenization, each method has a place in the NLP toolkit.
By the end of this journey, you'll have a comprehensive understanding of tokenization and the practical skills to implement and apply these methods in your own NLP projects. Let's get started!
The Whitespace Tokenizer is the most straightforward method for breaking down text. It simply splits the text based on spaces. While it's easy to implement and often sufficient for quick prototyping, it has some limitations that make it less ideal for more complex NLP tasks.
Imagine we have a simple sentence:
"The quick brown fox jumps over the lazy dog!"
Using whitespace tokenization, the sentence would be split into tokens at each space:
tokens = ["The", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog!"]
However, notice how the punctuation ("!") remains attached to the last word ("dog!"). This can be problematic because punctuation often needs to be handled separately, especially in tasks like sentiment analysis, machine translation, or question answering. In this case, "dog!" is treated as a single token, potentially leading to suboptimal performance.
from typing import List

class WhitespaceTokenizer:
    def __init__(self):
        self.vocab = {}          # token -> id
        self.inverse_vocab = {}  # id -> token

    def fit(self, text: str):
        """Build the vocabulary from the tokens of the given text."""
        tokens = self.tokenize(text)
        self.vocab = {token: i for i, token in enumerate(set(tokens))}
        self.inverse_vocab = {i: token for token, i in self.vocab.items()}

    def tokenize(self, text: str) -> List[str]:
        """Split the text on whitespace."""
        return text.split()

    def encode(self, text: str) -> List[int]:
        """Map tokens to ids; unseen tokens fall back to len(vocab)."""
        tokens = self.tokenize(text)
        return [self.vocab.get(token, len(self.vocab)) for token in tokens]

    def decode(self, token_ids: List[int]) -> str:
        """Map ids back to tokens; unknown ids become '<unk>'."""
        tokens = [self.inverse_vocab.get(id, '<unk>') for id in token_ids]
        return self.detokenize(tokens)

    def detokenize(self, tokens: List[str]) -> str:
        return " ".join(tokens)
Pros:
Cons:
Example:
text = "The quick brown fox jumps over the lazy dog!"
tokenizer = WhitespaceTokenizer()
tokenizer.fit(text)
tokens = tokenizer.tokenize(text)
encoded = tokenizer.encode(text)
decoded = tokenizer.decode(encoded)
print("Original:", text)
print("Tokenized:", tokens)
print("Encoded:", encoded)
print("Decoded:", decoded)
Output:
Original: The quick brown fox jumps over the lazy dog!
Tokenized: ['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog!']
Encoded: [1, 7, 2, 5, 8, 4, 6, 0, 3]
Decoded: The quick brown fox jumps over the lazy dog!
As you can see, the tokenizer splits on spaces, but "dog!" remains a single token, punctuation included. This might not be a problem for very simple tasks, but for more nuanced NLP applications, separating punctuation can be essential.
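One more limitation is worth noting: because the vocabulary is built only from the text passed to fit, any unseen word at encode time falls back to the placeholder id len(vocab), and decoding that id yields '<unk>'. Here's a quick illustration using the tokenizer fitted above (the exact ids depend on the set ordering in fit):

# Words never seen during fit ("red", "cat") all collapse to the fallback id len(vocab) == 9
new_text = "The quick red cat"
encoded_new = tokenizer.encode(new_text)
print("Encoded:", encoded_new)                    # e.g. [1, 7, 9, 9]
print("Decoded:", tokenizer.decode(encoded_new))  # The quick <unk> <unk>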
However, for more complex tasks or languages, you’ll likely need a more advanced tokenizer that can handle punctuation properly.
While the Whitespace Tokenizer splits text only by spaces, the Regex Tokenizer is much more flexible. By using regular expressions, we can define custom patterns for tokenizing text, allowing us to handle punctuation, contractions, and more complex linguistic structures.
Regular expressions (regex) provide a powerful way to define specific patterns in text. With a Regex Tokenizer, you can go beyond simply splitting text at spaces and handle more sophisticated cases, such as:
Imagine you have this sentence:
"Hello, world! It's 2024."
Using a well-crafted regex, we can split this into:
["Hello", ",", "world", "!", "It", "'s", "2024", "."]
Notice how punctuation is now separated into its own tokens, unlike with the Whitespace Tokenizer.
Here’s an implementation of a simple Regex Tokenizer:
import regex
from typing import List, Dict
from src.utils import save_vocab, load_vocab
class RegexTokenizer:
    # Built-in pre-tokenization patterns; 'gpt2' and 'gpt4' are modeled on those
    # models' tokenizer split rules, and 'improved' also keeps decimals like 3.14 intact.
    PATTERNS: Dict[str, str] = {
        'basic': r'\b\w+\b|\S',  # whole words, plus any other single non-space character
        'gpt2': r"""'(?:[sdmt]|ll|ve|re)| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+""",
        'gpt4': r"""'(?i:[sdmt]|ll|ve|re)|[^\r\n\p{L}\p{N}]?+\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]++[\r\n]*|\s*[\r\n]|\s+(?!\S)|\s+""",
        'improved': r"""'(?i:[sdmt]|ll|ve|re)|[^\r\n\p{L}\p{N}]?+\p{L}+|\p{N}+(?:[.,]\p{N}+)?| ?[^\s\p{L}\p{N}]++[\r\n]*|\s*[\r\n]|\s+(?!\S)|\s+"""
    }
def __init__(self, pattern: str = 'basic'):
if pattern in self.PATTERNS:
self.pattern = regex.compile(self.PATTERNS[pattern])
else:
self.pattern = regex.compile(pattern)
self.vocab = {}
self.inverse_vocab = {}
def fit(self, text: str):
"""Build vocabulary from the given text."""
tokens = self.tokenize(text)
# Create a set of unique tokens, preserving leading spaces
unique_tokens = set(tokens)
# Create vocabulary, assigning each unique token an index
self.vocab = {token: i for i, token in enumerate(unique_tokens)}
self.inverse_vocab = {i: token for token, i in self.vocab.items()}
def tokenize(self, text: str) -> List[str]:
"""Tokenize the input text using the specified regex pattern."""
return self.pattern.findall(text)
# encode, decode, and detokenize methods are similar to WhitespaceTokenizer
The 'basic' pattern, \b\w+\b|\S, captures whole words while treating every other non-whitespace character (punctuation and symbols) as its own token.
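To see this in action, here's a quick check of the 'basic' pattern on the earlier example sentence. Note that, unlike the idealized split shown above, this simple pattern also separates the apostrophe from the "s":

import regex

basic = regex.compile(r'\b\w+\b|\S')
print(basic.findall("Hello, world! It's 2024."))
# ['Hello', ',', 'world', '!', 'It', "'", 's', '2024', '.']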
from src.regex_tokenizer import RegexTokenizer
tokenizer = RegexTokenizer(pattern='gpt2')
text = "Don't you love tokenizers? They're amazing!"
tokenizer.fit(text)
tokens = tokenizer.tokenize(text)
encoded = tokenizer.encode(text)
decoded = tokenizer.decode(encoded)
print("Original:", text)
print("Tokenized:", tokens)
print("Encoded:", encoded)
print("Decoded:", decoded)
Output:
Original: Don't you love tokenizers? They're amazing!
Tokenized: ['Don', "'t", ' you', ' love', ' tokenizers', '?', ' They', "'re", ' amazing', '!']
Encoded: [9, 4, 5, 0, 8, 2, 3, 1, 7, 6]
Decoded: Don't you love tokenizers? They're amazing!
As you can see, the Regex Tokenizer handles punctuation correctly by separating it into individual tokens. It also splits contractions, turning "Don't" into "Don" and "'t", and "They're" into "They" and "'re", which is crucial for many NLP tasks, including machine translation and sentiment analysis.
Pros:
Cons:
Note: Despite its flexibility, regex tokenization can be overkill for simpler tasks. For basic tokenization, you might not need such fine-grained control.
Consider the following scenario: you're working on a task that involves processing product reviews. Customers might use abbreviations, contractions, and punctuation in various ways. A Whitespace Tokenizer treats "great!" and "great" as two unrelated tokens because it cannot split off the exclamation mark, but a Regex Tokenizer can separate the punctuation, making it much more effective for downstream sentiment analysis tasks.
For example:
"I love this product!!! It's amazing!"
A Regex Tokenizer can split this into:
['I', 'love', 'this', 'product', '!', '!', '!', 'It', "'s", 'amazing', '!']
With punctuation separated from words, sentiment analysis models can weigh the exclamations more heavily, which can improve the accuracy of detecting sentiment.
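As a toy illustration of why this matters, a crude intensity feature can simply count the exclamation tokens produced by the tokenizer. This is an illustrative heuristic, not part of the repository:

# Tokens produced by a regex tokenizer that splits punctuation out
tokens = ['I', 'love', 'this', 'product', '!', '!', '!', 'It', "'s", 'amazing', '!']
exclamation_count = tokens.count('!')  # 4 here; it would be 0 if "!" stayed glued to the words
print("Exclamation intensity:", exclamation_count)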
Now that we've covered basic tokenization methods like whitespace and regex, it's time to dive deeper into one of the most powerful and widely used techniques: Byte Pair Encoding (BPE). BPE strikes a balance between character-level and word-level tokenization, making it ideal for handling rare words and out-of-vocabulary (OOV) tokens in modern NLP models.
BPE is a subword tokenization technique that starts with a vocabulary of individual characters and iteratively merges the most frequent pairs of characters or subwords to form longer tokens. The goal is to achieve a compact vocabulary that captures both common words and parts of rare or unseen words, making it highly efficient for large text corpora.
Here’s the high-level process:
By merging subwords, BPE creates tokens that are large enough to capture common words, but small enough to handle rare or OOV words by breaking them into meaningful subword units.
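To make the merge step concrete, here is a tiny worked example on a toy word-frequency table, using the same space-separated representation with a </w> end-of-word marker that the implementation below uses:

from collections import defaultdict

# Toy corpus: "low" appears 5 times, "lower" twice
word_freqs = {"l o w </w>": 5, "l o w e r </w>": 2}

# Count adjacent symbol pairs, weighted by word frequency
pairs = defaultdict(int)
for word, freq in word_freqs.items():
    symbols = word.split()
    for i in range(len(symbols) - 1):
        pairs[symbols[i], symbols[i + 1]] += freq

best = max(pairs, key=pairs.get)  # ('l', 'o') and ('o', 'w') tie at 7; max keeps the first one seen
print("Most frequent pair:", best)

# Merge the best pair everywhere it occurs
merged = {w.replace(" ".join(best), "".join(best)): f for w, f in word_freqs.items()}
print(merged)  # {'lo w </w>': 5, 'lo w e r </w>': 2}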
This is why BPE and closely related subword schemes are widely used in models like GPT, BERT, and other transformer-based models: they allow the model to represent rare words efficiently without blowing up the vocabulary size.
Here's an implementation of the BPE tokenizer:
from typing import List, Dict, Tuple
from collections import defaultdict
import re
from src.utils import read_corpus, save_vocab, load_vocab
import json
class BPETokenizer:
def __init__(self, vocab_size: int = 1000):
self.vocab_size = vocab_size
self.vocab = {"<unk>": 0, "<s>": 1, "</s>": 2}
self.inverse_vocab = {0: "<unk>", 1: "<s>", 2: "</s>"}
self.merges = {}
def train(self, corpus_dir: str):
corpus = read_corpus(corpus_dir)
word_freqs = defaultdict(int)
for word in corpus.split():
word = ' '.join(word) + ' </w>'
word_freqs[word] += 1
for word in word_freqs:
for char in word.split():
if char not in self.vocab:
self.vocab[char] = len(self.vocab)
self.inverse_vocab[self.vocab[char]] = char
num_merges = self.vocab_size - len(self.vocab)
for i in range(num_merges):
pairs = self._get_stats(word_freqs)
if not pairs:
break
best = max(pairs, key=pairs.get)
self._merge_vocab(best, word_freqs)
if len(self.vocab) >= self.vocab_size:
break
def _get_stats(self, word_freqs: Dict[str, int]) -> Dict[Tuple[str, str], int]:
pairs = defaultdict(int)
for word, freq in word_freqs.items():
symbols = word.split()
for i in range(len(symbols) - 1):
pairs[symbols[i], symbols[i + 1]] += freq
return pairs
def _merge_vocab(self, pair: Tuple[str, str], word_freqs: Dict[str, int]):
bigram = ' '.join(pair)
replacement = ''.join(pair)
self.merges[bigram] = replacement
self.vocab[replacement] = len(self.vocab)
self.inverse_vocab[self.vocab[replacement]] = replacement
new_word_freqs = {}
for word, freq in word_freqs.items():
new_word = word.replace(bigram, replacement)
new_word_freqs[new_word] = freq
word_freqs.clear()
word_freqs.update(new_word_freqs)
def tokenize(self, text: str) -> List[str]:
words = text.lower().split()
tokens = []
for word in words:
word = ' '.join(word) + ' </w>'
while True:
subwords = word.split()
if len(subwords) == 1:
break
i = 0
while i < len(subwords) - 1:
bigram = ' '.join(subwords[i:i+2])
if bigram in self.merges:
subwords[i] = self.merges[bigram]
del subwords[i+1]
else:
i += 1
new_word = ' '.join(subwords)
if new_word == word:
break
word = new_word
tokens.extend(word.split())
return tokens
# encode, decode methods follow...
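Since encode and decode are omitted above, here is a minimal sketch of what they could look like, assuming the vocab and merges built by train; the repository's actual implementation may differ in details:

    def encode(self, text: str) -> List[int]:
        # Tokenize into subwords, then map each one to its id, falling back to <unk>
        return [self.vocab.get(token, self.vocab["<unk>"]) for token in self.tokenize(text)]

    def decode(self, token_ids: List[int]) -> str:
        # Map ids back to subwords, then rebuild words by turning </w> markers into spaces
        tokens = [self.inverse_vocab.get(i, "<unk>") for i in token_ids]
        return "".join(tokens).replace("</w>", " ").strip()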
Let's walk through a practical example to illustrate how BPE tokenization works in action.
Input Text:
"The quick brown fox jumps over the lazy dog"
BPE Tokenization Process:
Initial Vocabulary: At first, each word is split into individual characters, with a </w> marker appended to mark the end of the word:
['T', 'h', 'e', '</w>', 'q', 'u', 'i', 'c', 'k', '</w>', 'b', 'r', 'o', 'w', 'n', '</w>', 'f', 'o', 'x', '</w>', ...]
Merging Frequent Pairs: The most frequent pairs of characters are merged into subwords. For example, frequent combinations like “Th”, “qu”, “ick”, and “fox” may become new subword tokens.
Final Tokens:
['the</w>', 'qui', 'ck', '</w>', 'bro', 'w', 'n</w>', 'f', 'o', 'x</w>', 'ju', 'm', 'ps</w>', 'ov', 'er</w>', 'the</w>', 'la', 'z', 'y</w>', 'dog', '</w>']
Example Output:
text = "The quick brown fox jumps over the lazy dog"
tokenizer = BPETokenizer(vocab_size=1000)
tokenizer.train("data/corpus/")  # train() reads a directory of text files via read_corpus, not a raw string
tokens = tokenizer.tokenize(text)
print("Tokenized:", tokens)
Output:
Tokenized: ['the</w>', 'qui', 'ck', '</w>', 'bro', 'w', 'n</w>', 'f', 'o', 'x</w>', 'ju', 'm', 'ps</w>', 'ov', 'er</w>', 'the</w>', 'la', 'z', 'y</w>', 'dog', '</w>']
In this example, you can see how BPE gradually merges frequent character pairs to form subwords like "qui", "ck", and "the". The special </w> token indicates the end of a word, allowing BPE to preserve word boundaries. This is particularly useful for handling rare or unseen words, as BPE can break them down into smaller, meaningful subword units.
For instance, if the word “lazydog” appeared later in the text but wasn’t part of the original training data, BPE could tokenize it as:
['la', 'z', 'y</w>', 'dog', '</w>']
This capability allows BPE to handle words it hasn't seen before without resorting to an unknown token (<unk>), which is crucial for dealing with rare words in NLP tasks like machine translation and language modeling.
BPE is highly effective because it provides a flexible, scalable way to handle vast amounts of text data. It's the method of choice in models like GPT-2 and GPT-3, and closely related subword schemes (such as WordPiece) power models like BERT. By breaking down words into subwords, BPE helps these models:
Pros:
By understanding how BPE works and why it’s so widely used, you can better appreciate its role in modern NLP. Whether you’re dealing with rare words, out-of-vocabulary terms, or just aiming for a more efficient model, BPE is a powerful tool in your NLP toolkit.
As we move further into modern NLP, we encounter advanced tokenizers like the Hugging Face Tokenizer. Hugging Face's tokenizers library provides a high-performance, flexible, and easy-to-use framework for handling tokenization in large language models like BERT, GPT-2, RoBERTa, and more.
What makes Hugging Face's tokenizer unique is its ability to efficiently handle subword tokenization, vocabulary management, and encoding, while also being highly optimized for speed. By using Rust at its core, Hugging Face's tokenizers library offers fast and memory-efficient tokenization compared to standard Python implementations. It's designed for both training new tokenizers and using pre-trained ones, making it a go-to solution for modern NLP tasks.
While we've already explored methods like whitespace tokenization and BPE, Hugging Face tokenizers come with a range of pre-trained tokenizers that have been optimized for different models and use cases. Here are some reasons why Hugging Face tokenizers are so powerful:
Let's explore how to build a custom Hugging Face tokenizer using Byte-Level BPE, one of the popular subword tokenization methods.
Here's a simplified version of how you can create and train a custom Hugging Face tokenizer:
from tokenizers import Tokenizer, models, pre_tokenizers, processors, decoders, trainers
from src.utils import read_corpus
class CustomHFTokenizer:
def __init__(self, vocab_size=25000):
self.tokenizer = Tokenizer(models.BPE())
self.tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
self.trainer = trainers.BpeTrainer(
vocab_size=vocab_size,
special_tokens=["<|endoftext|>"],
show_progress=True
)
self.tokenizer.post_processor = processors.ByteLevel(trim_offsets=False)
self.tokenizer.decoder = decoders.ByteLevel()
def train(self, corpus_dir):
corpus = read_corpus(corpus_dir)
self.tokenizer.train_from_iterator([corpus], trainer=self.trainer)
def save(self, path):
self.tokenizer.save(path)
def load(self, path):
self.tokenizer = Tokenizer.from_file(path)
def encode(self, text):
return self.tokenizer.encode(text).ids
def decode(self, ids):
return self.tokenizer.decode(ids)
def tokenize(self, text):
return self.tokenizer.encode(text).tokens
How It Works
Let's see how the Hugging Face tokenizer works in action.
from transformers import BertTokenizer, GPT2Tokenizer, RobertaTokenizer, PreTrainedTokenizerFast
from src.custom_hf_tokenizer import CustomHFTokenizer
from src.utils import read_corpus
def demonstrate_tokenizer(tokenizer, text):
print(f"\nDemonstrating {tokenizer.__class__.__name__} functionality:")
print("Input text:", text)
tokens = tokenizer.tokenize(text)
ids = tokenizer.encode(text)
print("Tokenized:", tokens)
print("Encoded:", ids)
decoded = tokenizer.decode(ids)
print("Decoded:", decoded)
def main():
corpus_dir = "data/corpus/"
# Example text
example_text = "This is an example of using custom tokenizers. It can handle punctuation, numbers like 3246, and contractions (e.g., don't, we'll)."
# Custom HF Tokenizer
custom_tokenizer = CustomHFTokenizer(vocab_size=25000)
custom_tokenizer.train(corpus_dir)
custom_tokenizer.save("custom_tokenizer.json")
demonstrate_tokenizer(custom_tokenizer, example_text)
# BERT Tokenizer
bert_tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
demonstrate_tokenizer(bert_tokenizer, example_text)
# GPT-2 Tokenizer
gpt2_tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
demonstrate_tokenizer(gpt2_tokenizer, example_text)
# RoBERTa Tokenizer
roberta_tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
demonstrate_tokenizer(roberta_tokenizer, example_text)
if __name__ == "__main__":
main()
Example Output:
Demonstrating CustomHFTokenizer functionality:
Input text: This is an example of using custom tokenizers. It can handle punctuation, numbers like 3246, and contractions (e.g., don't, we'll).
Tokenized: ['This', 'Ġis', 'Ġan', 'Ġexample', 'Ġof', 'Ġusing', 'Ġcust', 'om', 'Ġtoken', 'iz', 'ers', '.', 'ĠIt', 'Ġcan', 'Ġhandle', 'Ġpunctuation', ',', 'Ġnumbers', 'Ġlike', 'Ġ3', '24', '6', ',', 'Ġand', 'Ġcontractions', 'Ġ(', 'e', '.', 'g', '.,', 'Ġdon', "'t", ',', 'Ġwe', "'ll", ').']
Encoded: [1074, 328, 198, 15353, 133, 2496, 14917, 146, 10075, 723, 373, 11, 427, 483, 6877, 16933, 9, 4476, 390, 2115, 14426, 19, 9, 130, 16504, 1140, 59, 11, 61, 14420, 478, 14417, 9, 195, 14418, 14419]
Decoded: This is an example of using custom tokenizers. It can handle punctuation, numbers like 3246, and contractions (e.g., don't, we'll).
Demonstrating BertTokenizer functionality:
Input text: This is an example of using custom tokenizers. It can handle punctuation, numbers like 3246, and contractions (e.g., don't, we'll).
Tokenized: ['this', 'is', 'an', 'example', 'of', 'using', 'custom', 'token', '##izer', '##s', '.', 'it', 'can', 'handle', 'pun', '##ct', '##uation', ',', 'numbers', 'like', '324',
'##6', ',', 'and', 'contraction', '##s', '(', 'e', '.', 'g', '.', ',', 'don', "'", 't', ',', 'we', "'", 'll', ')', '.']
Encoded: [101, 2023, 2003, 2019, 2742, 1997, 2478, 7661, 19204, 17629, 2015, 1012, 2009, 2064, 5047, 26136, 6593, 14505, 1010, 3616, 2066, 27234, 2575, 1010, 1998, 21963, 2015, 1006, 1041, 1012, 1043, 1012, 1010, 2123, 1005, 1056, 1010, 2057, 1005, 2222, 1007, 1012, 102]
Decoded: [CLS] this is an example of using custom tokenizers. it can handle punctuation, numbers like 3246, and contractions ( e. g., don't, we'll ). [SEP]
Demonstrating GPT2Tokenizer functionality:
Input text: This is an example of using custom tokenizers. It can handle punctuation, numbers like 3246, and contractions (e.g., don't, we'll).
Tokenized: ['This', 'Ġis', 'Ġan', 'Ġexample', 'Ġof', 'Ġusing', 'Ġcustom', 'Ġtoken', 'izers', '.', 'ĠIt', 'Ġcan', 'Ġhandle', 'Ġpunct', 'uation', ',', 'Ġnumbers', 'Ġlike', 'Ġ3', '246', ',', 'Ġand', 'Ġcontract', 'ions', 'Ġ(', 'e', '.', 'g', '.,', 'Ġdon', "'t", ',', 'Ġwe', "'ll", ').']
Encoded: [1212, 318, 281, 1672, 286, 1262, 2183, 11241, 11341, 13, 632, 460, 5412, 21025, 2288, 11, 3146, 588, 513, 26912, 11, 290, 2775, 507, 357, 68, 13, 70, 1539, 836, 470, 11,
356, 1183, 737]
Decoded: This is an example of using custom tokenizers. It can handle punctuation, numbers like 3246, and contractions (e.g., don't, we'll).
Demonstrating RobertaTokenizer functionality:
Input text: This is an example of using custom tokenizers. It can handle punctuation, numbers like 3246, and contractions (e.g., don't, we'll).
Tokenized: ['This', 'Ġis', 'Ġan', 'Ġexample', 'Ġof', 'Ġusing', 'Ġcustom', 'Ġtoken', 'izers', '.', 'ĠIt', 'Ġcan', 'Ġhandle', 'Ġpunct', 'uation', ',', 'Ġnumbers', 'Ġlike', 'Ġ3', '246', ',', 'Ġand', 'Ġcontract', 'ions', 'Ġ(', 'e', '.', 'g', '.,', 'Ġdon', "'t", ',', 'Ġwe', "'ll", ').']
Encoded: [0, 713, 16, 41, 1246, 9, 634, 6777, 19233, 11574, 4, 85, 64, 3679, 15760, 9762, 6, 1530, 101, 155, 30676, 6, 8, 1355, 2485, 36, 242, 4, 571, 482, 218, 75, 6, 52, 581, 322, 2]
Decoded: <s>This is an example of using custom tokenizers. It can handle punctuation, numbers like 3246, and contractions (e.g., don't, we'll).</s>
Pros and Cons of Hugging Face Tokenizers:
Pros:
Compared to the Byte Pair Encoding (BPE) method we discussed earlier, Hugging Face tokenizers take subword tokenization to the next level with optimizations for both speed and memory. They allow you to handle large datasets and complex NLP tasks with ease, while also offering the flexibility to train new tokenizers from scratch or use pre-trained ones.
Whereas BPE alone focuses on vocabulary compression, Hugging Face tokenizers integrate additional features like pre-tokenization and post-processing, which make them ideal for complex NLP tasks such as machine translation, text generation, and summarization.
Hugging Face tokenizers offer a powerful, flexible, and efficient way to handle text preprocessing in NLP tasks. Whether you’re building a custom tokenizer from scratch or using a pre-trained model, they provide the tools to handle tokenization at scale with precision and speed. This flexibility is essential for deploying real-world models that need to process massive amounts of text data quickly and accurately.
If you're working on large language models, machine translation, or any application where efficient text processing is key, Hugging Face tokenizers should be a go-to tool in your NLP toolbox.
As we continue exploring advanced tokenization methods, we arrive at SentencePiece, a highly versatile and language-agnostic tokenizer developed by Google. Unlike traditional tokenizers that rely on whitespace or predefined word boundaries, SentencePiece treats everything, including whitespace, as a token. This makes it incredibly effective for languages that don't use spaces between words, such as Japanese or Chinese.
What sets SentencePiece apart from other tokenizers is its ability to work directly from raw, unsegmented text. It doesn't require pre-tokenized input, which simplifies the preprocessing pipeline for many NLP tasks. Moreover, it offers two powerful subword tokenization algorithms: Byte Pair Encoding (BPE) and the Unigram Language Model (ULM).
In languages like English, spaces naturally separate words, making tokenization relatively straightforward. However, in languages like Japanese and Chinese, words are often written without spaces, presenting a unique challenge for tokenizers. SentencePiece addresses this by treating whitespace as a token rather than a delimiter. This approach allows it to handle text without predefined word boundaries, ensuring that even languages without spaces are tokenized effectively.
For example, in Japanese, the sentence:
こんにちは世界
will be tokenized in a way that correctly separates the words "こんにちは" (hello) and "世界" (world) without relying on spaces. This capability is one of the reasons SentencePiece has become so popular in multilingual models like Google's mT5.
SentencePiece offers two primary tokenization methods: Byte Pair Encoding (BPE) and the Unigram Language Model (ULM). Let's briefly discuss each of these approaches.
Byte Pair Encoding (BPE)
BPE, as discussed earlier, iteratively merges the most frequent pairs of characters or subwords to build a vocabulary of increasingly larger tokens. BPE is highly effective in managing vocabulary size and efficiently representing rare words by breaking them into subwords. In SentencePiece, BPE works in a similar way to what we've already covered, compressing the vocabulary by merging the most frequent character pairs.
Unigram Language Model (ULM)
The Unigram Language Model (ULM) is a probabilistic tokenization approach that works differently from BPE. Instead of starting with single characters and merging pairs, ULM starts with a large set of subword candidates and gradually removes less likely subwords based on their probabilities. The model selects the sequence of tokens that maximizes the probability of the input sentence, according to a unigram distribution.
Here's how ULM works:
Unigram is often preferred when you want a more flexible tokenization method that can better adapt to the structure of the language. It is particularly effective in situations where the data contains a mix of languages, scripts, or writing systems.
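To make the selection step concrete, here is a minimal, illustrative sketch of unigram segmentation: given per-piece log-probabilities for a toy vocabulary (the numbers are made up), it uses dynamic programming to pick the segmentation of a word with the highest total log-probability. The real SentencePiece trainer also learns and prunes this vocabulary iteratively, which is not shown here:

import math

# Toy unigram vocabulary with made-up probabilities
log_probs = {
    "un": math.log(0.10), "related": math.log(0.05), "re": math.log(0.08),
    "lated": math.log(0.01), "u": math.log(0.02), "n": math.log(0.02),
    "unrelated": math.log(0.001),
}

def best_segmentation(word):
    # best[i] = (score of the best segmentation of word[:i], its tokens)
    best = [(0.0, [])] + [(-math.inf, None)] * len(word)
    for end in range(1, len(word) + 1):
        for start in range(end):
            piece = word[start:end]
            if piece in log_probs and best[start][1] is not None:
                score = best[start][0] + log_probs[piece]
                if score > best[end][0]:
                    best[end] = (score, best[start][1] + [piece])
    return best[len(word)][1]

print(best_segmentation("unrelated"))  # ['un', 'related'] scores higher than ['unrelated'] or ['un', 're', 'lated']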
Here's how you can build and train a SentencePiece tokenizer using the Unigram model. SentencePiece makes it easy to switch between BPE and Unigram by changing a single parameter.
import sentencepiece as spm
from src.utils import read_corpus
import os
class CustomSPTokenizer:
def __init__(self, vocab_size=8000, model_type='unigram', character_coverage=0.9995, max_sentence_length=4192):
self.vocab_size = vocab_size
self.model_type = model_type
self.character_coverage = character_coverage
self.max_sentence_length = max_sentence_length
self.sp = None
self.model_prefix = 'spm_model'
def train(self, corpus_dir):
text = read_corpus(corpus_dir)
with open('temp_corpus.txt', 'w', encoding='utf-8') as f:
f.write(text)
spm.SentencePieceTrainer.train(
input='temp_corpus.txt',
model_prefix=self.model_prefix,
vocab_size=self.vocab_size,
model_type=self.model_type,
character_coverage=self.character_coverage,
max_sentence_length=self.max_sentence_length,
pad_id=0,
unk_id=1,
bos_id=2,
eos_id=3,
pad_piece='[PAD]',
unk_piece='[UNK]',
bos_piece='[BOS]',
eos_piece='[EOS]'
)
self.sp = spm.SentencePieceProcessor()
self.sp.load(f'{self.model_prefix}.model')
def save(self, path):
if self.sp:
# Copy the trained model file to the specified path
import shutil
shutil.copy(f'{self.model_prefix}.model', path)
print(f"Model saved to {path}")
else:
print("No model to save. Train the tokenizer first.")
def load(self, path):
self.sp = spm.SentencePieceProcessor()
self.sp.load(path)
def encode(self, text):
return self.sp.encode_as_ids(text)
def decode(self, ids):
return self.sp.decode_ids(ids)
def tokenize(self, text):
return self.sp.encode_as_pieces(text)
How It Works
Let's train and test the SentencePiece Tokenizer with the Unigram Language Model on a sample text.
from src.custom_sp_tokenizer import CustomSPTokenizer
def demonstrate_tokenizer(tokenizer, text):
print("\nDemonstrating SentencePiece Tokenizer functionality:")
print("Input text:", text)
tokens = tokenizer.tokenize(text)
print("Tokenized:", tokens)
encoded = tokenizer.encode(text)
print("Encoded:", encoded)
decoded = tokenizer.decode(encoded)
print("Decoded:", decoded)
def main():
corpus_dir = "data/corpus/sp/"
model_path = "sp_vocab_en_jp.model"
# Initialize and train the tokenizer
tokenizer = CustomSPTokenizer(
vocab_size=300,
model_type='unigram', # You can also try 'bpe'
character_coverage=0.9995,
max_sentence_length=4192
)
print("Training SentencePiece Tokenizer...")
tokenizer.train(corpus_dir)
print("SentencePiece Tokenizer trained.")
# Save the trained model
tokenizer.save(model_path)
# Load the trained model
new_tokenizer = CustomSPTokenizer()
new_tokenizer.load(model_path)
print(f"Model loaded from {model_path}")
# Example text
example_text = "Hello, world! こんにちは世界！"
demonstrate_tokenizer(new_tokenizer, example_text)
if __name__ == "__main__":
main()
Example Output:
Demonstrating SentencePiece Tokenizer functionality:
Input text: Hello, world! こんにちは世界！
Tokenized: ['▁H', 'e', 'l', 'l', 'o', ',', '▁world', '!', '▁', 'こ', 'ん', 'に', 'ち', 'は', '世界', '!']
Encoded: [92, 5, 15, 15, 33, 173, 96, 86, 4, 87, 166, 64, 165, 9, 62, 86]
Decoded: Hello, world! こんにちは世界!
SentencePiece excels in several scenarios, making it highly versatile for NLP tasks:
Pros:
Cons:
SentencePiece is a highly effective tokenization method that handles a wide range of languages, scripts, and writing systems. Whether you're building a multilingual model or working with text that lacks clear word boundaries, SentencePiece offers a solution that balances flexibility with efficiency. Its support for both BPE and Unigram tokenization makes it versatile enough to handle almost any tokenization need.
For tasks that involve handling rare words, multilingual data, or raw text, SentencePiece is an invaluable tool. It's no wonder that models like mT5 and XLM-R leverage SentencePiece for their tokenization needs.
We’ve embarked on an exciting journey through the world of tokenization, from the simplicity of whitespace splitting to the complexity of SentencePiece. Each tokenizer we’ve explored represents a step in the evolution of text processing for NLP:
Each method has its strengths and use cases, highlighting the importance of choosing the right tokenizer for your specific task and data.
Remember, while understanding these concepts is crucial, leveraging established libraries like Hugging Face Tokenizers or SentencePiece is often the best approach for real-world applications. These libraries offer optimized implementations and additional features that can significantly improve your NLP projects.
To deepen your understanding of tokenization and stay up-to-date with the latest developments, consider exploring these resources:
BPE Original Paper: “Neural Machine Translation of Rare Words with Subword Units” by Sennrich et al. (2016). Link
SentencePiece Paper: “SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing” by Kudo and Richardson (2018). Link
Hugging Face Tokenizers Library: Official Documentation
“Subword Regularization” by Kudo (2018): Link
“BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding” by Devlin et al. (2019): Link
As we conclude our tokenization adventure, remember that the field of NLP is constantly evolving. New tokenization methods and improvements to existing ones are always on the horizon. The key is to understand the fundamental concepts we’ve covered here and stay curious about new developments.
Whether you’re building a chatbot, working on machine translation, or diving into sentiment analysis, effective tokenization is your first step towards success in NLP. So go forth, tokenize wisely, and may your models be ever accurate and your training times short!