" MicromOne: Text Generation with PyTorch: Character-Level vs Subword Tokenization

Pagine

Text Generation with PyTorch: Character-Level vs Subword Tokenization

Artificial Intelligence has made enormous progress in Natural Language Processing (NLP), especially in the area of text generation. Modern systems can generate realistic text by learning patterns from large datasets.

In this tutorial, we will build a text generation model in PyTorch, trained on a dataset of Shakespeare's works. We will explore two approaches:

  1. Character-level text generation

  2. Subword tokenization using Hugging Face

By the end of this article, you will understand how language models generate text token by token.

Dataset: Shakespeare Text

To train the model, we use a dataset containing Shakespeare's writings.

DATA_FILE = '../data/shakespeare_small.txt'

with open(DATA_FILE, 'r', encoding='utf-8') as data_file:
    raw_text = data_file.read()

print(f'Number of characters in text file: {len(raw_text):,}')

This dataset contains thousands of characters that will be used to train our neural network to predict the next token in a sequence.

Character-Based Text Generation

In the first approach, each character is treated as a token.

Example:

"hello" → ['h', 'e', 'l', 'l', 'o']

The model learns to predict the next character based on the previous ones.

Step 1 — Text Normalization

Before tokenization, we normalize the text.

def normalize_text(text: str) -> str:
    normalized_text = text.lower()
    return normalized_text

This converts all text to lowercase while keeping punctuation and special characters.

Step 2 — Pretokenization

Next, we split the text into characters.

def pretokenize_text(text: str) -> list[str]:
    smaller_pieces = [char for char in text]
    return smaller_pieces

Step 3 — Tokenization

Now we combine normalization and pretokenization.

def tokenize_text(text: str) -> list[str]:
    normalized_text = normalize_text(text)
    pretokenized_text = pretokenize_text(normalized_text)
    tokenized_text = pretokenized_text
    return tokenized_text
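A quick check of the full pipeline (the helper bodies are repeated here so the snippet runs on its own):

```python
def normalize_text(text: str) -> str:
    return text.lower()

def pretokenize_text(text: str) -> list[str]:
    return [char for char in text]

def tokenize_text(text: str) -> list[str]:
    # Normalize first, then split into single-character tokens.
    return pretokenize_text(normalize_text(text))

print(tokenize_text('Hello!'))  # ['h', 'e', 'l', 'l', 'o', '!']
```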

Step 4 — Encoding Tokens into IDs

We convert tokens into integer IDs.

encoded_text, character_mapping = encode_text(raw_text, tokenize_text)

The mapping allows conversion between characters and integer IDs.

character → integer ID
integer ID → character
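The body of `encode_text` is not shown in this tutorial; as a rough idea of what such a helper does, here is a minimal sketch (the implementation details are an assumption):

```python
def encode_text(text, tokenize_fn):
    # Hypothetical sketch of an encode_text helper: build a vocabulary
    # from the unique tokens, then map every token to its integer ID.
    tokens = tokenize_fn(text)
    vocab = sorted(set(tokens))
    token_mapping = {token: idx for idx, token in enumerate(vocab)}
    encoded = [token_mapping[token] for token in tokens]
    return encoded, token_mapping

encoded, mapping = encode_text('abba', lambda s: list(s.lower()))
# encoded → [0, 1, 1, 0], mapping → {'a': 0, 'b': 1}
```

The reverse direction (integer ID → character) is then just the inverted dictionary, e.g. `{i: t for t, i in mapping.items()}`.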

Preparing the Dataset

Next, we split the encoded text into sequences.

sequence_length = 32
batch_size = 32

train_dataset = ShakespeareDataset(encoded_text, sequence_length)

train_loader = DataLoader(
    train_dataset,
    shuffle=False,
    batch_size=batch_size,
)

Each sequence contains 32 characters, and the model learns to predict the next character.
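`ShakespeareDataset` is a custom class whose definition is not shown above; a plausible minimal version (an assumption, not the exact implementation) pairs each window of token IDs with the same window shifted one position to the right:

```python
import torch
from torch.utils.data import Dataset

class ShakespeareDataset(Dataset):
    # Hypothetical sketch: item i is (IDs[i : i+L], IDs[i+1 : i+L+1]),
    # so the target at each position is simply the next token.
    def __init__(self, encoded_text, sequence_length):
        self.data = torch.tensor(encoded_text, dtype=torch.long)
        self.sequence_length = sequence_length

    def __len__(self):
        return len(self.data) - self.sequence_length

    def __getitem__(self, idx):
        x = self.data[idx : idx + self.sequence_length]
        y = self.data[idx + 1 : idx + self.sequence_length + 1]
        return x, y
```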

Building the Model

We create a neural network using a helper function.

model = build_model(n_tokens)
model.to(device)

criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters())

CrossEntropyLoss is used for classification tasks, while the Adam optimizer updates model weights during training.
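The body of `build_model` is also left to a helper; a common minimal architecture for this task (a sketch, with assumed layer sizes) is an embedding layer followed by an LSTM and a linear head over the vocabulary:

```python
import torch.nn as nn

def build_model(n_tokens, embedding_dim=64, hidden_dim=128):
    # Hypothetical sketch of build_model: embed token IDs, run an LSTM,
    # then score every vocabulary token at each position.
    class CharModel(nn.Module):
        def __init__(self):
            super().__init__()
            self.embedding = nn.Embedding(n_tokens, embedding_dim)
            self.lstm = nn.LSTM(embedding_dim, hidden_dim, batch_first=True)
            self.fc = nn.Linear(hidden_dim, n_tokens)

        def forward(self, x):
            # x: (batch, seq_len) -> logits: (batch, seq_len, n_tokens)
            embedded = self.embedding(x)
            output, _ = self.lstm(embedded)
            return self.fc(output)

    return CharModel()
```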

Text Generation Function (Character-Based)

The following function generates text character by character.

def generate_text_by_char(
    input_str: str,
    model,
    token_mapping,
    num_chars: int = 100,
    temperature: float = 1.0,
    topk: int | None = None,
):

    tokenized_text = tokenize_text(input_str)
    generated_tokens = []

    for _ in range(num_chars):

        new_char = next_token(
            tokenized_text=(tokenized_text + generated_tokens),
            model=model,
            token_mapping=token_mapping,
            temperature=temperature,
            topk=topk,
            device=device,
        )

        generated_tokens.append(new_char)

    full_text = ''.join(tokenized_text + generated_tokens)
    return full_text

Training the Model

Now we train the model.

TEST_PHRASE = 'To be or not to be'
epochs = 5 if device == 'cpu' else 25

start = start_time()

for epoch in range(epochs):

    model.train()
    total_loss = 0

    for X_batch, y_batch in train_loader:

        optimizer.zero_grad()

        output = model(X_batch.to(device))

        loss = criterion(output.transpose(1, 2), y_batch.to(device))

        loss.backward()

        optimizer.step()

        total_loss += loss.item()

    print(f'Epoch {epoch + 1}/{epochs}, Loss: {total_loss / len(train_loader)}')

    gen_output = generate_text_by_char(
        input_str=TEST_PHRASE,
        model=model,
        token_mapping=character_mapping,
        num_chars=100,
    )

    print(gen_output)

During training, we generate text after each epoch to observe improvements.
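One detail worth noting in the loop above: `nn.CrossEntropyLoss` expects the class dimension in position 1, i.e. shape `(batch, n_tokens, seq_len)` for sequence targets, while the model emits `(batch, seq_len, n_tokens)`. That is why the loss call uses `transpose(1, 2)`. With illustrative shapes:

```python
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()

logits = torch.randn(32, 32, 65)          # (batch, seq_len, n_tokens)
targets = torch.randint(0, 65, (32, 32))  # (batch, seq_len)

# CrossEntropyLoss wants (batch, n_tokens, seq_len) for sequence targets,
# so the class dimension is moved into position 1 before the call.
loss = criterion(logits.transpose(1, 2), targets)
print(loss)  # a scalar tensor
```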

Generating Text

Once the model is trained, we can generate new text.

output = generate_text_by_char(
    input_str='To be or not to be',
    model=model,
    token_mapping=character_mapping,
    num_chars=100,
    temperature=1.0,
    topk=None,
)

print(output)

Adjusting the sampling parameters changes how creative the output is. Temperature controls randomness (higher values flatten the distribution and produce more varied text), while top-k sampling restricts selection to the k most likely tokens.
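`next_token` is another helper not shown in this article; the real one also runs the model and translates tokens to IDs, but the sampling logic it applies to a vector of logits can be sketched like this:

```python
import torch

def sample_from_logits(logits, temperature=1.0, topk=None):
    # Temperature scaling: values < 1 sharpen the distribution,
    # values > 1 flatten it toward uniform.
    logits = logits / temperature
    if topk is not None:
        # Keep only the topk highest-scoring tokens; mask out the rest.
        values, _ = torch.topk(logits, topk)
        logits[logits < values[-1]] = float('-inf')
    probs = torch.softmax(logits, dim=-1)
    # Draw one token index from the resulting distribution.
    return torch.multinomial(probs, num_samples=1).item()
```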

Subword-Based Text Generation

Character-based models are simple but inefficient. Modern NLP models use subword tokenization.

We use a tokenizer from the Transformers library.

Loading a Hugging Face Tokenizer

from transformers import AutoTokenizer

model_name = 'bert-base-uncased'

my_tokenizer = AutoTokenizer.from_pretrained(
    model_name,
)

This tokenizer splits words into subword units.

Example:

playing → play + ##ing
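BERT's tokenizer uses the WordPiece scheme, which splits a word by greedy longest-match-first against its vocabulary. A toy sketch with a hypothetical three-entry vocabulary:

```python
def wordpiece_tokenize(word, vocab):
    # Greedy longest-match-first: repeatedly take the longest prefix
    # present in the vocabulary; continuation pieces carry a '##' prefix.
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = '##' + piece
            if piece in vocab:
                tokens.append(piece)
                break
            end -= 1
        else:
            return ['[UNK]']  # no prefix matched at all
        start = end
    return tokens

print(wordpiece_tokenize('playing', {'play', '##ing', '##i'}))
# ['play', '##ing']
```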

Encoding Text with the Tokenizer

encoded_text, token_mapping = encode_text_from_tokenizer(
    text=raw_text,
    tokenizer=my_tokenizer,
)

Each token is converted into an integer ID.

Preparing the Dataset

sequence_length = 32
batch_size = 32

train_dataset = ShakespeareDataset(encoded_text, sequence_length)

train_loader = DataLoader(
    train_dataset,
    shuffle=False,
    batch_size=batch_size,
)

Subword Text Generation Function

def generate_text_by_subword(
    input_str: str,
    model,
    token_mapping,
    tokenizer,
    num_tokens: int = 100,
    temperature: float = 1.0,
    topk: int | None = None,
):

    tokenized_text = tokenize_text_from_tokenizer(
        tokenizer=tokenizer,
        text=input_str,
    )

    generated_tokens = []

    for _ in range(num_tokens):

        new_token = next_token(
            tokenized_text=(tokenized_text + generated_tokens),
            model=model,
            token_mapping=token_mapping,
            temperature=temperature,
            topk=topk,
            device=device,
        )

        generated_tokens.append(new_token)

    output_ids = tokenizer.convert_tokens_to_ids(
        tokenized_text + generated_tokens
    )

    full_text = tokenizer.decode(output_ids)

    return full_text

Generating Subword Text

output = generate_text_by_subword(
    input_str='To be or not to be',
    model=model,
    token_mapping=token_mapping,
    tokenizer=my_tokenizer,
    num_tokens=30,
    temperature=1.5,
    topk=100,
)

print(output)

This approach generally produces more coherent text than character-level models.

Character vs Subword Models

Character-based models are simple and need only a tiny vocabulary, but they must generate many more tokens to produce the same amount of text, which makes them slower and makes long-range structure harder to learn.

Subword-based models are more efficient and scale better, but they require a trained tokenizer.

Most modern NLP models use subword tokenization because it balances vocabulary size and language coverage.