Artificial Intelligence has made enormous progress in Natural Language Processing (NLP), especially in text generation. Modern systems can generate realistic text by learning patterns from large datasets.
In this tutorial, we will build a text generation model in PyTorch, trained on a dataset of Shakespeare's works. We will explore two approaches:
Character-level text generation
Subword tokenization using Hugging Face
By the end of this article, you will understand how language models generate text token by token.
Dataset: Shakespeare Text
To train the model, we use a dataset containing Shakespeare's writings.
DATA_FILE = '../data/shakespeare_small.txt'
with open(DATA_FILE, 'r') as data_file:
    raw_text = data_file.read()
print(f'Number of characters in text file: {len(raw_text):,}')
This dataset contains thousands of characters that will be used to train our neural network to predict the next token in a sequence.
Character-Based Text Generation
In the first approach, each character is treated as a token.
Example:
"hello" → ['h', 'e', 'l', 'l', 'o']
The model learns to predict the next character based on the previous ones.
Step 1 — Text Normalization
Before tokenization, we normalize the text.
def normalize_text(text: str) -> str:
    normalized_text = text.lower()
    return normalized_text
This converts all text to lowercase while keeping punctuation and special characters.
Step 2 — Pretokenization
Next, we split the text into characters.
def pretokenize_text(text: str) -> list[str]:
    smaller_pieces = [char for char in text]
    return smaller_pieces
Step 3 — Tokenization
Now we combine normalization and pretokenization.
def tokenize_text(text: str) -> list[str]:
    normalized_text = normalize_text(text)
    pretokenized_text = pretokenize_text(normalized_text)
    tokenized_text = pretokenized_text
    return tokenized_text
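Combined, the pipeline turns a short input like "Hello!" into a list of lowercase characters. The functions are repeated here so the snippet runs on its own:

```python
def normalize_text(text: str) -> str:
    # Lowercase only; punctuation and special characters are preserved
    return text.lower()

def pretokenize_text(text: str) -> list[str]:
    # Split the string into individual characters
    return list(text)

def tokenize_text(text: str) -> list[str]:
    return pretokenize_text(normalize_text(text))

print(tokenize_text('Hello!'))
# → ['h', 'e', 'l', 'l', 'o', '!']
```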
Step 4 — Encoding Tokens into IDs
We convert tokens into integer IDs.
encoded_text, character_mapping = encode_text(raw_text, tokenize_text)
The mapping allows conversion between characters and integer IDs.
character → integer ID
integer ID → character
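The `encode_text` helper ships with the tutorial's support code and is not shown here. A minimal sketch of what such a helper might do (the real implementation may differ, e.g. in how the mapping is represented; here it is assumed to be a pair of dictionaries):

```python
def encode_text(text: str, tokenize_fn):
    # Hypothetical reimplementation for illustration only.
    tokens = tokenize_fn(text)
    vocab = sorted(set(tokens))
    # Two-way mapping: token -> integer ID and integer ID -> token
    token_to_id = {tok: i for i, tok in enumerate(vocab)}
    id_to_token = {i: tok for tok, i in token_to_id.items()}
    encoded = [token_to_id[tok] for tok in tokens]
    return encoded, (token_to_id, id_to_token)

encoded, (char_to_id, id_to_char) = encode_text('abba', list)
print(encoded)  # → [0, 1, 1, 0]
```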
Preparing the Dataset
Next, we split the encoded text into sequences.
sequence_length = 32
batch_size = 32
train_dataset = ShakespeareDataset(encoded_text, sequence_length)
train_loader = DataLoader(
    train_dataset,
    shuffle=False,
    batch_size=batch_size,
)
Each sequence contains 32 characters, and the model learns to predict the next character.
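`ShakespeareDataset` is also a tutorial helper. Conceptually, each item pairs a window of `sequence_length` token IDs with the same window shifted one step ahead, so the target at every position is the next token. A simplified, list-based sketch (the real class would subclass `torch.utils.data.Dataset` and return tensors):

```python
class ShakespeareDataset:
    # Simplified sketch; the actual helper returns PyTorch tensors.
    def __init__(self, encoded_text, sequence_length):
        self.data = encoded_text
        self.seq_len = sequence_length

    def __len__(self):
        # Number of complete (input, target) windows
        return len(self.data) - self.seq_len

    def __getitem__(self, idx):
        x = self.data[idx:idx + self.seq_len]          # input window
        y = self.data[idx + 1:idx + self.seq_len + 1]  # shifted by one
        return x, y

ds = ShakespeareDataset([0, 1, 2, 3, 4], sequence_length=3)
x, y = ds[0]
print(x, y)  # → [0, 1, 2] [1, 2, 3]
```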
Building the Model
We create a neural network using a helper function.
model = build_model(n_tokens)
model.to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters())
CrossEntropyLoss is used for classification tasks, while the Adam optimizer updates model weights during training.
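`build_model` likewise comes from the helper code. One plausible architecture for next-token prediction (an illustrative assumption, not necessarily the tutorial's exact definition) is an embedding layer feeding an LSTM, with a linear head producing vocabulary-sized logits at every position:

```python
import torch
from torch import nn

def build_model(n_tokens: int, embed_dim: int = 64, hidden_dim: int = 128):
    # Hypothetical sketch of the helper; layer sizes are illustrative.
    class CharLSTM(nn.Module):
        def __init__(self):
            super().__init__()
            self.embed = nn.Embedding(n_tokens, embed_dim)
            self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
            self.fc = nn.Linear(hidden_dim, n_tokens)

        def forward(self, x):
            out, _ = self.lstm(self.embed(x))
            # Logits per position: (batch, seq_len, n_tokens)
            return self.fc(out)

    return CharLSTM()

model = build_model(n_tokens=30)
logits = model(torch.zeros(2, 32, dtype=torch.long))
print(logits.shape)  # → torch.Size([2, 32, 30])
```

An output of shape `(batch, seq_len, n_tokens)` is consistent with a training loop that transposes it to `(batch, n_tokens, seq_len)` before passing it to `CrossEntropyLoss`.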
Text Generation Function (Character-Based)
The following function generates text character by character.
def generate_text_by_char(
    input_str: str,
    model,
    token_mapping,
    num_chars: int = 100,
    temperature: float = 1.0,
    topk: int | None = None,
):
    tokenized_text = tokenize_text(input_str)
    generated_tokens = []
    for _ in range(num_chars):
        new_char = next_token(
            tokenized_text=(tokenized_text + generated_tokens),
            model=model,
            token_mapping=token_mapping,
            temperature=temperature,
            topk=topk,
            device=device,
        )
        generated_tokens.append(new_char)
    full_text = ''.join(tokenized_text + generated_tokens)
    return full_text
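The `next_token` helper used above is also part of the support code. A hedged sketch of the sampling it performs (the mapping format and details are assumptions): run the model on the current IDs, divide the last position's logits by the temperature, optionally mask everything outside the top-k logits, then sample from the softmax:

```python
import torch

def next_token(tokenized_text, model, token_mapping, temperature=1.0,
               topk=None, device='cpu'):
    # Hypothetical reimplementation; assumes token_mapping is a
    # (token_to_id, id_to_token) pair of dictionaries.
    token_to_id, id_to_token = token_mapping
    ids = torch.tensor([[token_to_id[t] for t in tokenized_text]],
                       device=device)
    model.eval()
    with torch.no_grad():
        # Last position's logits, scaled by temperature
        logits = model(ids)[0, -1] / temperature
    if topk is not None:
        # Mask everything outside the k most likely tokens
        kth = torch.topk(logits, topk).values[-1]
        logits[logits < kth] = float('-inf')
    probs = torch.softmax(logits, dim=-1)
    choice = torch.multinomial(probs, num_samples=1).item()
    return id_to_token[choice]
```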
Training the Model
Now we train the model.
TEST_PHRASE = 'To be or not to be'
epochs = 5 if device == 'cpu' else 25
start = start_time()
for epoch in range(epochs):
    model.train()
    total_loss = 0
    for X_batch, y_batch in train_loader:
        optimizer.zero_grad()
        output = model(X_batch.to(device))
        loss = criterion(output.transpose(1, 2), y_batch.to(device))
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    print(f'Epoch {epoch + 1}/{epochs}, Loss: {total_loss / len(train_loader):.4f}')
    gen_output = generate_text_by_char(
        input_str=TEST_PHRASE,
        model=model,
        token_mapping=character_mapping,
        num_chars=100,
    )
    print(gen_output)
During training, we generate text after each epoch to observe improvements.
Generating Text
Once the model is trained, we can generate new text.
output = generate_text_by_char(
    input_str='To be or not to be',
    model=model,
    token_mapping=character_mapping,
    num_chars=100,
    temperature=1.0,
    topk=None,
)
print(output)
Adjusting the sampling parameters changes how varied the output is: temperature controls randomness, while top-k sampling restricts selection to the k most likely tokens.
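To see the temperature effect concretely: logits are divided by the temperature before the softmax, so values below 1 sharpen the distribution toward the most likely token, while values above 1 flatten it. A small standalone illustration with toy logits (no model needed):

```python
import math

def softmax_with_temperature(logits, temperature):
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]
cold = softmax_with_temperature(logits, 0.5)  # sharper distribution
hot = softmax_with_temperature(logits, 2.0)   # flatter distribution
# The top token's probability grows as temperature drops:
print(round(cold[0], 3), round(hot[0], 3))  # → 0.864 0.502
```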
Subword-Based Text Generation
Character-based models are simple but inefficient. Modern NLP models use subword tokenization.
We use a tokenizer from the Transformers library.
Loading a Hugging Face Tokenizer
from transformers import AutoTokenizer
model_name = 'bert-base-uncased'
my_tokenizer = AutoTokenizer.from_pretrained(
    model_name,
)
This tokenizer splits words into subword units.
Example:
playing → play + ##ing
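Under the hood, WordPiece tokenization greedily matches the longest vocabulary entry from the start of each word, marking word-internal pieces with `##`. A toy reimplementation over a tiny hand-picked vocabulary (illustrative only; the real `bert-base-uncased` vocabulary has roughly 30,000 entries):

```python
def wordpiece(word, vocab):
    # Greedy longest-match-first, as WordPiece does at inference time.
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = '##' + candidate  # continuation marker
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ['[UNK]']  # no match at all: unknown token
        pieces.append(piece)
        start = end
    return pieces

vocab = {'play', '##ing', '##ed', 'run'}
print(wordpiece('playing', vocab))  # → ['play', '##ing']
```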
Encoding Text with the Tokenizer
encoded_text, token_mapping = encode_text_from_tokenizer(
text=raw_text,
tokenizer=my_tokenizer,
)
Each token is converted into an integer ID.
Preparing the Dataset
sequence_length = 32
batch_size = 32
train_dataset = ShakespeareDataset(encoded_text, sequence_length)
train_loader = DataLoader(
    train_dataset,
    shuffle=False,
    batch_size=batch_size,
)
As in the character-based setup, a model is then built for the new (larger) subword vocabulary and trained with the same loop as before.
Subword Text Generation Function
def generate_text_by_subword(
    input_str: str,
    model,
    token_mapping,
    tokenizer,
    num_tokens: int = 100,
    temperature: float = 1.0,
    topk: int | None = None,
):
    tokenized_text = tokenize_text_from_tokenizer(
        tokenizer=tokenizer,
        text=input_str,
    )
    generated_tokens = []
    for _ in range(num_tokens):
        new_token = next_token(
            tokenized_text=(tokenized_text + generated_tokens),
            model=model,
            token_mapping=token_mapping,
            temperature=temperature,
            topk=topk,
            device=device,
        )
        generated_tokens.append(new_token)
    output_ids = tokenizer.convert_tokens_to_ids(
        tokenized_text + generated_tokens
    )
    full_text = tokenizer.decode(output_ids)
    return full_text
Generating Subword Text
output = generate_text_by_subword(
    input_str='To be or not to be',
    model=model,
    token_mapping=token_mapping,
    tokenizer=my_tokenizer,
    num_tokens=30,
    temperature=1.5,
    topk=100,
)
print(output)
This approach generally produces more coherent text than character-level models.
Character vs Subword Models
Character-based models are simple and easy to implement but slower and less efficient.
Subword-based models are more powerful and scalable but require a tokenizer.
Most modern NLP models use subword tokenization because it balances vocabulary size and language coverage.
