Building a News Classifier from Scratch with a Custom Transformer Model 🧠
Ever wondered how news apps categorize articles so accurately? It's often done using Transformers, a powerful neural network architecture that forms the backbone of modern language understanding. In this post, we'll build a news category classifier from the ground up, using our own custom Transformer. We'll explore the key components, prepare a real-world dataset, and train our model to classify news articles into one of 42 categories.
1. The Dataset: News Category Dataset
Our journey starts with the News Category Dataset from Kaggle, a large collection of news headlines and short descriptions. The first step is to prepare this text for our model.
We combine the headline and short_description columns into a single full_text column. We then create a numerical mapping for each unique news category.
# Combine headline and short_description
df['full_text'] = df['headline'] + ' ' + df['short_description']
# Create numerical labels from categories
unique_categories = df['category'].unique().tolist()
category_to_id = {cat: i for i, cat in enumerate(unique_categories)}
df['category_id'] = df['category'].map(category_to_id)
Next, we convert the text into a numerical format using a tokenizer. We fit a Tokenizer on our combined text, convert each article into a sequence of numerical IDs, and then use pad_sequences to ensure every sequence has a uniform length (in our case, 55 tokens) for batch processing.
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
vocab_size = 30000
tokenizer = Tokenizer(num_words=vocab_size, oov_token="<OOV>")
tokenizer.fit_on_texts(df['full_text'])
sequences = tokenizer.texts_to_sequences(df['full_text'])
max_sequence_length = 55
padded_sequences = pad_sequences(sequences, maxlen=max_sequence_length)
Finally, we split the padded sequences and their corresponding category IDs into training and validation sets using train_test_split from scikit-learn, which is crucial for evaluating our model's performance on unseen data.
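A minimal sketch of that split (the 80/20 ratio and random seed here are illustrative assumptions, not values stated in the post):

from sklearn.model_selection import train_test_split

# Hold out a validation set; test_size and random_state are illustrative choices
X_train, X_val, y_train, y_val = train_test_split(
    padded_sequences, df['category_id'],
    test_size=0.2, random_state=42
)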
2. The Model Architecture: A Custom Transformer
Our custom model is built in PyTorch and consists of three main components: a custom attention layer, a Transformer block, and the final classification model.
The Custom Attention Layer
The CustomAttention class is the heart of the Transformer: it allows the model to weigh the importance of different words in a sequence.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CustomAttention(nn.Module):
    def __init__(self, embed_dim):
        super().__init__()
        self.embed_dim = embed_dim  # stored so forward() can scale the scores
        self.query = nn.Linear(embed_dim, embed_dim)
        self.key = nn.Linear(embed_dim, embed_dim)
        self.value = nn.Linear(embed_dim, embed_dim)

    def forward(self, x):
        q = self.query(x)
        k = self.key(x)
        v = self.value(x)
        # Calculate attention scores via matrix multiplication, scaled by sqrt(embed_dim)
        attn_scores = torch.matmul(q, k.transpose(-2, -1)) / torch.sqrt(
            torch.tensor(self.embed_dim, dtype=torch.float32))
        # Apply softmax to get attention weights
        attn_weights = F.softmax(attn_scores, dim=-1)
        # Apply weights to values
        output = torch.matmul(attn_weights, v)
        return output, attn_weights
This module uses three nn.Linear layers to project the input into query, key, and value vectors. The core attention calculation is a torch.matmul between the queries and the transposed keys, producing a matrix of scores that shows the relevance of each token to every other token. F.softmax then turns these scores into a probability distribution, which is used to form a weighted sum of the value vectors, producing the final output.
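As a quick sanity check of the wiring (the batch size and embedding size below are illustrative, not values from the post), the layer maps a (batch, seq_len, embed_dim) tensor to an output of the same shape plus a (batch, seq_len, seq_len) weight matrix:

# Hypothetical shapes, just to verify the layer's input/output contract
attn = CustomAttention(embed_dim=128)
dummy = torch.randn(4, 55, 128)       # (batch, seq_len, embed_dim)
out, weights = attn(dummy)
print(out.shape)      # torch.Size([4, 55, 128])
print(weights.shape)  # torch.Size([4, 55, 55])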
The Transformer Block
The TransformerBlock is a modular unit that combines the attention mechanism with a feed-forward network, residual connections, and layer normalization.
class TransformerBlock(nn.Module):
    def __init__(self, embed_dim, ff_dim, dropout=0.1):
        super().__init__()
        self.attention = CustomAttention(embed_dim)
        self.norm1 = nn.LayerNorm(embed_dim)
        self.norm2 = nn.LayerNorm(embed_dim)
        # The FFN body was elided in the post; the standard two-layer
        # expand-and-project structure is assumed here
        self.ffn = nn.Sequential(
            nn.Linear(embed_dim, ff_dim),
            nn.ReLU(),
            nn.Linear(ff_dim, embed_dim),
        )
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # Apply attention, residual connection, and layer normalization
        attn_output, _ = self.attention(x)
        x = self.norm1(x + self.dropout(attn_output))
        # Apply feed-forward network, residual connection, and layer normalization
        ffn_output = self.ffn(x)
        x = self.norm2(x + self.dropout(ffn_output))
        return x
The x + self.dropout(...) pattern is a residual connection, which helps gradients flow through the network and stabilizes training. The nn.LayerNorm layers normalize the outputs, a crucial step for training deep models effectively.
The SimpleModel
The SimpleModel class brings everything together. It starts with an nn.Embedding layer that converts our token IDs into dense vectors, which are then passed through a stack of TransformerBlock layers.
class SimpleModel(nn.Module):
    def __init__(self, vocab_size, embed_dim, ff_dim, num_transformer_blocks, num_classes, dropout=0.1):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.transformer_blocks = nn.ModuleList([
            TransformerBlock(embed_dim, ff_dim, dropout) for _ in range(num_transformer_blocks)
        ])
        self.output_layer = nn.Linear(embed_dim, num_classes)

    def forward(self, x):
        x = self.embedding(x)  # (batch, seq_len) -> (batch, seq_len, embed_dim)
        for block in self.transformer_blocks:
            x = block(x)
        # Classify from the final token's representation
        output = self.output_layer(x[:, -1, :])
        return output
The nn.Embedding layer is essential: it's a lookup table that maps each unique word ID to a high-dimensional vector. The final nn.Linear layer projects the last token's output from the Transformer blocks into scores for our 42 news categories.
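Putting the pieces together, instantiation looks something like this. Only vocab_size and the 42 classes come from the post; embed_dim, ff_dim, and the block count are illustrative assumptions:

# Hypothetical hyperparameters for illustration
model = SimpleModel(
    vocab_size=30000,          # matches the tokenizer's num_words
    embed_dim=128,             # assumed embedding size
    ff_dim=256,                # assumed feed-forward width
    num_transformer_blocks=2,  # assumed depth
    num_classes=42,            # one output per news category
)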
3. Training and Evaluation
With our data prepared and our model defined, we set up the training process. We use CrossEntropyLoss to measure the error and the Adam optimizer to update the model's parameters.
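A sketch of that setup (the learning rate is an assumption; the post doesn't state one):

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # lr is illustrative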
The data is loaded in batches using torch.utils.data.DataLoader. This utility simplifies iterating over our dataset, shuffling it for training, and creating mini-batches.
from torch.utils.data import TensorDataset, DataLoader

# Convert data to PyTorch tensors
X_train_tensor = torch.tensor(X_train, dtype=torch.long)
y_train_tensor = torch.tensor(y_train.values, dtype=torch.long)

# Create TensorDatasets and DataLoaders
batch_size = 64  # illustrative value; the post doesn't state the batch size used
train_dataset = TensorDataset(X_train_tensor, y_train_tensor)
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
The training loop itself is standard PyTorch practice:

1. Set the model to training mode (model.train()).
2. Move data to the GPU (.to(device)) for faster computation.
3. Zero the gradients (optimizer.zero_grad()).
4. Perform a forward pass to get predictions.
5. Calculate the loss.
6. Propagate the loss backward (loss.backward()) to compute gradients.
7. Update the weights (optimizer.step()).
After each epoch, we switch to evaluation mode (model.eval()) to calculate validation loss and accuracy without updating the weights. A minimal sketch of the full loop is shown below.
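This sketch covers one epoch, assuming the train_loader, criterion, and optimizer defined above plus a val_loader built the same way from the validation tensors (without shuffling):

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)

model.train()
for X_batch, y_batch in train_loader:
    X_batch, y_batch = X_batch.to(device), y_batch.to(device)
    optimizer.zero_grad()             # clear stale gradients
    logits = model(X_batch)           # forward pass
    loss = criterion(logits, y_batch)
    loss.backward()                   # compute gradients
    optimizer.step()                  # update weights

model.eval()
correct = total = 0
with torch.no_grad():                 # no gradient tracking during validation
    for X_batch, y_batch in val_loader:
        X_batch, y_batch = X_batch.to(device), y_batch.to(device)
        preds = model(X_batch).argmax(dim=-1)
        correct += (preds == y_batch).sum().item()
        total += y_batch.size(0)
print(f"Validation accuracy: {correct / total:.4f}")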
Our final training run achieved a validation accuracy of 60.72%. The detailed classification report showed varying performance across categories, with topics like POLITICS and ENTERTAINMENT performing well due to their larger number of training examples. This highlights the importance of addressing potential class imbalance.
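A per-category report like that can be produced with scikit-learn. A sketch, assuming hypothetical all_labels and all_preds lists collected during the validation loop, plus the category_to_id mapping from earlier:

from sklearn.metrics import classification_report

# all_labels / all_preds are assumed to be gathered during validation
print(classification_report(
    all_labels, all_preds,
    target_names=list(category_to_id.keys())
))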
You've now seen the full process—from raw data to a working classification model—and have a better understanding of the PyTorch code that powers it. This project serves as an excellent foundation for exploring more advanced language models.