Building a News Classifier from Scratch with a PyTorch-Based Model

Building a News Classifier from Scratch with a Custom Transformer Model 🧠

Ever wondered how news apps categorize articles so accurately? It's often done using Transformers, a powerful neural network architecture that forms the backbone of modern language understanding. In this post, we'll build a news category classifier from the ground up, using our own custom Transformer. We'll explore the key components, prepare a real-world dataset, and train our model to classify news articles into one of 42 categories.

1. The Dataset: News Category Dataset

Our journey starts with the News Category Dataset from Kaggle, a large collection of news headlines and short descriptions. The first step is to prepare this text for our model.

We combine the headline and short_description columns into a single full_text column. We then create a numerical mapping for each unique news category.

Python
# Combine headline and short_description
df['full_text'] = df['headline'] + ' ' + df['short_description']

# Create numerical labels from categories
unique_categories = df['category'].unique().tolist()
category_to_id = {cat: i for i, cat in enumerate(unique_categories)}
df['category_id'] = df['category'].map(category_to_id)
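
To make the two steps above concrete, here is a minimal sketch on a toy DataFrame. The rows are invented for illustration, and the fillna("") calls and the id_to_category inverse mapping are assumptions added here rather than part of the original code. Filling missing values first guards against the case where either column is empty, since concatenating a string with a missing value yields a missing value.

```python
import pandas as pd

# Toy stand-in for the Kaggle data (invented rows, for illustration only).
toy_df = pd.DataFrame({
    "headline": ["Senate passes budget", "New film tops box office", "Election results delayed"],
    "short_description": ["Vote was close.", None, "Recount expected."],
    "category": ["POLITICS", "ENTERTAINMENT", "POLITICS"],
})

# Assumption: replace missing values before concatenating, then trim stray spaces.
toy_df["full_text"] = (
    toy_df["headline"].fillna("") + " " + toy_df["short_description"].fillna("")
).str.strip()

# Same mapping as in the post, plus an inverse mapping for decoding predictions.
unique_categories = toy_df["category"].unique().tolist()
category_to_id = {cat: i for i, cat in enumerate(unique_categories)}
id_to_category = {i: cat for cat, i in category_to_id.items()}
toy_df["category_id"] = toy_df["category"].map(category_to_id)

print(toy_df[["full_text", "category_id"]])
```

The inverse mapping is handy later, when predicted class indices need to be turned back into category names.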

Next, we convert the text into a numerical format using a tokenizer. We fit a Keras Tokenizer on the combined text, convert each article into a sequence of integer IDs, and then use pad_sequences to pad or truncate every sequence to a uniform length (55 tokens in our case) so that articles can be batched. Note that the preprocessing here uses Keras utilities, while the model itself is built in PyTorch.

Python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

vocab_size = 30000
tokenizer = Tokenizer(num_words=vocab_size, oov_token="<OOV>")  # "<OOV>" is an assumed placeholder for out-of-vocabulary words
tokenizer.fit_on_texts(df['full_text'])

sequences = tokenizer.texts_to_sequences(df['full_text'])
max_sequence_length = 55
padded_sequences = pad_sequences(sequences, maxlen=max_sequence_length)
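
With its default arguments, pad_sequences pads at the front and, for sequences that are too long, drops tokens from the front as well. The pure-Python sketch below mimics that default behaviour so the effect is easy to see; pad_pre is a hypothetical helper written for this illustration, not a Keras function.

```python
def pad_pre(seq, maxlen, value=0):
    """Hypothetical helper mimicking pad_sequences' default 'pre' padding and truncation."""
    seq = list(seq)[-maxlen:]                    # too long: keep only the last maxlen tokens
    return [value] * (maxlen - len(seq)) + seq   # too short: left-pad with zeros

short = pad_pre([4, 9, 2], maxlen=5)
long = pad_pre([1, 2, 3, 4, 5, 6, 7], maxlen=5)

print(short)  # [0, 0, 4, 9, 2]
print(long)   # [3, 4, 5, 6, 7]
```

Front padding matters for this project because the model later reads only the last sequence position, which with this scheme is always a real token for any non-empty article.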

Finally, we split the padded sequences and their corresponding category IDs into training and validation sets using train_test_split from scikit-learn, which is crucial for evaluating our model's performance on unseen data.
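
The post does not show the split itself, so the following is only a sketch of how it might look. The toy arrays stand in for padded_sequences and the category IDs, and the 80/20 ratio, the fixed random_state, and the stratify argument are all assumptions rather than the settings actually used.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-ins for padded_sequences and df['category_id'].
rng = np.random.default_rng(0)
toy_sequences = rng.integers(0, 100, size=(20, 55))
toy_labels = np.array([0, 1] * 10)

# Assumptions: 80/20 split, fixed seed, class proportions preserved via stratify.
X_train, X_val, y_train, y_val = train_test_split(
    toy_sequences, toy_labels, test_size=0.2, random_state=42, stratify=toy_labels
)

print(X_train.shape, X_val.shape)  # (16, 55) (4, 55)
```

Stratifying keeps the category proportions similar in both sets, which is worth considering when some categories are much rarer than others.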

2. The Model Architecture: A Custom Transformer

Our custom model is built in PyTorch and consists of three main components: a custom attention layer, a Transformer block, and the final classification model.

The Custom Attention Layer

The CustomAttention class is the heart of the Transformer. It allows the model to weigh the importance of different words in a sequence.

Python
import math

import torch
import torch.nn as nn
import torch.nn.functional as F

class CustomAttention(nn.Module):
    def __init__(self, embed_dim):
        super().__init__()
        self.embed_dim = embed_dim  # needed for the scaling factor in forward()
        self.query = nn.Linear(embed_dim, embed_dim)
        self.key = nn.Linear(embed_dim, embed_dim)
        self.value = nn.Linear(embed_dim, embed_dim)

    def forward(self, x):
        # x has shape (batch, seq_len, embed_dim)
        q = self.query(x)
        k = self.key(x)
        v = self.value(x)

        # Scaled dot-product scores, shape (batch, seq_len, seq_len)
        attn_scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(self.embed_dim)

        # Apply softmax to get attention weights
        attn_weights = F.softmax(attn_scores, dim=-1)

        # Apply weights to values
        output = torch.matmul(attn_weights, v)
        return output, attn_weights

This module uses three nn.Linear layers to project the input into query, key, and value vectors. The core attention calculation is a torch.matmul between the queries and the transposed keys, divided by the square root of embed_dim to keep the scores in a range where softmax does not saturate. This produces a matrix of scores measuring the relevance of each token to every other token. F.softmax then turns each row of scores into a probability distribution, which is used to form a weighted sum of the value vectors, producing the final output.
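
The shapes involved are easier to follow with a small numeric check. The sketch below repeats the same computation on random tensors standing in for the projected queries, keys, and values; the batch size, sequence length, and embedding size are arbitrary values chosen for illustration.

```python
import math

import torch
import torch.nn.functional as F

torch.manual_seed(0)
batch, seq_len, embed_dim = 2, 4, 8  # arbitrary illustrative sizes

# Random stand-ins for the outputs of the query, key, and value projections.
q = torch.randn(batch, seq_len, embed_dim)
k = torch.randn(batch, seq_len, embed_dim)
v = torch.randn(batch, seq_len, embed_dim)

attn_scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(embed_dim)
attn_weights = F.softmax(attn_scores, dim=-1)
output = torch.matmul(attn_weights, v)

print(attn_scores.shape)  # torch.Size([2, 4, 4])
print(output.shape)       # torch.Size([2, 4, 8])
print(attn_weights.sum(dim=-1))  # every row sums to 1, up to rounding
```

Each token gets one row of weights over all tokens in the same sequence, and the output keeps the input's shape, which is what allows the residual connection in the next section.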

The Transformer Block

The TransformerBlock is a modular unit that combines the attention mechanism with a feed-forward network, residual connections, and layer normalization.

Python
class TransformerBlock(nn.Module):
    def __init__(self, embed_dim, ff_dim, dropout=0.1):
        super().__init__()
        self.attention = CustomAttention(embed_dim)
        self.norm1 = nn.LayerNorm(embed_dim)
        self.norm2 = nn.LayerNorm(embed_dim)
        self.ffn = nn.Sequential(...)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # Apply attention, residual connection, and layer normalization
        attn_output, _ = self.attention(x)
        x = self.norm1(x + self.dropout(attn_output))

        # Apply feed-forward network, residual connection, and layer normalization
        ffn_output = self.ffn(x)
        x = self.norm2(x + self.dropout(ffn_output))
        return x

The x + self.dropout(...) pattern represents a residual connection, which helps gradients flow through the network and stabilizes training. The nn.LayerNorm layers normalize the outputs, a crucial step for training deep models effectively.
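
The feed-forward network is elided in the listing above, so the sketch below uses a common two-layer layout (Linear, ReLU, Linear) purely as an assumption; the actual definition may differ. What matters for the residual connection is that the sub-layer maps back to embed_dim, so that its output can be added to x.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
embed_dim, ff_dim = 8, 32  # arbitrary illustrative sizes

# Hypothetical feed-forward sub-layer: expand to ff_dim, then project back to embed_dim.
ffn = nn.Sequential(
    nn.Linear(embed_dim, ff_dim),
    nn.ReLU(),
    nn.Linear(ff_dim, embed_dim),
)
norm = nn.LayerNorm(embed_dim)
dropout = nn.Dropout(0.1)

x = torch.randn(2, 5, embed_dim)
out = norm(x + dropout(ffn(x)))  # residual connection followed by layer normalization

print(out.shape)          # torch.Size([2, 5, 8])
print(out.mean(dim=-1))   # close to 0 for every token at initialization
```

At initialization, LayerNorm leaves each token vector with roughly zero mean and unit variance, which is the stabilizing effect described above.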

The SimpleModel

The SimpleModel brings everything together. It starts with an nn.Embedding layer that converts our token IDs into dense vectors. These vectors are then passed through a series of TransformerBlock layers.

Python
class SimpleModel(nn.Module):
    def __init__(self, vocab_size, embed_dim, ff_dim, num_transformer_blocks, num_classes, dropout=0.1):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.transformer_blocks = nn.ModuleList([
            TransformerBlock(embed_dim, ff_dim, dropout) for _ in range(num_transformer_blocks)
        ])
        self.output_layer = nn.Linear(embed_dim, num_classes)

    def forward(self, x):
        x = self.embedding(x)
        for block in self.transformer_blocks:
            x = block(x)
        output = self.output_layer(x[:, -1, :])
        return output

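The nn.Embedding layer is essential: it is a lookup table that maps each token ID to a dense vector of size embed_dim. The final nn.Linear layer projects the output at the last sequence position into one score (logit) per class, 42 in our case. Because pad_sequences pads at the front by default, that last position holds the final real token of each article.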
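
A quick shape trace shows how a batch of token IDs becomes class scores. This sketch skips the Transformer blocks, which preserve the shape of their input, and uses small arbitrary sizes apart from the 42 classes and the sequence length of 55.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab_size, embed_dim, num_classes = 100, 8, 42  # toy vocabulary and embedding sizes

embedding = nn.Embedding(vocab_size, embed_dim)
output_layer = nn.Linear(embed_dim, num_classes)

token_ids = torch.randint(0, vocab_size, (3, 55))  # (batch, seq_len)
embedded = embedding(token_ids)                    # (batch, seq_len, embed_dim)
last_token = embedded[:, -1, :]                    # (batch, embed_dim)
logits = output_layer(last_token)                  # (batch, num_classes)
predicted = logits.argmax(dim=1)                   # one class index per article

print(embedded.shape, last_token.shape, logits.shape)
```

The logits are unnormalized scores; CrossEntropyLoss applies the softmax internally during training, so the model does not need one in its forward pass.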

3. Training and Evaluation

With our data prepared and our model defined, we set up the training process. We use CrossEntropyLoss to measure the error and the Adam optimizer to update the model's parameters.

The data is loaded in batches using torch.utils.data.DataLoader. This utility simplifies iterating over our dataset, shuffling it for training, and creating mini-batches.

Python
import torch
from torch.utils.data import TensorDataset, DataLoader

# Convert data to PyTorch tensors
X_train_tensor = torch.tensor(X_train, dtype=torch.long)
y_train_tensor = torch.tensor(y_train.values, dtype=torch.long)

# Create TensorDatasets and DataLoaders
train_dataset = TensorDataset(X_train_tensor, y_train_tensor)
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
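
To see what the loader yields, the sketch below iterates over a toy dataset. The tensor sizes and the batch size of 4 are arbitrary choices for illustration, not the values used in the actual training run.

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

torch.manual_seed(0)

# Toy stand-ins for X_train_tensor and y_train_tensor.
X = torch.randint(0, 100, (10, 55))
y = torch.randint(0, 42, (10,))

loader = DataLoader(TensorDataset(X, y), batch_size=4, shuffle=True)

# Each iteration yields one (inputs, labels) pair.
batch_shapes = [(tuple(xb.shape), tuple(yb.shape)) for xb, yb in loader]
print(batch_shapes)  # [((4, 55), (4,)), ((4, 55), (4,)), ((2, 55), (2,))]
```

The last batch is smaller because 10 is not a multiple of 4. For a validation loader, shuffle=False is the usual choice since the order of examples does not affect the metrics.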

The training loop itself is standard PyTorch practice:

Set the model to training mode (model.train()).
Move each batch to the target device (.to(device)), typically a GPU, for faster computation.
Zero the gradients (optimizer.zero_grad()).
Perform a forward pass to get predictions.
Calculate the loss.
Propagate the loss backward (loss.backward()) to compute gradients.
Update the weights (optimizer.step()).

After each epoch, we switch to evaluation mode (model.eval()) to calculate validation loss and accuracy without updating the weights.
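
The steps above can be sketched as a runnable loop. To keep the example self-contained, it trains a deliberately tiny placeholder model (TinyClassifier, a hypothetical stand-in for SimpleModel) on random data; the learning rate, batch size, epoch count, and the make_loader helper are likewise assumptions made for illustration.

```python
import torch
import torch.nn as nn
from torch.utils.data import TensorDataset, DataLoader

torch.manual_seed(0)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

class TinyClassifier(nn.Module):
    """Placeholder model (embedding + last-token readout), not the post's SimpleModel."""
    def __init__(self, vocab_size=100, embed_dim=16, num_classes=4):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.output_layer = nn.Linear(embed_dim, num_classes)

    def forward(self, x):
        return self.output_layer(self.embedding(x)[:, -1, :])

def make_loader(n, shuffle):
    """Hypothetical helper that builds a loader over random toy data."""
    X = torch.randint(0, 100, (n, 55))
    y = torch.randint(0, 4, (n,))
    return DataLoader(TensorDataset(X, y), batch_size=16, shuffle=shuffle)

train_loader, val_loader = make_loader(64, True), make_loader(32, False)
model = TinyClassifier().to(device)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(2):
    model.train()                                  # training mode (enables dropout)
    for xb, yb in train_loader:
        xb, yb = xb.to(device), yb.to(device)      # move the batch to the device
        optimizer.zero_grad()                      # clear old gradients
        loss = criterion(model(xb), yb)            # forward pass and loss
        loss.backward()                            # compute gradients
        optimizer.step()                           # update weights

    model.eval()                                   # evaluation mode (disables dropout)
    val_loss, correct, total = 0.0, 0, 0
    with torch.no_grad():                          # no gradients needed for validation
        for xb, yb in val_loader:
            xb, yb = xb.to(device), yb.to(device)
            logits = model(xb)
            val_loss += criterion(logits, yb).item() * yb.size(0)
            correct += (logits.argmax(dim=1) == yb).sum().item()
            total += yb.size(0)
    val_loss /= total
    val_acc = correct / total
    print(f"epoch {epoch + 1}: val_loss={val_loss:.4f} val_acc={val_acc:.2%}")
```

Because the toy labels are random, the accuracy here is meaningless; the point is only the order of the calls and the use of torch.no_grad() during validation.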

Our final training run achieved a validation accuracy of 60.72%. The detailed classification report showed varying performance across categories: topics such as POLITICS and ENTERTAINMENT performed well, most likely because they have more training examples. This highlights the importance of addressing class imbalance.
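
One common way to address imbalance, offered here as a suggestion rather than something done in this project, is to weight the loss by inverse class frequency so that rare categories contribute more per example. The sketch below computes such weights for a toy label tensor; the label values and the four-class setup are invented for illustration.

```python
import torch
import torch.nn as nn

num_classes = 4
# Toy labels: class 0 is common, classes 2 and 3 are rare.
y_train_toy = torch.tensor([0, 0, 0, 0, 0, 0, 1, 1, 2, 3])

counts = torch.bincount(y_train_toy, minlength=num_classes).float()

# Inverse-frequency weights: n_samples / (n_classes * count); clamp avoids division by zero.
class_weights = counts.sum() / (num_classes * counts.clamp(min=1))

criterion = nn.CrossEntropyLoss(weight=class_weights)
print(class_weights)  # roughly [0.4167, 1.25, 2.5, 2.5]
```

Whether this improves per-category results on the news data would need to be checked against the classification report; resampling the training set is another option.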

You've now seen the full process, from raw data to a working classification model, and have a better understanding of the PyTorch code that powers it. This project serves as an excellent foundation for exploring more advanced language models.

Reference:

GitHub repository for the full code
