
Building a News Classifier from Scratch with a PyTorch-Based Model


Building a News Classifier from Scratch with a Custom Transformer Model 🧠

Ever wondered how news apps categorize articles so accurately? It's often done using Transformers, a powerful neural network architecture that forms the backbone of modern language understanding. In this post, we'll build a news category classifier from the ground up, using our own custom Transformer. We'll explore the key components, prepare a real-world dataset, and train our model to classify news articles into one of 42 categories.


1. The Dataset: News Category Dataset

Our journey starts with the News Category Dataset from Kaggle, a large collection of news headlines and short descriptions. The first step is to prepare this text for our model.

We combine the headline and short_description columns into a single full_text column. We then create a numerical mapping for each unique news category.

Python
# Combine headline and short_description
df['full_text'] = df['headline'] + ' ' + df['short_description']

# Create numerical labels from categories
unique_categories = df['category'].unique().tolist()
category_to_id = {cat: i for i, cat in enumerate(unique_categories)}
df['category_id'] = df['category'].map(category_to_id)

Next, we convert the text into a numerical format using a tokenizer. We fit a Tokenizer on our combined text, convert each article into a sequence of numerical IDs, and then use pad_sequences to ensure every sequence has a uniform length (in our case, 55 tokens) for batch processing.

Python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

vocab_size = 30000
# "<OOV>" is the conventional placeholder for out-of-vocabulary words
tokenizer = Tokenizer(num_words=vocab_size, oov_token="<OOV>")
tokenizer.fit_on_texts(df['full_text'])

sequences = tokenizer.texts_to_sequences(df['full_text'])
max_sequence_length = 55
padded_sequences = pad_sequences(sequences, maxlen=max_sequence_length)
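To make the tokenize-and-pad step more concrete, here is a minimal pure-Python sketch of the same idea. It is a simplification, not the Keras implementation: the helper names (fit_vocab, encode, pad_pre) are hypothetical, and it splits on whitespace only, whereas the Keras Tokenizer also filters punctuation. It does mirror two Keras defaults that matter later: index 0 is reserved for padding, and pad_sequences pads and truncates at the front of the sequence.

```python
from collections import Counter

def fit_vocab(texts, num_words):
    """Rank words by frequency; 0 is reserved for padding, 1 for the OOV token."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # Keep the most frequent words, leaving room for the two reserved indices.
    kept = [word for word, _ in counts.most_common(num_words - 2)]
    return {word: index for index, word in enumerate(kept, start=2)}

def encode(text, vocab, oov_index=1):
    """Map each word to its ID, falling back to the OOV index for unknown words."""
    return [vocab.get(word, oov_index) for word in text.lower().split()]

def pad_pre(sequence, maxlen, value=0):
    """Keep the last `maxlen` IDs and left-pad shorter sequences with `value`."""
    trimmed = sequence[-maxlen:]
    return [value] * (maxlen - len(trimmed)) + trimmed

texts = ["stocks rally as markets open", "markets fall"]
vocab = fit_vocab(texts, num_words=10)
padded = [pad_pre(encode(text, vocab), maxlen=4) for text in texts]
print(padded)
```

Every row ends up with the same length, and short texts carry their zeros on the left, so the final position of each row is a real token.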

Finally, we split the padded sequences and their corresponding category IDs into training and validation sets using train_test_split from Scikit-learn, which is crucial for evaluating our model's performance on unseen data.
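The split can be sketched as follows. The toy lists stand in for padded_sequences and df['category_id'], and the test_size and random_state values are assumptions chosen for illustration; the post does not state which values were used.

```python
from sklearn.model_selection import train_test_split

# Toy stand-ins for padded_sequences and df['category_id'].
features = [[0, 0, i, i + 1] for i in range(10)]
labels = [i % 2 for i in range(10)]

# Hold out 20% of the rows for validation (assumed ratio).
X_train, X_val, y_train, y_val = train_test_split(
    features, labels, test_size=0.2, random_state=42
)

print(len(X_train), len(X_val))
```

Fixing random_state makes the split reproducible across runs, which keeps validation scores comparable when you change the model.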


2. The Model Architecture: A Custom Transformer

Our custom model is built in PyTorch and consists of three main components: a custom attention layer, a Transformer block, and the final classification model.

The Custom Attention Layer

The CustomAttention class is the heart of the Transformer. It allows the model to weigh the importance of different words in a sequence.

Python
import math

import torch
import torch.nn as nn
import torch.nn.functional as F

class CustomAttention(nn.Module):
    def __init__(self, embed_dim):
        super().__init__()
        # Stored so forward() can scale the scores by sqrt(embed_dim)
        self.embed_dim = embed_dim
        self.query = nn.Linear(embed_dim, embed_dim)
        self.key = nn.Linear(embed_dim, embed_dim)
        self.value = nn.Linear(embed_dim, embed_dim)

    def forward(self, x):
        q = self.query(x)
        k = self.key(x)
        v = self.value(x)

        # Calculate scaled attention scores using matrix multiplication
        attn_scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(self.embed_dim)

        # Apply softmax to get attention weights
        attn_weights = F.softmax(attn_scores, dim=-1)

        # Apply weights to values
        output = torch.matmul(attn_weights, v)
        return output, attn_weights

This module uses three nn.Linear layers to project the input into query, key, and value vectors. The core attention calculation is a torch.matmul between the queries and the transposed keys, divided by the square root of the embedding dimension to keep the scores in a stable range. This creates a matrix of scores that show the relevance of each token to every other token. An F.softmax function then turns each row of scores into a probability distribution, which is used to create a weighted sum of the value vectors, producing the final output.
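A quick way to check the attention math is to run it on random tensors and inspect the shapes. This sketch skips the learned projections and uses random q, k, and v directly; the sizes are arbitrary values chosen for illustration.

```python
import math

import torch
import torch.nn.functional as F

torch.manual_seed(0)
batch_size, seq_len, embed_dim = 2, 5, 8  # illustrative sizes

q = torch.randn(batch_size, seq_len, embed_dim)
k = torch.randn(batch_size, seq_len, embed_dim)
v = torch.randn(batch_size, seq_len, embed_dim)

# One score per (query token, key token) pair: (batch, seq_len, seq_len)
scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(embed_dim)

# Each row becomes a probability distribution over the key tokens
weights = F.softmax(scores, dim=-1)

# Weighted sum of value vectors: back to (batch, seq_len, embed_dim)
output = torch.matmul(weights, v)

print(scores.shape, output.shape)
print(weights.sum(dim=-1))
```

Each row of weights sums to 1, and the output has the same shape as the input, which is what allows the residual connection in the Transformer block.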

The Transformer Block

The TransformerBlock is a modular unit that combines the attention mechanism with a feed-forward network, residual connections, and layer normalization.

Python
class TransformerBlock(nn.Module):
    def __init__(self, embed_dim, ff_dim, dropout=0.1):
        super().__init__()
        self.attention = CustomAttention(embed_dim)
        self.norm1 = nn.LayerNorm(embed_dim)
        self.norm2 = nn.LayerNorm(embed_dim)
        self.ffn = nn.Sequential(...)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # Apply attention, residual connection, and layer normalization
        attn_output, _ = self.attention(x)
        x = self.norm1(x + self.dropout(attn_output))

        # Apply feed-forward network, residual connection, and layer normalization
        ffn_output = self.ffn(x)
        x = self.norm2(x + self.dropout(ffn_output))
        return x

The x + self.dropout(...) pattern represents a residual connection, which helps gradients flow through the network and stabilizes training. The nn.LayerNorm layers normalize the outputs, a crucial step for training deep models effectively.
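The contents of the feed-forward network are elided in the snippet above. As a hedged illustration only, a common choice is two linear layers with a non-linearity in between; the make_ffn helper and the ReLU activation are assumptions for this sketch, not necessarily what the original code uses.

```python
import torch
import torch.nn as nn

def make_ffn(embed_dim, ff_dim):
    """Hypothetical position-wise feed-forward network (assumed layout)."""
    return nn.Sequential(
        nn.Linear(embed_dim, ff_dim),   # expand to the hidden size
        nn.ReLU(),                      # assumed activation
        nn.Linear(ff_dim, embed_dim),   # project back to embed_dim
    )

ffn = make_ffn(embed_dim=8, ff_dim=32)
x = torch.randn(2, 5, 8)  # (batch, seq_len, embed_dim)
y = ffn(x)

print(y.shape)
```

Whatever layers are used, the network has to map back to embed_dim so that x + ffn_output is a valid residual sum.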

The SimpleModel

The SimpleModel brings everything together. It starts with an nn.Embedding layer that converts our token IDs into dense vectors. These vectors are then passed through a series of TransformerBlock layers.

Python
class SimpleModel(nn.Module):
    def __init__(self, vocab_size, embed_dim, ff_dim, num_transformer_blocks, num_classes, dropout=0.1):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.transformer_blocks = nn.ModuleList([
            TransformerBlock(embed_dim, ff_dim, dropout) for _ in range(num_transformer_blocks)
        ])
        self.output_layer = nn.Linear(embed_dim, num_classes)

    def forward(self, x):
        x = self.embedding(x)
        for block in self.transformer_blocks:
            x = block(x)
        output = self.output_layer(x[:, -1, :])
        return output

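The nn.Embedding layer is essential; it's a lookup table that maps each unique word ID to a dense vector of size embed_dim. The final nn.Linear layer projects the last token's output from the Transformer blocks into our 42 news categories. Because pad_sequences pads on the left by default, the last position of a non-empty article holds its final real token rather than padding.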
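The shapes involved can be traced with a small example. This sketch leaves out the Transformer blocks, which preserve the shape of their input, and uses toy values for vocab_size and embed_dim.

```python
import torch
import torch.nn as nn

vocab_size, embed_dim, num_classes = 100, 8, 42  # toy sizes except num_classes

embedding = nn.Embedding(vocab_size, embed_dim)
output_layer = nn.Linear(embed_dim, num_classes)

# Two left-padded sequences of token IDs: (batch=2, seq_len=4)
token_ids = torch.tensor([[0, 0, 12, 7], [5, 9, 3, 1]])

embedded = embedding(token_ids)      # (2, 4, 8): one vector per token
last_token = embedded[:, -1, :]      # (2, 8): vector at the final position
logits = output_layer(last_token)    # (2, 42): one score per category
predicted = logits.argmax(dim=-1)    # (2,): predicted category ID per article

print(embedded.shape, last_token.shape, logits.shape)
```

The logits are raw scores, not probabilities; CrossEntropyLoss applies the softmax internally during training.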


3. Training and Evaluation

With our data prepared and our model defined, we set up the training process. We use CrossEntropyLoss to measure the error and the Adam optimizer to update the model's parameters.

The data is loaded in batches using torch.utils.data.DataLoader. This utility simplifies iterating over our dataset, shuffling it for training, and creating mini-batches.

Python
import torch
from torch.utils.data import TensorDataset, DataLoader

# Convert data to PyTorch tensors
X_train_tensor = torch.tensor(X_train, dtype=torch.long)
y_train_tensor = torch.tensor(y_train.values, dtype=torch.long)

# Create TensorDatasets and DataLoaders
train_dataset = TensorDataset(X_train_tensor, y_train_tensor)
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)

The training loop itself is standard PyTorch practice:

  1. Set the model to training mode (model.train()).

  2. Move data to the GPU (.to(device)) for faster computation.

  3. Zero the gradients (optimizer.zero_grad()).

  4. Perform a forward pass to get predictions.

  5. Calculate the loss.

  6. Propagate the loss backward (loss.backward()) to compute gradients.

  7. Update the weights (optimizer.step()).

After each epoch, we switch to evaluation mode (model.eval()) to calculate validation loss and accuracy without updating the weights.
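The steps above can be sketched as two small functions. This is a minimal illustration under stated assumptions, not the code from the original project: the function names, the toy model, the random data, and the learning rate are all placeholders.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

def train_one_epoch(model, loader, criterion, optimizer, device):
    model.train()                                  # step 1: training mode
    total_loss = 0.0
    for inputs, targets in loader:
        inputs, targets = inputs.to(device), targets.to(device)  # step 2
        optimizer.zero_grad()                      # step 3: clear old gradients
        logits = model(inputs)                     # step 4: forward pass
        loss = criterion(logits, targets)          # step 5: compute the loss
        loss.backward()                            # step 6: backpropagate
        optimizer.step()                           # step 7: update the weights
        total_loss += loss.item() * inputs.size(0)
    return total_loss / len(loader.dataset)

@torch.no_grad()  # no gradients are needed during evaluation
def evaluate(model, loader, criterion, device):
    model.eval()
    total_loss, correct = 0.0, 0
    for inputs, targets in loader:
        inputs, targets = inputs.to(device), targets.to(device)
        logits = model(inputs)
        total_loss += criterion(logits, targets).item() * inputs.size(0)
        correct += (logits.argmax(dim=-1) == targets).sum().item()
    n = len(loader.dataset)
    return total_loss / n, correct / n

# Toy setup: a tiny stand-in model and random data (placeholders only)
torch.manual_seed(0)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
toy_model = nn.Sequential(
    nn.Embedding(50, 8), nn.Flatten(), nn.Linear(8 * 4, 3)
).to(device)
inputs = torch.randint(0, 50, (32, 4))
targets = torch.randint(0, 3, (32,))
loader = DataLoader(TensorDataset(inputs, targets), batch_size=8, shuffle=True)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(toy_model.parameters(), lr=1e-3)  # assumed lr

train_loss = train_one_epoch(toy_model, loader, criterion, optimizer, device)
val_loss, val_acc = evaluate(toy_model, loader, criterion, device)
print(train_loss, val_loss, val_acc)
```

In a real run you would pass separate training and validation loaders and call both functions once per epoch.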

Our final training run achieved a validation accuracy of 60.72%. The detailed classification report showed varying performance across categories, with topics like POLITICS and ENTERTAINMENT performing well, most likely because they have more training examples. This highlights the importance of addressing potential class imbalance.
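One common way to counter class imbalance, shown here only as a possible next step and not something the original project did, is to weight each class inversely to its frequency. The helper name below is hypothetical; the resulting list could be converted to a tensor and passed to the weight argument of nn.CrossEntropyLoss.

```python
from collections import Counter

def inverse_frequency_weights(labels, num_classes):
    """Weight each class by total / (num_classes * count); absent classes get 0.0."""
    counts = Counter(labels)
    total = len(labels)
    return [
        total / (num_classes * counts[c]) if counts[c] else 0.0
        for c in range(num_classes)
    ]

# Toy labels: class 0 is common, class 2 is rare
labels = [0, 0, 0, 0, 0, 0, 1, 1, 1, 2]
weights = inverse_frequency_weights(labels, num_classes=3)
print(weights)
```

Rare classes receive larger weights, so their errors contribute more to the loss.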

You've now seen the full process—from raw data to a working classification model—and have a better understanding of the PyTorch code that powers it. This project serves as an excellent foundation for exploring more advanced language models.

Reference:

GitHub for full code
