This repository contains the complete workflow for an advanced Natural Language Processing (NLP) project performed on the BBC News dataset from 2004-2005. The project demonstrates a series of state-of-the-art techniques, including data-centric AI for custom Named Entity Recognition (NER) using a Large Language Model (LLM), systematic hyperparameter tuning, advanced class imbalance handling, and abstractive summarization.
## Table of Contents

- Project Goal
- Repository Structure
- Setup and Installation
- Usage: Reproducing the Results
- Methodology
- Models
- Limitations and Future Work
- Acknowledgments
## Project Goal

The primary goal of this project was to perform a multi-faceted NLP analysis on a dataset of news articles. The defined problem consisted of three core tasks:
- Advanced Classification: To classify news articles from broad categories into granular sub-categories (e.g., 'Economy', 'Cinema', 'Football') using weakly supervised labels and advanced training techniques.
- Custom Entity Extraction: To build a high-performance NER model capable of identifying and classifying media personalities into specific professional roles (e.g., Politician, Musician, Athlete).
- Conditional Summarization: To filter the dataset for all articles pertaining to events in the month of "April" and generate a concise, abstractive summary for each.
## Repository Structure

The project is organized into a modular structure for clarity and reproducibility.
```
bbc-nlp/
├── .env
├── README.md
├── archive/
├── data/
│   ├── bbc_raw/
│   └── output/
├── models/
│   ├── augmented-classifier/
│   ├── distilbert-model-v2/
│   └── roberta-model-v2/
├── params/
└── src/
    ├── classification/
    ├── ner/
    ├── preprocessing/
    └── summarization/
```
## Setup and Installation

This project is designed to be run in a containerized environment with GPU support, such as a Paperspace Gradient Notebook using the `paperspace/gradient-base:pt211-tf215-cudatk120-py311` image.
Prerequisites:
- Python 3.11
- Poetry for dependency management
- Git
- A Gemini API Key
- **Clone the Repository**

  ```bash
  git clone https://github.com/owhonda-moses/bbc-news-nlp.git
  cd bbc-nlp
  ```

- **Create Personal Access Token File**

  This project requires a GitHub Personal Access Token for the setup script. Create a file named `pat.env` in the root directory and add it to `.gitignore`:

  ```
  GITHUB_TOKEN=your_personal_access_token
  ```

- **Create Environment File**

  Create a file named `.env` in the root directory for your Gemini API key:

  ```
  GEMINI_API_KEY=your_api_key
  ```

- **Run the Setup Script**

  The `setup.sh` script automates the environment setup, including dependency installation and downloading NLP models.

  ```bash
  chmod +x setup.sh
  bash setup.sh
  ```
## Usage: Reproducing the Results

The project is broken down into a series of scripts within the `src/` subdirectories. They should be run in the following order.
### Preprocessing

This is the first step for the entire project.

```bash
python -m src.preprocessing.main_split
```

These scripts create the manually verified validation and test sets:

```bash
python -m src.preprocessing.apply_annotations
python -m src.preprocessing.ner_split
```
### Classification

This trains the sub-category text classifier used by the NER pipeline.

```bash
python -m src.classification.prepare_zeroshot
python -m src.classification.augment_data
python -m src.classification.train_augmented
```
### NER

This is the core data-centric AI workflow for the NER task.

```bash
python -m src.ner.bulkseed
python -m src.ner.scraper
python -m src.ner.merge
```

`labeler_v1` is the baseline labeler, which assigns labels purely from the knowledge base; `labeler_v2`, used below, adds LLM-based labeling.

```bash
python -m src.ner.labeler_v2
python -m src.ner.correct_ner --filter <keyword>
python -m src.ner.preprocess
python -m src.ner.train_ner
python -m src.ner.evaluate
```
### Summarization

```bash
python -m src.summarization.april_events
```
## Methodology

### Advanced Classification

The primary challenge was the lack of pre-existing sub-category labels. Our final, successful approach was:

- Weak Supervision via Zero-Shot Learning: We used a `facebook/bart-large-mnli` model to generate high-quality, zero-shot labels for our training data (see the sketch below).
- Systematic Hyperparameter Tuning: We used Optuna to perform a robust search for the optimal learning rate, weight decay, and random seed.
- Targeted Data Augmentation: We used a T5 model with context-aware prompts to generate synthetic data for under-represented classes.
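For illustration, here is a minimal sketch of the weak-supervision step using the Hugging Face `pipeline` API with `facebook/bart-large-mnli`. The candidate labels and example text are hypothetical placeholders; the project's actual label set is presumably configured in the `prepare_zeroshot` script.

```python
# Sketch: zero-shot weak supervision with facebook/bart-large-mnli.
# The candidate labels and article below are illustrative, not the project's exact setup.
from transformers import pipeline

classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

candidate_labels = ["economy", "cinema", "football"]  # hypothetical sub-categories
article = "Oil prices climbed again on Monday as markets reacted to supply concerns."

result = classifier(article, candidate_labels)

# The top-scoring label becomes the weak (noisy) training label for this article.
weak_label = result["labels"][0]
print(weak_label, result["scores"][0])
```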
### Custom Entity Extraction (NER)

This task was approached with a state-of-the-art, data-centric pipeline to create a high-quality training set.

- Knowledge Base Creation: We programmatically built a large knowledge base of over 1 million names and their professional roles by querying Wikidata (`bulkseed.py`) and scraping Wikipedia (`scraper.py`).
- Document-Level LLM Labeling: We developed a script (`labeler_v2.py`) that uses the Gemini 2.5 Flash model as a reasoning engine. For each article, it predicts the sub-category, links all mentions of the same person, and uses the full article context and the knowledge base to assign a highly accurate NER label in a single pass. The script uses a one-shot prompt to ensure reliable JSON output and can filter out non-person entities.
- Human-in-the-Loop Correction: After the automated labeling, the `correct_ner.py` script provides an interactive interface for a final, targeted review of the AI-generated labels to fix any subtle errors.
- Hybrid Augmentation: To create a balanced training set, the `preprocess.py` script applies a hybrid strategy that oversamples minority classes.
- Advanced Model Training: The final `train_ner.py` script uses several techniques, including a weighted loss function and Automatic Mixed Precision (AMP) to speed up training (see the sketch after this list).
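The snippet below is a minimal sketch of the two training techniques named above, a class-weighted cross-entropy loss and AMP, not a reproduction of `train_ner.py`; the label count, class weights, and base model are placeholders.

```python
# Sketch: weighted-loss token-classification training step with AMP.
# Label set, weights, and model are hypothetical placeholders.
import torch
from torch.cuda.amp import GradScaler, autocast
from transformers import AutoModelForTokenClassification

num_labels = 7  # e.g. O plus B-/I- tags for a few roles (hypothetical)
class_weights = torch.tensor([0.1, 1.0, 1.0, 2.0, 2.0, 3.0, 3.0])  # up-weight rare classes

device = "cuda" if torch.cuda.is_available() else "cpu"
model = AutoModelForTokenClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=num_labels
).to(device)

loss_fn = torch.nn.CrossEntropyLoss(weight=class_weights.to(device), ignore_index=-100)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)
scaler = GradScaler(enabled=device == "cuda")

def training_step(batch):
    """One AMP training step; `batch` holds input_ids, attention_mask, labels tensors."""
    optimizer.zero_grad()
    with autocast(enabled=device == "cuda"):
        logits = model(input_ids=batch["input_ids"].to(device),
                       attention_mask=batch["attention_mask"].to(device)).logits
        # Flatten token logits/labels so the weighted loss is applied per token.
        loss = loss_fn(logits.reshape(-1, num_labels),
                       batch["labels"].to(device).reshape(-1))
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
    return loss.item()
```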
### Conditional Summarization

- Filtering: We perform basic keyword filtering on the gold-standard test data to select articles containing "April".
- Extractive Step: The hybrid pipeline first identifies the most relevant sentences in each article using keyword and pattern matching.
- Abstractive Summarization: These key sentences are then fed to a pre-trained `facebook/bart-large-cnn` model to generate a concise, human-like abstractive summary. Summarizing only the extracted sentences is more robust than summarizing the entire article (a sketch of this hybrid step follows).
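The sketch below illustrates the hybrid extractive-plus-abstractive idea with `facebook/bart-large-cnn`. The keyword list, sentence splitting, and generation lengths are simplified placeholders rather than the exact logic in `april_events.py`.

```python
# Sketch: hybrid extractive + abstractive summarization.
# Keywords and length limits are illustrative placeholders.
import re
from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

def hybrid_summary(article: str, keywords=("april",), max_sentences: int = 8) -> str:
    # Extractive step: keep sentences that match any keyword (simple pattern matching).
    sentences = re.split(r"(?<=[.!?])\s+", article)
    relevant = [s for s in sentences if any(k in s.lower() for k in keywords)]
    if not relevant:  # fall back to the article opening if nothing matches
        relevant = sentences[:max_sentences]
    extract = " ".join(relevant[:max_sentences])

    # Abstractive step: summarize only the extracted sentences.
    result = summarizer(extract, max_length=80, min_length=20, do_sample=False)
    return result[0]["summary_text"]
```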
## Models

The final models are saved in the `models/` directory:

- `augmented-classifier`: The final, best-performing sub-category classifier.
- `distilbert-model-v2`: The final DistilBERT NER model.
- `roberta-model-v2`: The final RoBERTa NER model (recommended).
## Limitations and Future Work

- Upstream Model Dependence: The NER pipeline's quality depends on the performance of the sub-classification model and the initial entity recognition from spaCy.
- Knowledge Base Errors: The programmatically built knowledge base, while large, may contain some noise and errors, which are corrected over time via review.
- Future Work: The most impactful next step would be to continue the data-centric loop by using the trained v2 NER model to find more errors, correct them, and re-train a v3 model to further improve performance.
## Acknowledgments

- This project uses the BBC News Dataset, originally collected for the publication: D. Greene and P. Cunningham. "Practical Solutions to the Problem of Diagonal Dominance in Kernel Document Clustering", Proc. ICML 2006.
- The project heavily relies on the open-source work of Hugging Face, spaCy, PyTorch, and the broader Python data science community.