This repository contains the complete workflow for an advanced Natural Language Processing (NLP) project performed on the BBC News dataset from 2004-2005. The project demonstrates a series of state-of-the-art techniques, including data-centric AI for custom Named Entity Recognition (NER) using a Large Language Model (LLM), systematic hyperparameter tuning, advanced class imbalance handling, and abstractive summarization.
## Table of Contents

- Project Goal
- Repository Structure
- Setup and Installation
- Usage: Reproducing the Results
- Methodology
- Models
- Limitations and Future Work
- Acknowledgments
## Project Goal

The primary goal of this project was to perform a multi-faceted NLP analysis on a dataset of news articles. The defined problem consisted of three core tasks:
- Advanced Classification: To classify news articles from broad categories into granular sub-categories (e.g., 'Economy', 'Cinema', 'Football') using weakly supervised labels and advanced training techniques.
- Custom Entity Extraction: To build a high-performance NER model capable of identifying and classifying media personalities into specific professional roles (e.g., Politician, Musician, Athlete).
- Conditional Summarization: To filter the dataset for all articles pertaining to events in the month of "April" and generate a concise, abstractive summary for each.
## Repository Structure

The project is organized into a modular structure for clarity and reproducibility.
```
bbc-nlp/
├── .env
├── README.md
├── archive/
├── data/
│   ├── bbc_raw/
│   └── output/
├── models/
│   ├── augmented-classifier/
│   ├── distilbert-model-v2/
│   └── roberta-model-v2/
├── params/
└── src/
    ├── classification/
    ├── ner/
    ├── preprocessing/
    └── summarization/
```
## Setup and Installation

This project is designed to be run in a containerized environment with GPU support, such as a Paperspace Gradient Notebook using the `paperspace/gradient-base:pt211-tf215-cudatk120-py311` image.
Prerequisites:
- Python 3.11
- Poetry for dependency management
- Git
- A Gemini API Key
- **Clone the Repository**

  ```bash
  git clone https://github.com/owhonda-moses/bbc-news-nlp.git
  cd bbc-nlp
  ```

- **Create Personal Access Token File**

  This project requires a GitHub Personal Access Token for the setup script. Create a file named `pat.env` in the root directory and add it to `.gitignore`:

  ```
  GITHUB_TOKEN=your_personal_access_token
  ```

- **Create Environment File**

  Create a file named `.env` in the root directory for your Gemini API key:

  ```
  GEMINI_API_KEY=your_api_key
  ```

- **Run the Setup Script**

  The `setup.sh` script automates the environment setup, including dependency installation and downloading NLP models.

  ```bash
  chmod +x setup.sh
  bash setup.sh
  ```
## Usage: Reproducing the Results

The project is broken down into a series of scripts within the `src/` subdirectories. They should be run in the following order.
### Preprocessing

This is the first step for the entire project.

```bash
python -m src.preprocessing.main_split
```

These scripts create the manually verified validation and test sets:

```bash
python -m src.preprocessing.apply_annotations
python -m src.preprocessing.ner_split
```
### Classification

This trains the sub-category text classifier used by the NER pipeline.

```bash
python -m src.classification.prepare_zeroshot
python -m src.classification.augment_data
python -m src.classification.train_augmented
```
### NER

This is the core data-centric AI workflow for the NER task.

```bash
python -m src.ner.bulkseed
python -m src.ner.scraper
python -m src.ner.merge
```

`labeler_v1` is the baseline labeler, which assigns labels purely from the knowledge base; `labeler_v2`, used below, adds LLM-based labeling.

```bash
python -m src.ner.labeler_v2
python -m src.ner.correct_ner --filter <keyword>
python -m src.ner.preprocess
python -m src.ner.train_ner
python -m src.ner.evaluate
```
### Summarization

```bash
python -m src.summarization.april_events
```
## Methodology

### Advanced Classification

The primary challenge was the lack of pre-existing sub-category labels. Our final, successful approach was:

- Weak Supervision via Zero-Shot Learning: We used a `facebook/bart-large-mnli` model to generate high-quality, zero-shot labels for our training data (see the sketch below).
- Systematic Hyperparameter Tuning: We used Optuna to perform a robust search for the optimal learning rate, weight decay, and random seed.
- Targeted Data Augmentation: We used a T5 model with context-aware prompts to generate synthetic data for under-represented classes.
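For illustration, here is a minimal sketch of the weak-supervision step using the Hugging Face `pipeline` API with `facebook/bart-large-mnli`. The candidate labels and example text are hypothetical placeholders; the project's actual label set is presumably configured in the `prepare_zeroshot` script.

```python
# Sketch: zero-shot weak supervision with facebook/bart-large-mnli.
# The candidate labels and article below are illustrative, not the project's exact setup.
from transformers import pipeline

classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

candidate_labels = ["economy", "cinema", "football"]  # hypothetical sub-categories
article = "Oil prices climbed again on Monday as markets reacted to supply concerns."

result = classifier(article, candidate_labels)

# The top-scoring label becomes the weak (noisy) training label for this article.
weak_label = result["labels"][0]
print(weak_label, result["scores"][0])
```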
### Custom Entity Extraction (NER)

This task was approached with a state-of-the-art, data-centric pipeline to create a high-quality training set.

- Knowledge Base Creation: We programmatically built a large knowledge base of over 1 million names and their professional roles by querying Wikidata (`bulkseed.py`) and scraping Wikipedia (`scraper.py`).
- Document-Level LLM Labeling: We developed a script (`labeler_v2.py`) that uses the Gemini 2.5 Flash model as a reasoning engine. For each article, it predicts the sub-category, links all mentions of the same person, and uses the full article context and the knowledge base to assign a highly accurate NER label in a single pass. The script uses a one-shot prompt to ensure reliable JSON output and can filter out non-person entities.
- Human-in-the-Loop Correction: After the automated labeling, the `correct_ner.py` script provides an interactive interface for a final, targeted review of the AI-generated labels to fix any subtle errors.
- Hybrid Augmentation: To create a balanced training set, the `preprocess.py` script applies a hybrid strategy that oversamples minority classes.
- Advanced Model Training: The final `train_ner.py` script uses several techniques, including a weighted loss function and Automatic Mixed Precision (AMP) to speed up training (see the sketch after this list).
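The snippet below is a minimal sketch of the two training techniques named above, a class-weighted cross-entropy loss and AMP, not a reproduction of `train_ner.py`; the label count, class weights, and base model are placeholders.

```python
# Sketch: weighted-loss token-classification training step with AMP.
# Label set, weights, and model are hypothetical placeholders.
import torch
from torch.cuda.amp import GradScaler, autocast
from transformers import AutoModelForTokenClassification

num_labels = 7  # e.g. O plus B-/I- tags for a few roles (hypothetical)
class_weights = torch.tensor([0.1, 1.0, 1.0, 2.0, 2.0, 3.0, 3.0])  # up-weight rare classes

device = "cuda" if torch.cuda.is_available() else "cpu"
model = AutoModelForTokenClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=num_labels
).to(device)

loss_fn = torch.nn.CrossEntropyLoss(weight=class_weights.to(device), ignore_index=-100)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)
scaler = GradScaler(enabled=device == "cuda")

def training_step(batch):
    """One AMP training step; `batch` holds input_ids, attention_mask, labels tensors."""
    optimizer.zero_grad()
    with autocast(enabled=device == "cuda"):
        logits = model(input_ids=batch["input_ids"].to(device),
                       attention_mask=batch["attention_mask"].to(device)).logits
        # Flatten token logits/labels so the weighted loss is applied per token.
        loss = loss_fn(logits.reshape(-1, num_labels),
                       batch["labels"].to(device).reshape(-1))
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
    return loss.item()
```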
### Conditional Summarization

- Filtering: We perform basic keyword filtering on the gold-standard test data to select articles containing "April".
- Extractive Step: The hybrid pipeline first identifies the most relevant sentences in each article using keyword and pattern matching.
- Abstractive Summarization: These key sentences are then fed to a pre-trained `facebook/bart-large-cnn` model to generate a concise, human-like abstractive summary. Summarizing only the extracted sentences is more robust than summarizing the entire article (a sketch of this hybrid step follows).
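The sketch below illustrates the hybrid extractive-plus-abstractive idea with `facebook/bart-large-cnn`. The keyword list, sentence splitting, and generation lengths are simplified placeholders rather than the exact logic in `april_events.py`.

```python
# Sketch: hybrid extractive + abstractive summarization.
# Keywords and length limits are illustrative placeholders.
import re
from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

def hybrid_summary(article: str, keywords=("april",), max_sentences: int = 8) -> str:
    # Extractive step: keep sentences that match any keyword (simple pattern matching).
    sentences = re.split(r"(?<=[.!?])\s+", article)
    relevant = [s for s in sentences if any(k in s.lower() for k in keywords)]
    if not relevant:  # fall back to the article opening if nothing matches
        relevant = sentences[:max_sentences]
    extract = " ".join(relevant[:max_sentences])

    # Abstractive step: summarize only the extracted sentences.
    result = summarizer(extract, max_length=80, min_length=20, do_sample=False)
    return result[0]["summary_text"]
```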
## Models

The final models are saved in the `models/` directory:

- `augmented-classifier`: The final, best-performing sub-category classifier.
- `distilbert-model-v2`: The final DistilBERT NER model.
- `roberta-model-v2`: The final RoBERTa NER model (recommended).
## Limitations and Future Work

- Upstream Model Dependence: The NER pipeline's quality depends on the performance of the sub-classification model and the initial entity recognition from spaCy.
- Knowledge Base Errors: The programmatically built knowledge base, while large, may contain some noise and errors, which are corrected over time via review.
- Future Work: The most impactful next step would be to continue the data-centric loop by using the trained v2 NER model to find more errors, correct them, and re-train a v3 model to further improve performance.
## Acknowledgments

- This project uses the BBC News Dataset, originally collected for the publication: D. Greene and P. Cunningham. "Practical Solutions to the Problem of Diagonal Dominance in Kernel Document Clustering", Proc. ICML 2006.
- The project heavily relies on the open-source work of Hugging Face, spaCy, PyTorch, and the broader Python data science community.