HMMT PDF Scraper

This project provides an asynchronous Python tool for downloading all PDF problem sets from the Harvard-MIT Mathematics Tournament (HMMT) archive. I built it while working as a data labeling contractor, to streamline collecting HMMT problems for AI/LLM data annotation.

Purpose

Manually gathering large sets of math problems for annotation is tedious and error-prone. This tool automates the collection step, making it easier to build datasets for machine learning, natural language processing, or educational research. It is especially useful for AI/LLM data annotation projects involving mathematical content such as Olympiad-style problems.

Features

- Automated Discovery: Crawls the HMMT archive and its subpages to find all available PDF problem sets.
- Concurrent Downloads: Downloads PDFs in parallel for speed.
- Robustness: Handles network errors and timeouts with retries and exponential backoff.
- Unique Filenames: Saves each PDF under a unique name to prevent overwriting.
- Download Logging: Records all downloads in a JSON log for traceability.
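
To illustrate the concurrency and retry features, here is a minimal sketch that uses only the standard library. It is not the code in src/main.py, which, judging by the dependency list, relies on aiohttp and tenacity. The names `fetch_with_retry` and `download_all`, the `fetch` callable, and the default values are all assumptions made for illustration.

```python
import asyncio
import random


async def fetch_with_retry(fetch, url, retries=4, base_delay=0.5):
    """Call the (hypothetical) coroutine `fetch(url)`, retrying on failure.

    The wait between attempts doubles each time (exponential backoff).
    """
    for attempt in range(retries):
        try:
            return await fetch(url)
        except (OSError, asyncio.TimeoutError):
            if attempt == retries - 1:
                raise  # out of attempts: surface the last error
            delay = base_delay * (2 ** attempt)
            # A little random jitter keeps parallel tasks from retrying in lockstep.
            await asyncio.sleep(delay + random.uniform(0, delay / 10))


async def download_all(fetch, urls, max_concurrent=5):
    """Fetch every URL in parallel, with at most `max_concurrent` in flight."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def bounded(url):
        async with semaphore:
            return await fetch_with_retry(fetch, url)

    # gather() returns results in the same order as `urls`.
    return await asyncio.gather(*(bounded(u) for u in urls))
```

The semaphore caps the number of simultaneous requests so the archive server is not flooded, while the backoff gives transient network errors time to clear before the next attempt.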

Usage

Install dependencies:

pip install aiohttp aiofiles async_timeout beautifulsoup4 tenacity tqdm

Run the scraper:
```
python src/main.py
```
Results:
- All PDFs will be saved in the downloaded_pdfs/ directory.
- A log of all downloads will be written to download_log.json.
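
As a rough picture of how unique filenames and the JSON log could work together, the sketch below derives a non-colliding path for each URL and writes one record per download. The naming scheme (a numeric suffix), the record fields `url` and `file`, and the example.org URLs are assumptions; the actual format produced by src/main.py is not documented here and may differ.

```python
import json
from pathlib import Path, PurePosixPath
from urllib.parse import unquote, urlparse


def unique_path(url, out_dir, taken):
    """Return a path in `out_dir` for `url` that is neither in `taken` nor on disk."""
    name = PurePosixPath(unquote(urlparse(url).path)).name or "download.pdf"
    stem, suffix = Path(name).stem, Path(name).suffix
    candidate, counter = Path(out_dir) / name, 1
    while candidate in taken or candidate.exists():
        # Hypothetical scheme: append _1, _2, ... until the name is free.
        candidate = Path(out_dir) / f"{stem}_{counter}{suffix}"
        counter += 1
    taken.add(candidate)
    return candidate


def write_log(records, log_path):
    """Write the download records to a JSON file."""
    Path(log_path).write_text(json.dumps(records, indent=2), encoding="utf-8")


def build_records(urls, out_dir):
    """Pair each URL with the unique file path it would be saved under."""
    taken = set()
    return [{"url": u, "file": str(unique_path(u, out_dir, taken))} for u in urls]
```

For example, two placeholder URLs that both end in `problems.pdf` would map to `problems.pdf` and `problems_1.pdf`, so the second download does not overwrite the first.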

How It Fits Annotation Workflows

This tool was developed for AI/LLM data annotators. Automating the collection of HMMT problems speeds up dataset creation and lets annotators focus on labeling and analysis rather than data gathering.

Project Structure

src/
  main.py           # Main scraper script
  downloaded_pdfs/  # Output directory for PDFs
download_log.json   # Log of all downloaded files
README.md           # Project documentation

License

This project is provided as-is for research and annotation purposes.

alexander-wei/hmmt_scraper

Folders and files

Latest commit

History

Repository files navigation

HMMT PDF Scraper

Purpose

Features

Usage

How It Fits Annotation Workflows

Project Structure

License

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages