This project provides an asynchronous Python tool for downloading every PDF problem set from the Harvard-MIT Mathematics Tournament (HMMT) archive. I created it to streamline collecting problem sets for AI/LLM data annotation, specifically for a project in which I annotated HMMT problems as a data labeling contractor.
Manually gathering large sets of math problems for annotation is tedious and error-prone. This tool automates the collection step, making it easier to build datasets for machine learning, natural language processing, or educational research. It is especially useful for AI/LLM data annotation projects that involve mathematical content, such as Olympiad-style problems.
- Automated Discovery: Crawls the HMMT archive and subpages to find all available PDF problem sets.
- Concurrent Downloads: Downloads PDFs in parallel for speed and efficiency.
- Robustness: Handles network errors and timeouts with retries and exponential backoff.
- Unique Filenames: Saves each PDF with a unique name to prevent overwriting.
- Download Logging: Records all downloads in a JSON log for traceability.
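The concurrency and retry behaviour listed above can be sketched with the standard library alone. This is a minimal illustration, not the code in `src/main.py`: the names `fetch_with_retry` and `download_all`, the injected `fetch` coroutine, and the default limits are all assumptions, and the backoff is hand-rolled here rather than taken from `tenacity` so the example runs without extra packages.

```python
import asyncio


async def fetch_with_retry(fetch, url, attempts=4, base_delay=0.5):
    """Call `fetch(url)`, retrying failures with exponential backoff.

    `fetch` is a hypothetical coroutine function that returns the body
    for `url` or raises OSError / asyncio.TimeoutError on failure.
    """
    for attempt in range(attempts):
        try:
            return await fetch(url)
        except (OSError, asyncio.TimeoutError):
            if attempt == attempts - 1:
                raise  # out of retries: surface the last error
            # Delay doubles on each attempt: base, 2*base, 4*base, ...
            await asyncio.sleep(base_delay * 2 ** attempt)


async def download_all(fetch, urls, max_concurrency=5, base_delay=0.5):
    """Fetch every URL concurrently, at most `max_concurrency` at a time."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def worker(url):
        # The semaphore caps how many downloads are in flight at once.
        async with semaphore:
            return await fetch_with_retry(fetch, url, base_delay=base_delay)

    # return_exceptions=True keeps one failed URL from aborting the rest;
    # failures come back as exception objects in the result list.
    return await asyncio.gather(
        *(worker(u) for u in urls), return_exceptions=True
    )
```

In a real run, `fetch` would presumably wrap an `aiohttp` request; passing it in as a parameter keeps the retry and concurrency logic easy to exercise without network access.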
- Install dependencies:
      pip install aiohttp aiofiles async_timeout beautifulsoup4 tenacity tqdm
- Run the scraper:
      python src/main.py
- Results:
  - All PDFs will be saved in the downloaded_pdfs/ directory.
  - A log of all downloads will be written to download_log.json.
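As a rough illustration of how the two outputs could be produced, the sketch below saves each PDF under a non-colliding name and records it in a JSON log. The helper names (`unique_path`, `save_pdf`, `write_log`), the `_1`, `_2` suffix scheme, and the log fields (`url`, `file`, `bytes`, `saved_at`) are assumptions made for this example; the actual naming scheme and log schema used by the scraper may differ.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def unique_path(directory, filename):
    """Return a path in `directory` that does not collide with an existing file."""
    directory = Path(directory)
    candidate = directory / filename
    stem, suffix = candidate.stem, candidate.suffix
    counter = 1
    # Append _1, _2, ... until the name is free (hypothetical scheme).
    while candidate.exists():
        candidate = directory / f"{stem}_{counter}{suffix}"
        counter += 1
    return candidate


def save_pdf(directory, filename, data, url, log):
    """Write `data` under a unique name and append a record to `log`."""
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    path = unique_path(directory, filename)
    path.write_bytes(data)
    # Field names are illustrative, not the tool's documented schema.
    log.append({
        "url": url,
        "file": path.name,
        "bytes": len(data),
        "saved_at": datetime.now(timezone.utc).isoformat(),
    })
    return path


def write_log(log, log_path="download_log.json"):
    """Dump the accumulated records as pretty-printed JSON."""
    Path(log_path).write_text(json.dumps(log, indent=2), encoding="utf-8")
```

Keeping the log as a plain list of records makes it straightforward to trace any saved file back to the URL it came from.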
This tool was developed to make life easier for AI/LLM data annotators. By automating the collection of HMMT problems, it enables rapid dataset creation and reduces manual effort, allowing annotators to focus on labeling and analysis rather than data gathering.
src/
    main.py           # Main scraper script
downloaded_pdfs/      # Output directory for PDFs
download_log.json     # Log of all downloaded files
README.md             # Project documentation
This project is provided as-is for research and annotation purposes.