A Python tool for scraping a set of repositories from GitHub to a MongoDB database.
You do not need to follow this README and run `gha` yourself to use the data it has already collected, though you may still wish to if you want a small local dataset for testing.
Instead, please see the project wiki page on using the data.

Requirements:

- Docker
- Python 3.7 or greater
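
The two requirements above can be checked from Python before installing. This is a minimal sketch and not part of `gha`; the function name `missing_prerequisites` is hypothetical, and it only checks that a `docker` executable is on `PATH`, not that the Docker daemon is running.

```python
import shutil
import sys

MIN_PYTHON = (3, 7)  # minimum version stated in the requirements list


def missing_prerequisites(version_info=None, which=shutil.which):
    """Return a list of human-readable problems; an empty list means all good."""
    if version_info is None:
        version_info = sys.version_info
    problems = []
    # Compare only (major, minor) against the documented minimum.
    if tuple(version_info[:2]) < MIN_PYTHON:
        problems.append("Python 3.7 or greater is required")
    # shutil.which returns None when the executable is not on PATH.
    if which("docker") is None:
        problems.append("docker was not found on PATH")
    return problems


# Report any problems found in the current environment.
for problem in missing_prerequisites():
    print("missing:", problem)
```

The `version_info` and `which` parameters exist only so the check can be exercised without changing the real environment.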

To install:

- Clone this repository and `cd` into the cloned directory
- Create and activate a virtual environment
- Install this package (`gha`) into the virtual environment

    git clone https://github.com/Southampton-RSG/github-analysis.git
    cd github-analysis
    python3 -m venv venv
    source venv/bin/activate
    pip install .

To run the scraper:

- Create a GitHub personal access token at https://github.com/settings/tokens
  - No permissions are required
- Populate a `.env` file from `.env.template`
- Start MongoDB database containers
  - `docker-compose` can be installed with `pip` if necessary
- Start `gha` scraper using a repo list file
  - Virtual environment created above must still be active
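
The personal access token from the steps above can be checked before starting a long scrape. The following is a minimal sketch and not part of `gha`: it assumes GitHub's public `GET /rate_limit` REST endpoint and uses only the standard library. The function names and the `User-Agent` string are hypothetical.

```python
import json
import urllib.request

RATE_LIMIT_URL = "https://api.github.com/rate_limit"


def build_rate_limit_request(token):
    """Build (but do not send) an authenticated request to the rate-limit endpoint."""
    return urllib.request.Request(
        RATE_LIMIT_URL,
        headers={
            "Authorization": "token " + token,
            "Accept": "application/vnd.github+json",
            "User-Agent": "gha-token-check",  # hypothetical client name
        },
    )


def core_limit(payload):
    """Extract the hourly core API limit from a decoded /rate_limit response."""
    return payload["resources"]["core"]["limit"]


def check_token(token, timeout=10):
    """Send the request; an invalid token raises urllib.error.HTTPError."""
    request = build_rate_limit_request(token)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return core_limit(json.load(response))
```

Calling `check_token("<your token>")` needs network access and should return the hourly limit for the core API, which is typically much higher for a valid token than for unauthenticated requests. A rejected token surfaces as an `HTTPError` with status 401.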

    docker-compose up -d
    gha fetch -f tests/data/UKRI_10.txt

The database web console can be accessed at http://localhost:8081/db/github/.
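
Besides the web console, the scraped data can be inspected from Python. The sketch below is not part of `gha` and rests on assumptions: that `pymongo` is installed separately, that MongoDB is reachable on `localhost:27017` without authentication (check `docker-compose.yml` and `.env` for the real port and credentials), and that the database is named `github`, as the console URL suggests. The function names are hypothetical.

```python
def collection_counts(db):
    """Map each collection name in `db` to its approximate document count."""
    return {
        name: db[name].estimated_document_count()
        for name in sorted(db.list_collection_names())
    }


def main():
    # Imported here so the helper above stays importable without pymongo.
    from pymongo import MongoClient

    # Assumption: default port, no authentication; adjust to match your setup.
    client = MongoClient("mongodb://localhost:27017", serverSelectionTimeoutMS=5000)
    # "github" is taken from the web console URL, not from the gha source.
    for name, count in collection_counts(client["github"]).items():
        print(f"{name}: {count}")
```

Run `main()` once the containers are up and a fetch has completed. Listing the collections rather than hard-coding names avoids guessing how `gha` organises its data.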