A tool to download all commits made by a GitHub user across specified repositories within a date range using the GitHub API. It can also fetch the individual commits the user made in pull requests, not just the final PR merge commit.
The tool generates diffs for each commit and provides text reports of all commits grouped by repositories.
Prerequisites:
- Python 3.9 or higher.
- Windows or Linux OS.
- Clone this repository, download the ZIP or a release.
- Run the appropriate script for your OS:
  - Windows:
      .\run.ps1
  - Linux:
      ./run.sh
The run script is a helper that automates the installation of dependencies and simplifies the usage of the tool. It also creates a Python virtual environment if it doesn't exist.
NOTE: On Linux, you might need to make the script executable first by running:
chmod +x run.sh
- Clone this repository or download the ZIP.
- Create a virtual environment:
  - Windows (PowerShell):
      python -m venv .venv ; .\.venv\Scripts\Activate.ps1
  - Linux (Bash):
      python3 -m venv .venv && source .venv/bin/activate
NOTE: This step is optional but recommended to avoid dependency conflicts with other projects. If you use the run script, the virtual environment is created automatically.
- Install dependencies:
    pip install -r requirements.txt
  Or use the run script:
  - Windows (PowerShell):
      .\run.ps1 -Tasks install
  - Linux (Bash):
      ./run.sh install
  NOTE: If using the run script, to provide custom pip flags, set the PIP_INSTALL_FLAGS environment variable before running. For example:
  - Windows (PowerShell):
      $env:PIP_INSTALL_FLAGS="--upgrade --trusted-host my.host"
      .\run.ps1 -Tasks install
  - Linux (Bash):
      export PIP_INSTALL_FLAGS="--upgrade --trusted-host my.host"
      ./run.sh install
  You don't actually need to run the install task if you are using the run script, as it will automatically install dependencies if they are not already installed when you run it with no additional params.
- Create tokens at GitHub Personal Access Tokens page for all owners/organizations you want to scan.
- For better security, consider using fine-grained tokens with the minimum required permissions (Read-only: Contents, Pull requests). Make sure to select the desired organization as the resource owner.
- Set it as an environment variable:
  - Windows (PowerShell):
      $env:GITHUB_TOKEN="your_token_here"
  - Linux (Bash):
      export GITHUB_TOKEN="your_token_here"
- Or create a .env file with your token:
    GITHUB_TOKEN=your_token_here
  NOTE: The .env variables are loaded automatically only when using the run script.
- Or pass it as a command-line --token argument or in the GUI.
NOTE: If you are using multiple tokens, instead of passing a single token to the environment variable or command line parameter, you can pass a map of owner tokens (comma-separated owner:token pairs, e.g., owner1:token1,owner2:token2). The scraper will use the appropriate token for each repository based on its owner.
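The owner-to-token mapping described above could be parsed roughly as follows. This is only a sketch of the idea, not the tool's actual code: the function names `parse_tokens` and `token_for_repo` are hypothetical, and it assumes that tokens never contain `:` or `,` and that owner names can be compared case-insensitively.

```python
from typing import Dict, Optional, Union

TokenSpec = Union[str, Dict[str, str]]


def parse_tokens(raw: str) -> TokenSpec:
    """Return a single token, or an owner -> token map for 'owner:token' pairs."""
    raw = raw.strip()
    if ":" not in raw:
        # No separator present: treat the whole value as one token (assumption).
        return raw
    tokens: Dict[str, str] = {}
    for pair in raw.split(","):
        pair = pair.strip()
        if not pair:
            continue  # tolerate a trailing comma
        owner, sep, token = pair.partition(":")
        if not sep or not owner.strip() or not token.strip():
            raise ValueError(f"Expected 'owner:token', got {pair!r}")
        tokens[owner.strip().lower()] = token.strip()
    return tokens


def token_for_repo(tokens: TokenSpec, repository: str) -> Optional[str]:
    """Pick the token for an 'owner/repo' string; None if the owner is unknown."""
    if isinstance(tokens, str):
        return tokens  # a single token applies to every repository
    owner = repository.split("/", 1)[0].lower()
    return tokens.get(owner)
```

For example, `token_for_repo(parse_tokens("owner1:token1,owner2:token2"), "owner2/repo2")` selects the token registered for `owner2`.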
A web-based interface is available via NiceGUI. It exposes all CLI options (username, repositories, date, tokens, commit fields, report formats, etc.) with field validation and streams logs live. After the run completes, open the output directory to view the results.
- Windows (PowerShell):
    python -m src.main_gui
- Linux (Bash):
    python3 -m src.main_gui
or
- Windows (PowerShell):
    .\run.ps1 -Tasks gui
- Linux (Bash):
    ./run.sh gui
This will launch the GUI in your browser at http://localhost:8080/ and guide you through the inputs.
NOTE: If run via the run script, the GUI will automatically load the .env file and set the environment variables. Providing the repositories and/or token(s) via the GUI will override the environment variables.
python -m src.main <username> --since "YYYY-MM-DD" `
[--token "your_token_or_map_of_tokens"] `
[--repositories "owner1/repo1,owner2/repo2"] `
[--output OUTPUT] [--ca-bundle CA_BUNDLE_PATH] [--no-verify-ssl] [--fetch-pr-commits] [--include-merge-commits] `
[--commit-fields date url message sha stats files_changed] `
[--report-formats markdown text json] `
[--limit-download-diffs LIMIT_OF_FILES LIMIT_LINES_CHANGED]
NOTE: On Linux, call python3 instead of python, and use \ instead of ` as the line-continuation character.
- username: The GitHub username to scan
- --since: Only include contributions after this date (required)
- --token: GitHub Personal Access Token or a map of owner tokens (can also be set via environment variable)
- --repositories: Comma-separated list of repositories (can also be set via environment variable)
- --output: Output directory for the results (default: output/)
- --ca-bundle: Path to a custom SSL CA bundle file
- --no-verify-ssl: Disable SSL verification (not recommended)
- --fetch-pr-commits: Fetch any commits made in pull requests
- --include-merge-commits: Include merge commits
- --commit-fields: Space-separated list of commit fields to include in reports. Possible values: date, url, message, sha, stats, files_changed. Default fields: date, url, message.
- --report-formats: Space-separated list of report formats to generate. Options: text, markdown, json. Default: text
- --limit-download-diffs: Limit downloading diffs for commits whose file count or total lines changed exceed the given thresholds. Provide two integers: LIMIT_OF_FILES and LIMIT_LINES_CHANGED. Defaults are 30 and 3000. Commits above either limit are skipped with an informational log.
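To illustrate how the two --limit-download-diffs thresholds interact, here is a minimal sketch. It is not taken from the tool's source: `build_parser` and `should_download_diff` are hypothetical names, and it assumes a commit sitting exactly on a limit is still downloaded, since only commits above either limit are described as skipped.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Parser with only the one option discussed here (illustrative)."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--limit-download-diffs",
        nargs=2,
        type=int,
        default=[30, 3000],  # documented defaults
        metavar=("LIMIT_OF_FILES", "LIMIT_LINES_CHANGED"),
    )
    return parser


def should_download_diff(files_changed: int, lines_changed: int,
                         limit_files: int = 30, limit_lines: int = 3000) -> bool:
    """True unless the commit is above either threshold."""
    return files_changed <= limit_files and lines_changed <= limit_lines


if __name__ == "__main__":
    args = build_parser().parse_args(["--limit-download-diffs", "10", "500"])
    limit_files, limit_lines = args.limit_download_diffs
    # A commit touching 12 files exceeds the file limit of 10, so it is skipped.
    print(should_download_diff(12, 40, limit_files, limit_lines))
```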
The --since parameter accepts the following date formats:
- YYYY-MM-DD (e.g., "2023-01-01")
- YYYY-MM-DD HH:MM:SS (e.g., "2023-01-01 00:00:00")
- YYYY-MM-DDTHH:MM:SS (e.g., "2023-01-01T00:00:00")
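The three accepted formats map directly onto `datetime.strptime` patterns. The sketch below shows one way such parsing might look; the name `parse_since` is hypothetical and the tool's real implementation may differ (for instance in how it handles time zones, which this sketch ignores).

```python
from datetime import datetime

# strptime patterns matching the three documented --since formats.
SINCE_FORMATS = ("%Y-%m-%d", "%Y-%m-%d %H:%M:%S", "%Y-%m-%dT%H:%M:%S")


def parse_since(value: str) -> datetime:
    """Try each accepted format in turn and return a naive datetime."""
    for fmt in SINCE_FORMATS:
        try:
            return datetime.strptime(value, fmt)
        except ValueError:
            continue  # not this format, try the next one
    raise ValueError(
        f"Unsupported date {value!r}; expected one of: "
        "YYYY-MM-DD, YYYY-MM-DD HH:MM:SS, YYYY-MM-DDTHH:MM:SS"
    )
```

A date-only value is interpreted as midnight at the start of that day.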
python -m src.main octocat --repositories "octocat/Hello-World" --since "2022-01-01" --fetch-pr-commits --token "your_token_here"
NOTE: On Linux, just call python3 instead of python.
This will fetch octocat's direct commits and pull request commits made since the given date, store the diffs in the output directory, and generate a report listing all direct commits and all commits made in octocat's PRs.
You can also set the repositories to scan using an environment variable:
- Windows (PowerShell):
    $env:GITHUB_REPOSITORIES="owner/repo1,owner/repo2"
    python -m src.main <username> --since "2023-01-01" --token "your_token_here"
- Linux (Bash):
    export GITHUB_REPOSITORIES="owner/repo1,owner/repo2"
    python3 -m src.main <username> --since "2023-01-01" --token "your_token_here"
Or in your .env file:
GITHUB_REPOSITORIES=owner/repo1,owner/repo2
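As a rough illustration of how the comma-separated repository list might be resolved, consider the sketch below. The function name `resolve_repositories` is hypothetical, and the precedence shown (an explicit argument wins over GITHUB_REPOSITORIES) is an assumption modelled on the GUI behaviour described earlier, not something confirmed for the CLI.

```python
import os
from typing import List, Optional


def resolve_repositories(cli_value: Optional[str] = None) -> List[str]:
    """Return a list of 'owner/repo' strings from the argument or the environment."""
    # Assumed precedence: explicit value first, then the environment variable.
    raw = cli_value if cli_value else os.environ.get("GITHUB_REPOSITORIES", "")
    repos = [item.strip() for item in raw.split(",") if item.strip()]
    for repo in repos:
        owner, sep, name = repo.partition("/")
        if not sep or not owner or not name:
            raise ValueError(f"Expected 'owner/repo', got {repo!r}")
    return repos
```

Validating the `owner/repo` shape up front gives a clear error before any API request is made.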
NOTE: Unfortunately, no working way to automatically discover every repository a user has contributed to has been found. The GitHub API does not appear to provide a straightforward way to list them, especially when the repositories are private or belong to an organization. For now, the best approach is to manually specify the repositories you want to scan.
The included run script provides a convenient way to run the application:
- Windows (PowerShell):
.\run.ps1 -Tasks cli -PythonArgs "<username> --repositories owner/repo1,owner/repo2 --since 2023-01-01 --fetch-pr-commits"
- Linux (Bash):
./run.sh cli -PythonArgs "<username> --repositories owner/repo1,owner/repo2 --since 2023-01-01 --fetch-pr-commits"
It will also automatically create a virtual environment if it doesn't exist and install the required dependencies.
The repository includes VSCode tasks for common operations:
- Run CLI: Runs the application with the prompted arguments
- Run GUI: Opens the GUI interface
The metadata.json file contains metadata about the query, including:
- Username
- List of repositories
- Start date (since)
The commits/ directory will contain .diff files for each commit (both direct and, if included, from PRs) named by their SHA. Each .diff file includes:
- File names and paths modified in the commit
- Status of each file (added, modified, removed)
- Changes summary (+additions/-deletions)
- Patch/diff contents showing the actual code changes
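The four pieces of information listed above correspond to the `filename`, `status`, `additions`, `deletions`, and `patch` fields that the GitHub REST API returns for each entry in a commit's `files` array. The sketch below assembles them into text; the function name `render_diff` and the exact layout are assumptions for illustration, and the tool's real .diff files may be formatted differently.

```python
from typing import Any, Dict, List


def render_diff(files: List[Dict[str, Any]]) -> str:
    """Build a plain-text diff summary from a commit's 'files' entries."""
    sections = []
    for entry in files:
        header = (
            f"{entry['filename']} ({entry['status']}, "
            f"+{entry.get('additions', 0)}/-{entry.get('deletions', 0)})"
        )
        # 'patch' can be absent, e.g. for binary files or very large changes.
        patch = entry.get("patch") or "(no patch available)"
        sections.append(f"{header}\n{patch}")
    return "\n\n".join(sections) + "\n"


if __name__ == "__main__":
    sample = [{
        "filename": "README.md",
        "status": "modified",
        "additions": 1,
        "deletions": 1,
        "patch": "@@ -1 +1 @@\n-Hello\n+Hello World",
    }]
    print(render_diff(sample))
```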
The tool can generate any combination of the following report files based on the --report-formats setting:
- report.txt: Plain-text report (when text is selected)
- report.md: Markdown report (when markdown is selected)
- report.json: JSON report (when json is selected)
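The relationship between the selected formats and the generated files can be summarised as a simple lookup. This is a sketch only: `report_paths` is a hypothetical helper, and placing the reports directly inside the output directory is an assumption.

```python
from pathlib import Path
from typing import Iterable, List

# Format name -> report file name, as listed above.
REPORT_FILES = {"text": "report.txt", "markdown": "report.md", "json": "report.json"}


def report_paths(output_dir: str, formats: Iterable[str] = ("text",)) -> List[Path]:
    """Return the report files expected for the chosen --report-formats values."""
    paths = []
    for fmt in formats:
        if fmt not in REPORT_FILES:
            raise ValueError(f"Unknown report format: {fmt!r}")
        paths.append(Path(output_dir) / REPORT_FILES[fmt])
    return paths
```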
If you're behind a corporate proxy, you may need to configure proxy settings:
- Add to your .env file:
    HTTP_PROXY=http://your-proxy-server:port
    HTTPS_PROXY=http://your-proxy-server:port
- This helps resolve ConnectTimeoutError issues when connecting to GitHub.
If you encounter SSL verification errors:
- 1st option: Point pip to use system certificates by setting SSL_CERT_FILE environment variable:
  - Windows (PowerShell):
      $env:SSL_CERT_FILE="C:\path\to\your\CAcert.crt"
  - Linux (Bash):
      export SSL_CERT_FILE="/path/to/your/CAcert.crt"
- 2nd option: Use the --ca-bundle parameter to specify a certificate bundle:
    python -m src.main <username> --repositories "owner/repo" --since "2023-01-01" --ca-bundle "C:\path\to\your\CAcert.crt"
  NOTE: On Linux, just call python3 instead of python.
- 3rd option: If you're still facing issues, consider installing an additional package like pip-system-certs to use system certificates:
    pip install pip-system-certs
  This will automatically configure pip to use the system's CA certificates.
- 4th option: As a last resort (not secure), use --no-verify-ssl.
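Many Python HTTP clients, the `requests` library among them, take a single `verify` setting that is either a boolean or a path to a CA bundle. The sketch below shows how the --ca-bundle and --no-verify-ssl options could be folded into such a value. Whether the tool uses `requests`, the helper name `resolve_verify`, and the precedence between the two flags are all assumptions.

```python
from typing import Optional, Union


def resolve_verify(ca_bundle: Optional[str] = None,
                   no_verify_ssl: bool = False) -> Union[bool, str]:
    """Map the two SSL-related options to one 'verify' value."""
    if no_verify_ssl:
        return False       # verification disabled (insecure, last resort)
    if ca_bundle:
        return ca_bundle   # verify against the custom CA bundle file
    return True            # default: verify against the default CA store


if __name__ == "__main__":
    # With requests installed, the value could be passed as, for example:
    #   requests.get(url, verify=resolve_verify(ca_bundle="/path/to/your/CAcert.crt"))
    print(resolve_verify(ca_bundle="/path/to/your/CAcert.crt"))
```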
- The tool does not automatically scrape all repositories a user has contributed to. You need to specify the repositories you want to scan.
- Only commits made to the default branch of the specified repositories are fetched.
- End date is not supported. The tool will fetch all commits made after the specified start date until the current date.