Skip to content

A tool to download all commits made by a GitHub user across specified repositories within a date range using GitHub API. Additionally, it can fetch commits made in pull requests by the user instead of just the final PR merge commit.

License

intel/gh-contrib-scraper-report-gen

GHContribScraperReportGen

Introduction

A tool to download all commits made by a GitHub user across specified repositories within a date range using GitHub API. Additionally, it can fetch commits made in pull requests by the user instead of just the final PR merge commit.

The tool generates diffs for each commit and provides text reports of all commits grouped by repositories.

Prerequisites:

  • Python 3.9 or higher.

  • Windows or Linux OS.

Quick Installation

  1. Clone this repository, download the ZIP or a release.

  2. Run the appropriate script for your OS:

  • Windows: .\run.ps1
  • Linux: ./run.sh

The run script is a helper that automates the installation of dependencies and simplifies the usage of the tool. It also creates a Python virtual environment if it doesn't exist.

NOTE: On Linux, you might need to make the script executable first by running: chmod +x run.sh

Manual installation (alternatively)

  1. Clone this repository or download the ZIP.

  2. Create a virtual environment:

  • Windows (PowerShell):
    python -m venv .venv ; .\.venv\Scripts\Activate.ps1
  • Linux (Bash):
    python3 -m venv .venv && source .venv/bin/activate

NOTE: This step is optional but recommended to avoid dependency conflicts with other projects. It will be created automatically if you use the run script though.

  1. Install dependencies:
pip install -r requirements.txt

Or use the run script:

  • Windows (PowerShell):
    .\run.ps1 -Tasks install
  • Linux (Bash):
    ./run.sh install

NOTE: If using the run script, to provide custom pip flags, set the PIP_INSTALL_FLAGS environment variable before running. For example:

  • Windows (PowerShell):
    $env:PIP_INSTALL_FLAGS="--upgrade --trusted-host my.host"
    .\run.ps1 -Tasks install
  • Linux (Bash):
    export PIP_INSTALL_FLAGS="--upgrade --trusted-host my.host"
    ./run.sh install

You don't actually need to run the install task if you are using the run script, as it will automatically install dependencies if they are not already installed when you run it with no additional params.

Setting up GitHub Personal Access Token

  • Create tokens at GitHub Personal Access Tokens page for all owners/organizations you want to scan.

  • For better security, consider using fine-grained tokens with the minimum required permissions (Read-only: Contents, Pull requests). Make sure to select the desired organization as the resource owner.

  • Set it as an environment variable:

    • Windows (PowerShell):
      $env:GITHUB_TOKEN="your_token_here"
    • Linux (Bash):
      export GITHUB_TOKEN="your_token_here"
  • Or create a .env file with your token:

    GITHUB_TOKEN=your_token_here
    

    NOTE: The .env variables are loaded automatically only when using the run script.

  • Or pass it as a command-line --token argument or in the GUI.

NOTE: If you are using multiple tokens, instead of passing a single token to the environment variable or command line parameter, you can pass a map of owner tokens (comma-separated owner:token pairs, e.g., owner1:token1,owner2:token2). The scraper will use the appropriate token for each repository based on its owner.

Usage

Graphical User Interface (GUI)

A web‑based interface is available via NICEGUI. It exposes all CLI options (username, repositories, date, tokens, commit fields, report formats, etc.) with field validation and streams logs live. After completion, navigate to the output directory to view results.

  • Windows (PowerShell):
    python -m src.main_gui
  • Linux (Bash):
    python3 -m src.main_gui

or

  • Windows (PowerShell):
    .\run.ps1 -Tasks gui
  • Linux (Bash):
    ./run.sh gui

This will launch the GUI in your browser at http://localhost:8080/ and guide you through the inputs.

NOTE: If run via the run script, the GUI will automatically load the .env file and set the environment variables. Providing the repositories and/or token(s) via the GUI will override the environment variables.

Command Line Interface (CLI)

python -m src.main <username> --since "YYYY-MM-DD" `
  [--token "your_token_or_map_of_tokens"] `
  [--repositories "owner1/repo1,owner2/repo2"] `
  [--output OUTPUT] [--ca-bundle CA_BUNDLE_PATH] [--no-verify-ssl] [--fetch-pr-commits] [--include-merge-commits] `
  [--commit-fields date url message sha stats files_changed] `
  [--report-formats markdown text json] `
  [--limit-download-diffs LIMIT_OF_FILES LIMIT_LINES_CHANGED]

NOTE: On Linux, just call python3 instead of python.

  • username: The GitHub username to scan

  • --since: Only include contributions after this date (required)

  • --token: GitHub Personal Access Token or a map of owner tokens (can also be set via environment variable)

  • --repositories: Comma-separated list of repositories (can also be set via environment variable)

  • --output: Output directory for the results (default: output/)

  • --ca-bundle: Path to a custom SSL CA bundle file

  • --no-verify-ssl: Disable SSL verification (not recommended)

  • --fetch-pr-commits: Fetch any commits made in pull requests

  • --include-merge-commits: Include merge commits

  • --commit-fields: Space-separated list of commit fields to include in reports. Possible values: date, url, message, sha, stats, files_changed. Default fields: date, url, message.

  • --report-formats: Space-separated list of report formats to generate. Options: text, markdown, json. Default: text

  • --limit-download-diffs: Limit downloading diffs for commits whose file count or total lines changed exceed the given thresholds. Provide two integers: LIMIT_OF_FILES and LIMIT_LINES_CHANGED. Defaults are 30 and 3000. Commits above either limit are skipped with an informational log.

Date Format

The --since parameter accepts the following date formats:

  • YYYY-MM-DD (e.g., "2023-01-01")

  • YYYY-MM-DD HH:MM:SS (e.g., "2023-01-01 00:00:00")

  • YYYY-MM-DDTHH:MM:SS (e.g., "2023-01-01T00:00:00")

Example

python -m src.main octocat --repositories "octocat/Hello-World" --since "2022-01-01" --fetch-pr-commits --token "your_token_here"

NOTE: On Linux, just call python3 instead of python.

This will fetch direct commits, pull request commits made by octocat since the given date, store diffs in an output directory, and generate a report listing all direct commits and commits made in octocat PRs.

Setting Repositories via Environment Variable

You can also set the repositories to scan using an environment variable:

  • Windows (PowerShell):
    $env:GITHUB_REPOSITORIES="owner/repo1,owner/repo2"
    python -m src.main <username> --since "2023-01-01" --token "your_token_here"
  • Linux (Bash):
    export GITHUB_REPOSITORIES="owner/repo1,owner/repo2"
    python3 -m src.main <username> --since "2023-01-01" --token "your_token_here"

Or in your .env file:

GITHUB_REPOSITORIES=owner/repo1,owner/repo2

BTW: Unfortunately, a working solution to enable automatic scraping all repositories user contributed to has not been found. It would seem GitHub API doesn't provide a straightforward way to retrieve all repositories a user has contributed to, especially if they are private or part of an organization. The best approach seems to be to manually specify the repositories user might want to scan.

Using the Run Script

The included run script provides a convenient way to run the application:

  • Windows (PowerShell):
    .\run.ps1 -Tasks cli -PythonArgs "<username> --repositories owner/repo1,owner/repo2 --since 2023-01-01 --fetch-pr-commits"
  • Linux (Bash):
    ./run.sh cli -PythonArgs "<username> --repositories owner/repo1,owner/repo2 --since 2023-01-01 --fetch-pr-commits"

It will also automatically create a virtual environment if it doesn't exist and install the required dependencies.

VSCode Tasks

The repository includes VSCode tasks for common operations:

  • Run CLI: Runs the application with the prompted arguments

  • Run GUI: Opens the GUI interface

Output Format

Metadata JSON

The metadata.json file contains metadata about the query, including:

  • Username

  • List of repositories

  • Start date (since)

Commit Diffs

The commits/ directory will contain .diff files for each commit (both direct and, if included, from PRs) named by their SHA. Each .diff file includes:

  • File names and paths modified in the commit

  • Status of each file (added, modified, removed)

  • Changes summary (+additions/-deletions)

  • Patch/diff contents showing the actual code changes

Reports

The tool can generate any combination of the following report files based on the --report-formats setting:

  • report.txt: Plain‑text report (when text is selected)

  • report.md: Markdown report (when markdown is selected)

  • report.json: JSON report (when json is selected)

Network Configuration

Proxy Settings

If you're behind a corporate proxy, you may need to configure proxy settings:

  1. Add to your .env file:
HTTP_PROXY=http://your-proxy-server:port
HTTPS_PROXY=http://your-proxy-server:port
  1. This helps resolve ConnectTimeoutError issues when connecting to GitHub.

SSL Certificate Issues

If you encounter SSL verification errors:

  • 1st option: Point pip to use system certificates by setting SSL_CERT_FILE environment variable:

    • Windows (PowerShell):
      $env:SSL_CERT_FILE="C:\path\to\your\CAcert.crt"
    • Linux (Bash):
      export SSL_CERT_FILE="/path/to/your/CAcert.crt"
  • 2nd option: Use the --ca-bundle parameter to specify a certificate bundle:

    python -m src.main <username> --repositories "owner/repo" --since "2023-01-01" --ca-bundle "C:\path\to\your\CAcert.crt"

    NOTE: On Linux, just call python3 instead of python.

  • 3rd option: If you're still facing issues, consider installing an additional package like pip-system-certs to use system certificates:

    pip install pip-system-certs

    This will automatically configure pip to use the system's CA certificates.

  • 4th option: As a last resort (not secure), use --no-verify-ssl.

Notes

Known limitations

  • The tool does not automatically scrape all repositories a user has contributed to. You need to specify the repositories you want to scan.

  • Only commits made to the default branch of the specified repositories are fetched.

  • End date is not supported. The tool will fetch all commits made after the specified start date until the current date.

About

A tool to download all commits made by a GitHub user across specified repositories within a date range using GitHub API. Additionally, it can fetch commits made in pull requests by the user instead of just the final PR merge commit.

Topics

Resources

License

Code of conduct

Contributing

Security policy

Stars

Watchers

Forks

Contributors 3

  •  
  •  
  •