An intelligent AI agent for automated entity matching. This system leverages a configurable agent and Hudl's AWS Bedrock-provisioned models to dynamically identify, match, and resolve different entity types like teams and fixtures from various data sources.
The Autonomous Entity Matching AI Agent matches entities through a modular, single-agent approach:
- Dynamic Agent Configuration: A main orchestrator (`src/main.py`) configures the agent with the correct tools and audit rules based on user-defined tasks (e.g., entity type, scoring method).
- Flexible Data Sourcing: Ingests entity IDs from PostgreSQL databases or local CSV files.
- Advanced Matching Logic: Applies weighted or binary audit rules tailored to each entity type. For instance, fixture matching is based on a detailed audit of the associated teams.
- Extensible Framework: Add new entity types or audit rules by extending the configuration in `src/tools.py` and `src/sys_prompts.py` (a hypothetical sketch of such a configuration follows this list).
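
As a rough illustration of what such a configuration could look like, here is a minimal sketch. The tool and rule names are hypothetical, and the real registry in `src/tools.py` / `src/sys_prompts.py` may be organised differently:

```python
# Hypothetical sketch only: an illustrative entity-type registry.
# The actual structure in src/tools.py / src/sys_prompts.py may differ.
ENTITY_CONFIG = {
    "team": {
        "tools": ["search_teams", "fetch_team_details"],   # hypothetical tool names
        "audit_rules": ["name_match", "country_match"],    # hypothetical rule names
        "default_scoring": "weighted",
    },
    "fixture": {
        # Fixture matching audits the associated teams, per the note above.
        "tools": ["search_fixtures", "fetch_fixture_teams"],
        "audit_rules": ["home_team_audit", "away_team_audit"],
        "default_scoring": "weighted",
    },
}
```

Under a shape like this, supporting a new entity type would mean adding one entry here plus its system prompt in `src/sys_prompts.py`.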
Before running this application, ensure you have:
- AWS CLI configured with Hudl main account access
- Python 3.8+ installed
- Access to a PostgreSQL database (if using it as a data source)
Authenticate with the Hudl main account using AWS CLI:
```bash
aws sso login --profile hudl_aws_main_account_profile_name
```
Create and activate a Python virtual environment:
```bash
python -m venv entity-matching-env
source entity-matching-env/bin/activate  # On macOS/Linux
# or
entity-matching-env\Scripts\activate     # On Windows
```
This project uses a `.env` file to manage environment variables. Create a file named `.env` in the root directory of the project and add the following variables:
```env
# Hudl API Key for GraphQL endpoint
HUDL_API_KEY="your_hudl_api_key_here"

# AWS Profile for Boto3
AWS_PROFILE="your_hudl_aws_main_account_profile_name"

# PostgreSQL Database Credentials (only required if using postgres data source)
DB_HOST="your_db_host"
DB_NAME="your_db_name"
DB_USER="your_db_user"
DB_PASS="your_db_password"
DB_PORT="5432"
```
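
At runtime these values are typically read from the environment. Here is a minimal sketch assuming python-dotenv is used to load the file; the project may load its configuration differently:

```python
# Minimal sketch, assuming python-dotenv loads the .env file;
# adjust if the project reads configuration another way.
import os

from dotenv import load_dotenv

load_dotenv()  # reads .env from the current working directory

api_key = os.environ["HUDL_API_KEY"]
aws_profile = os.environ["AWS_PROFILE"]
db_port = int(os.getenv("DB_PORT", "5432"))  # only needed for postgres runs
```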
Install the required Python packages:
```bash
pip install -r requirements.txt
```
The primary way to run the system is through the command-line interface of the main orchestrator script.
Execute the agent from your terminal using the following command structure:
```bash
python src/main.py {entity_type} {data_source} {source_input} {output_path} [--scoring_method {weighted|binary}]
```
Arguments:
- `entity_type`: The type of entity to process (`team` or `fixture`).
- `data_source`: The source of the entity IDs (`csv` or `postgres`).
- `source_input`: The location of the data.
  - For `csv`: the absolute file path to your input CSV.
  - For `postgres`: the name of the table containing the IDs.
- `output_path`: The file path for the CSV output results.
- `--scoring_method` (optional): The scoring method for matching (`weighted` or `binary`). Defaults to `weighted`. A sketch contrasting the two methods follows this list.
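
To make the two scoring methods concrete, here is a minimal sketch of the distinction, assuming each audit rule yields a pass/fail result and, for weighted scoring, a weight. The rule names and weights below are hypothetical; the real audit logic lives in `src/tools.py`:

```python
# Minimal sketch of the weighted-vs-binary distinction; rule names and
# weights are hypothetical, not the project's confirmed implementation.

def weighted_score(rule_results, weights):
    """Fraction of the total weight carried by the rules that passed."""
    total = sum(weights.values())
    passed = sum(weights[name] for name, ok in rule_results.items() if ok)
    return passed / total

def binary_score(rule_results):
    """1.0 only if every audit rule passed, otherwise 0.0."""
    return 1.0 if all(rule_results.values()) else 0.0

# Example: two of three rules pass.
results = {"name_match": True, "country_match": True, "league_match": False}
weights = {"name_match": 0.5, "country_match": 0.3, "league_match": 0.2}
print(weighted_score(results, weights))  # 0.8
print(binary_score(results))             # 0.0
```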
1. Matching Teams from a CSV File:
```bash
python src/main.py team csv "/path/to/your/teams_to_match.csv" "results/team_output.csv" --scoring_method weighted
```
The input CSV file must contain a header named `source_gsl_id`.
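
For illustration, a minimal valid input file could look like this (the IDs below are placeholders):

```csv
source_gsl_id
gsl-team-0001
gsl-team-0002
```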
2. Matching Fixtures from a PostgreSQL Table:
```bash
python src/main.py fixture postgres "sports_fixtures_table" "results/fixture_output.csv"
```
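
For reference, here is a hedged sketch of how a postgres source might pull the IDs using the `DB_*` variables from `.env`. The use of `psycopg2` and the `source_gsl_id` column name are assumptions, not the project's confirmed implementation:

```python
# Hypothetical sketch of reading entity IDs from a postgres table;
# psycopg2 and the column name source_gsl_id are assumptions.
import os

import psycopg2

conn = psycopg2.connect(
    host=os.environ["DB_HOST"],
    dbname=os.environ["DB_NAME"],
    user=os.environ["DB_USER"],
    password=os.environ["DB_PASS"],
    port=os.environ.get("DB_PORT", "5432"),
)
with conn, conn.cursor() as cur:
    cur.execute("SELECT source_gsl_id FROM sports_fixtures_table;")
    ids = [row[0] for row in cur.fetchall()]
conn.close()
```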
The agent will print its progress to the console and save the final results to the CSV file specified in the `output_path` argument.
The original Jupyter notebooks (`langchain_ob/`) are still available for development, testing, and exploration purposes but are not part of the main autonomous workflow.