ProductURLMapper is a web scraping and product matching tool for e-commerce sites, optimized for German-language content. It extracts URLs from a website, compares them with product titles from a CSV file, and identifies matches using several matching strategies, each rated with a confidence score.
- Multi-source URL Extraction: Harvests URLs from robots.txt, sitemaps, and direct web crawling
- Advanced German Language Support: Handles German-specific challenges like umlauts, compound words, and specialized terminology
- Multiple Matching Strategies: Employs 11 different matching algorithms with confidence scoring
- Health/Medical Terminology Optimization: Special handling for health and wellness product terminology
- Detailed Reporting: Generates comprehensive CSV reports of matched and unmatched products
- Confidence Scoring: Rates each match with a confidence score to prioritize the most reliable matches
- Python 3.8 or higher
- pip package manager
- Clone the repository:
git clone https://github.com/umerkhan95/ProductURLMapper.git
cd ProductURLMapper
- Create a virtual environment:
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
- Install dependencies:
pip install -r requirements.txt
python product_url_matcher.py https://www.example.com
python product_url_matcher.py https://www.example.com --csv_path /path/to/your/products.csv
The CSV file must contain at least these columns:
- Handle: Product identifier/slug
- Title: Product title/name
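As an illustration of the expected input, the following is a minimal sketch of reading such a CSV with Python's standard `csv` module. The function name `load_products` and the sample row are hypothetical and not taken from the tool's source; the sketch only assumes the two required columns named above.

```python
import csv
import io

# The two columns the README lists as required.
REQUIRED_COLUMNS = {"Handle", "Title"}


def load_products(csv_file):
    """Return (handle, title) pairs from an open CSV file object.

    Hypothetical helper for illustration; not the tool's actual loader.
    """
    reader = csv.DictReader(csv_file)
    missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"CSV is missing required columns: {sorted(missing)}")

    products = []
    for row in reader:
        handle = (row["Handle"] or "").strip()
        title = (row["Title"] or "").strip()
        # Skip rows where either required value is empty.
        if handle and title:
            products.append((handle, title))
    return products


# Example with an in-memory CSV; extra columns such as Vendor are ignored.
sample = io.StringIO("Handle,Title,Vendor\nbio-kraeuter-tee,Bio Kräuter Tee,Example\n")
print(load_products(sample))
```

Reading from a real file works the same way with `open(path, newline="", encoding="utf-8")`, which keeps umlauts in titles intact.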
- Extracts sitemap URLs from robots.txt
- Parses sitemaps to collect product URLs
- Crawls the website to find additional URLs
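The first two steps can be sketched roughly as follows. This is an assumption-labeled illustration that works on already-downloaded text, not the tool's actual code: the function names `sitemap_urls_from_robots` and `urls_from_sitemap` are hypothetical, and fetching and crawling are left out.

```python
import xml.etree.ElementTree as ET


def sitemap_urls_from_robots(robots_text):
    """Collect the values of all 'Sitemap:' lines in a robots.txt body."""
    urls = []
    for line in robots_text.splitlines():
        # Split only on the first colon so the URL's own colons survive.
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "sitemap":
            urls.append(value.strip())
    return urls


def urls_from_sitemap(xml_text):
    """Collect the text of every <loc> element, ignoring XML namespaces."""
    root = ET.fromstring(xml_text)
    return [
        el.text.strip()
        for el in root.iter()
        # Namespaced tags look like '{namespace}loc'; keep the local name.
        if el.tag.rsplit("}", 1)[-1] == "loc" and el.text
    ]


robots = "User-agent: *\nDisallow: /cart\nSitemap: https://www.example.com/sitemap.xml\n"
print(sitemap_urls_from_robots(robots))

sitemap = (
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    "<url><loc>https://www.example.com/products/a</loc></url>"
    "</urlset>"
)
print(urls_from_sitemap(sitemap))
```

Because `urls_from_sitemap` collects every `<loc>` element, it also returns the child sitemap URLs of a sitemap index file, which can then be parsed the same way.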
The tool employs multiple matching strategies in sequence:
- Direct Handle Matching: Exact match of product handle in URL
- Handle Word Matching: Matching significant parts of multi-word handles
- Title Matching: Matching normalized product title in URL
- Fuzzy Matching: Finding significant words from title in URLs
- Collection URL Matching: Checking for product handle in collection URLs
- Variant Product Detection: Identifying product variants (e.g., with -1, -2 suffixes)
- German Language Handling: Special processing for umlauts and German spellings
- Compound Word Handling: Handling German compound words with various splitting approaches
- Special Case Handling: Targeted matching for difficult German product names
- Path Component Analysis: Analyzing all parts of URL paths for matches
- Health/Medical Terminology: Specialized matching for health-related products
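To make the idea of sequential strategies with confidence scores concrete, here is a simplified sketch covering three of them: direct handle matching, title matching, and a word-based fuzzy match, with umlaut normalization applied first. The function names, the word-length threshold, and the score values (1.0, 0.9, 0.6) are assumptions chosen for illustration; the tool's real strategies and scores may differ.

```python
import re
from urllib.parse import urlparse

# Common German transliterations used in URL slugs.
UMLAUTS = str.maketrans({"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"})


def normalize(text):
    """Lowercase, transliterate umlauts, and collapse other characters to hyphens."""
    text = text.lower().translate(UMLAUTS)
    return re.sub(r"[^a-z0-9]+", "-", text).strip("-")


def match_product(handle, title, urls):
    """Return (url, confidence) for the best match, or None if nothing matches."""
    handle_slug = normalize(handle)
    title_slug = normalize(title)
    # "Significant" words: longer than three characters (an assumed threshold).
    title_words = {w for w in title_slug.split("-") if len(w) > 3}

    best = None
    for url in urls:
        segments = [normalize(s) for s in urlparse(url).path.split("/") if s]
        path_words = {w for seg in segments for w in seg.split("-")}

        if handle_slug in segments:          # direct handle match
            score = 1.0
        elif title_slug in segments:         # normalized title match
            score = 0.9
        elif title_words and title_words <= path_words:  # fuzzy word match
            score = 0.6
        else:
            continue

        if best is None or score > best[1]:
            best = (url, score)
    return best


urls = [
    "https://www.example.com/products/bio-kraeuter-tee",
    "https://www.example.com/collections/tee",
]
print(match_product("bio-kräuter-tee", "Bio Kräuter Tee", urls))
```

Trying the strictest strategy first and falling back to looser ones means a product that matches by handle never receives a lower fuzzy score for the same URL.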
The tool generates three CSV files:
- All extracted URLs from the website
- Matched products with confidence scores
- Unmatched products for further investigation
python product_url_matcher.py https://www.ory-berlin.de
Output:
Loaded 139 product titles from products_export_1.csv
Step 1: Extracting URLs from robots.txt...
Found 1 sitemap URLs in robots.txt
Step 2: Extracting URLs from sitemaps...
Parsing sitemap: https://www.ory-berlin.de/sitemap.xml
Found 366 URLs in sitemap
Step 3: Crawling the website...
Collected a total of 366 unique URLs
Found 125 matches between product titles and URLs
Saved results to CSV files
MIT License
Umer Khan