ProductURLMapper

Overview

ProductURLMapper is a web scraping and product matching tool for e-commerce sites, with a particular focus on German-language content. It extracts URLs from a website, compares them with product titles from a CSV file, and identifies matches using 11 matching strategies, rating each match with a confidence score.

Key Features

Multi-source URL Extraction: Harvests URLs from robots.txt, sitemaps, and direct web crawling
Advanced German Language Support: Handles German-specific challenges like umlauts, compound words, and specialized terminology
Multiple Matching Strategies: Employs 11 different matching algorithms with confidence scoring
Health/Medical Terminology Optimization: Special handling for health and wellness product terminology
Detailed Reporting: Generates comprehensive CSV reports of matched and unmatched products
Confidence Scoring: Rates each match with a confidence score to prioritize the most reliable matches

Installation

Prerequisites

Python 3.8 or higher
pip package manager

Setup

Clone the repository:

git clone https://github.com/umerkhan95/ProductURLMapper.git
cd ProductURLMapper

Create a virtual environment:

python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

Install dependencies:

pip install -r requirements.txt

Usage

Basic Command

python product_url_matcher.py https://www.example.com

With Custom CSV File

python product_url_matcher.py https://www.example.com --csv_path /path/to/your/products.csv

Required CSV Format

The CSV file must contain at least these columns:

Handle: Product identifier/slug
Title: Product title/name
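
To illustrate the expected input, here is a minimal sketch of reading the two required columns with Python's standard csv module. The function name load_products, the sample rows, and the choice to skip rows with an empty Handle or Title are assumptions for illustration, not the tool's actual code.

```python
import csv
import io

REQUIRED_COLUMNS = {"Handle", "Title"}


def load_products(csv_file):
    """Read (handle, title) pairs from an open CSV file object.

    Hypothetical helper: raises ValueError if a required column is missing.
    """
    reader = csv.DictReader(csv_file)
    missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"CSV is missing required columns: {sorted(missing)}")

    products = []
    for row in reader:
        handle = (row["Handle"] or "").strip()
        title = (row["Title"] or "").strip()
        # Assumption: rows without both values are not useful for matching.
        if handle and title:
            products.append((handle, title))
    return products


# Extra columns (here: Vendor) are allowed and ignored.
sample = io.StringIO(
    "Handle,Title,Vendor\n"
    "bio-kurkuma-kapseln,Bio Kurkuma Kapseln,Example\n"
    "gruener-tee,Grüner Tee,Example\n"
)
products = load_products(sample)
print(products)
```

With a real file, the same function could be called as load_products(open(path, newline="", encoding="utf-8")).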

How It Works

URL Collection

Extracts sitemap URLs from robots.txt
Parses sitemaps to collect product URLs
Crawls the website to find additional URLs
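
The first two steps above can be sketched with the standard library alone. This is an illustrative outline, not the tool's implementation: the function names are hypothetical, the inputs are already-downloaded text, and the HTTP fetching and crawling steps are left out.

```python
import re
import xml.etree.ElementTree as ET


def sitemaps_from_robots(robots_txt):
    """Return the URLs listed on 'Sitemap:' lines of a robots.txt body."""
    urls = []
    for line in robots_txt.splitlines():
        match = re.match(r"\s*sitemap\s*:\s*(\S+)", line, flags=re.IGNORECASE)
        if match:
            urls.append(match.group(1))
    return urls


def urls_from_sitemap(sitemap_xml):
    """Return every <loc> value in a sitemap or sitemap index document."""
    root = ET.fromstring(sitemap_xml)
    # Compare the local tag name so the sitemap XML namespace does not matter.
    # Note: this also picks up <loc> elements from sitemap extensions.
    return [
        el.text.strip()
        for el in root.iter()
        if el.tag.rsplit("}", 1)[-1] == "loc" and el.text
    ]


robots = """User-agent: *
Disallow: /cart
Sitemap: https://www.example.com/sitemap.xml
"""

sitemap = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://www.example.com/products/gruener-tee</loc></url>
  <url><loc>https://www.example.com/collections/tee</loc></url>
</urlset>"""

sitemap_urls = sitemaps_from_robots(robots)
page_urls = urls_from_sitemap(sitemap)
print(sitemap_urls)
print(page_urls)
```

Because a sitemap index also uses loc elements, the same parser can be applied to it to discover nested sitemaps.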

Matching Strategies

The tool applies the following 11 matching strategies in sequence:

Direct Handle Matching: Exact match of product handle in URL
Handle Word Matching: Matching significant parts of multi-word handles
Title Matching: Matching normalized product title in URL
Fuzzy Matching: Finding significant words from title in URLs
Collection URL Matching: Checking for product handle in collection URLs
Variant Product Detection: Identifying product variants (e.g., with -1, -2 suffixes)
German Language Handling: Special processing for umlauts and German spellings
Compound Word Handling: Handling German compound words with various splitting approaches
Special Case Handling: Targeted matching for difficult German product names
Path Component Analysis: Analyzing all parts of URL paths for matches
Health/Medical Terminology: Specialized matching for health-related products
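
As a rough sketch of how a few of these strategies could be chained, the example below normalizes German umlauts and then tries direct handle matching, title matching, and variant detection in that order. The function names, strategy labels, and confidence values are assumptions chosen for illustration; the tool's actual algorithms and scores may differ.

```python
import re
from urllib.parse import unquote, urlparse

# Common transliteration of German umlauts and sharp s in URL slugs.
UMLAUT_MAP = str.maketrans({"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"})


def slugify(text):
    """Lowercase, transliterate umlauts, join alphanumeric runs with hyphens."""
    text = text.lower().translate(UMLAUT_MAP)
    return re.sub(r"[^a-z0-9]+", "-", text).strip("-")


def is_variant(handle_slug, segment):
    """True for path segments like '<handle>-1' or '<handle>-2'."""
    return re.fullmatch(re.escape(handle_slug) + r"-\d+", segment) is not None


def match_product(handle, title, urls):
    """Try strategies in order; return (url, strategy, confidence) or None."""
    handle_slug = slugify(handle)
    title_slug = slugify(title)
    # Illustrative confidence values: stricter strategies score higher.
    strategies = [
        ("direct_handle", 1.0, lambda segs: handle_slug in segs),
        ("title", 0.9, lambda segs: title_slug in segs),
        ("variant", 0.8, lambda segs: any(is_variant(handle_slug, s) for s in segs)),
    ]
    for name, confidence, test in strategies:
        for url in urls:
            path = urlparse(url).path
            # Decode percent-encoded characters before normalizing each segment.
            segments = [slugify(unquote(s)) for s in path.split("/") if s]
            if test(segments):
                return url, name, confidence
    return None


urls = [
    "https://www.example.com/collections/tee",
    "https://www.example.com/products/gr%C3%BCner-tee",
    "https://www.example.com/products/bio-kurkuma-kapseln-1",
]
tea = match_product("gruener-tee", "Grüner Tee", urls)
capsules = match_product("bio-kurkuma-kapseln", "Bio Kurkuma Kapseln", urls)
missing = match_product("matcha", "Matcha Pulver", urls)
print(tea, capsules, missing)
```

Running the strategies from strictest to loosest means a product is assigned the most reliable match available before weaker heuristics are considered.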

Output Files

The tool generates three CSV files:

All extracted URLs from the website
Matched products with confidence scores
Unmatched products for further investigation
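
A minimal sketch of producing three such reports with the csv module is shown below. The file names, column names, and the write_reports helper are hypothetical, since the actual output names and layout are not stated here.

```python
import csv
from pathlib import Path


def write_reports(out_dir, all_urls, matches, unmatched):
    """Write the three reports into out_dir and return their paths.

    matches: iterable of (handle, title, url, confidence) tuples.
    unmatched: iterable of (handle, title) tuples.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    # Hypothetical file names and headers, chosen for this example only.
    tables = {
        "all_urls.csv": (["url"], [[u] for u in all_urls]),
        "matched_products.csv": (
            ["handle", "title", "url", "confidence"],
            # Most reliable matches first.
            sorted(matches, key=lambda row: row[3], reverse=True),
        ),
        "unmatched_products.csv": (["handle", "title"], list(unmatched)),
    }

    paths = []
    for name, (header, rows) in tables.items():
        path = out_dir / name
        with path.open("w", newline="", encoding="utf-8") as fh:
            writer = csv.writer(fh)
            writer.writerow(header)
            writer.writerows(rows)
        paths.append(path)
    return paths
```

Sorting the matched report by confidence puts the most reliable matches at the top, so low-confidence rows at the bottom can be reviewed by hand.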

Example

python product_url_matcher.py https://www.ory-berlin.de

Output:

Loaded 139 product titles from products_export_1.csv
Step 1: Extracting URLs from robots.txt...
Found 1 sitemap URLs in robots.txt
Step 2: Extracting URLs from sitemaps...
Parsing sitemap: https://www.ory-berlin.de/sitemap.xml
Found 366 URLs in sitemap
Step 3: Crawling the website...
Collected a total of 366 unique URLs
Found 125 matches between product titles and URLs
Saved results to CSV files

License

MIT License

Author

Umer Khan
