Sumeh — Unified Data Quality Framework

Sumeh is a unified data quality validation framework supporting multiple backends (PySpark, Dask, Polars, DuckDB, Pandas) with centralized rule configuration.

🚀 Installation

# Using pip
pip install sumeh

# Or with conda-forge
conda install -c conda-forge sumeh

Prerequisites:

  • Python 3.10+
  • One or more of: pyspark, dask[dataframe], polars, duckdb, cuallee

🔍 Core API

  • report(df, rules, name="Quality Check")
    Apply your validation rules over any DataFrame (Pandas, Spark, Dask, Polars, or DuckDB).
  • validate(df, rules) (per-engine)
    Returns two DataFrames: one with aggregated status and one with a dq_status column listing the raw violations.
  • summarize(qc_df, rules, total_rows) (per-engine)
    Consolidates violations into a summary report.

⚙️ Supported Engines

Each engine implements the validate() + summarize() pair:

| Engine | Module | Status |
|--------|--------|--------|
| PySpark | sumeh.engine.pyspark_engine | ✅ Fully implemented |
| Dask | sumeh.engine.dask_engine | ✅ Fully implemented |
| Polars | sumeh.engine.polars_engine | ✅ Fully implemented |
| DuckDB | sumeh.engine.duckdb_engine | ✅ Fully implemented |
| Pandas | sumeh.engine.pandas_engine | ✅ Fully implemented |
| BigQuery (SQL) | sumeh.engine.bigquery_engine | 🔧 Stub implementation |
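
Because every engine exposes the same pair, you can also call an engine module directly instead of dispatching through report(). A minimal sketch, assuming the per-engine validate()/summarize() signatures described under Core API (check your installed version for the exact interface):

import polars as pl
from sumeh.engine import polars_engine

df = pl.DataFrame({"customer_id": [1, 2, None]})
rules = [
    {"field": "customer_id", "check_type": "is_complete", "threshold": 1.0, "value": None, "execute": True},
]

# validate() is described as returning two DataFrames:
# the aggregated status and the raw violations
agg, violations = polars_engine.validate(df, rules)

# summarize() consolidates the violations into a summary report
summary = polars_engine.summarize(violations, rules, total_rows=df.height)
print(summary)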

🏗 Configuration Sources

Load rules from CSV, S3, MySQL, Postgres, BigQuery table, or AWS Glue:

from sumeh import get_rules_config

# ✅ Local CSV
rules = get_rules_config("rules.csv", delimiter=";")

# ✅ S3 (format: s3://bucket/path/to/rules.csv)
rules = get_rules_config("s3://my-bucket/rules.csv", delimiter=";")

# ✅ DuckDB
import duckdb
conn = duckdb.connect("my_db.duckdb")
rules = get_rules_config("duckdb", table="rules", conn=conn)

# ✅ MySQL
rules = get_rules_config("mysql", host="localhost", user="root", password="pass", database="dq", table="rules")

# ✅ PostgreSQL
rules = get_rules_config("postgresql", host="localhost", user="admin", password="pass", database="dq", table="rules")

# ✅ BigQuery
rules = get_rules_config("bigquery", project_id="my-project", dataset_id="dq", table_id="rules")

# ✅ AWS Glue
from awsglue.context import GlueContext
glue_context = GlueContext(...)  # Spark session must be initialized
rules = get_rules_config("glue", glue_context=glue_context, database_name="dq", table_name="rules")

# ✅ Databricks (Delta Table or Hive Metastore)
rules = get_rules_config("databricks", catalog="main", schema="dq", table="rules")

🏃‍♂️ Typical Workflow

import pandas as pd
from sumeh import get_rules_config, report, validate, summarize

# 1) Load your dataset
df = pd.read_parquet("data/searches_1.parquet")  # or read_csv, read_json...

# 2) Load your rules
rules = get_rules_config("rules.csv", delimiter=";")
rules = [r for r in rules if r.get("execute", True)]  # optional filtering

# 3) Run validations
qc_result = report(df, rules, name="Initial Validation")

# 4) Raw and summarized violations
agg_result, raw_violations = validate(df, rules)
summary = summarize(raw_violations, rules, total_rows=len(df))

# 5) Display
print(qc_result)       # from cuallee's CheckResult
print(summary)         # if using Pandas or DuckDB

Or simply:

from sumeh import report, get_rules_config
import pandas as pd

df = pd.read_csv("data.csv")
rules = get_rules_config("rules.csv", delimiter=";")
rules = [r for r in rules if r.get("execute", True)]

result = report(df, rules, name="My Check")
print(result)  # show as DataFrame

📋 Rule Definition Example

{
  "field": "customer_id",
  "check_type": "is_complete",
  "threshold": 0.99,
  "value": null,
  "execute": true
}
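
The same fields map naturally onto a CSV source for get_rules_config("rules.csv", delimiter=";"). The layout below is an illustrative assumption, not a documented schema; match the columns to whatever your rules file actually contains:

field;check_type;value;threshold;execute
customer_id;is_complete;;0.99;true
status;is_contained_in;[active,inactive];1.0;true
signup_date;is_past_date;;1.0;true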

Supported Validation Rules

Numeric checks

| Test | Description |
|------|-------------|
| is_in_millions | Retains rows where the column value is less than 1,000,000 (fails the "in millions" criterion). |
| is_in_billions | Retains rows where the column value is less than 1,000,000,000 (fails the "in billions" criterion). |

Completeness & Uniqueness

| Test | Description |
|------|-------------|
| is_complete | Filters rows where the column value is null. |
| are_complete | Filters rows where any of the specified columns are null. |
| is_unique | Identifies rows with duplicate values in the specified column. |
| are_unique | Identifies rows with duplicate combinations of the specified columns. |
| is_primary_key | Alias for is_unique (checks uniqueness of a single column). |
| is_composite_key | Alias for are_unique (checks combined uniqueness of multiple columns). |

Comparison & Range

| Test | Description |
|------|-------------|
| is_equal | Filters rows where the column is not equal to the provided value (null-safe). |
| is_equal_than | Alias for is_equal. |
| is_between | Filters rows where the column value is outside the numeric range [min, max]. |
| is_greater_than | Filters rows where the column value is ≤ the threshold (fails "greater than"). |
| is_greater_or_equal_than | Filters rows where the column value is < the threshold (fails "greater or equal"). |
| is_less_than | Filters rows where the column value is ≥ the threshold (fails "less than"). |
| is_less_or_equal_than | Filters rows where the column value is > the threshold (fails "less or equal"). |
| is_positive | Filters rows where the column value is < 0 (fails "positive"). |
| is_negative | Filters rows where the column value is ≥ 0 (fails "negative"). |
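
For illustration, hypothetical rule dicts for a few of these checks; the "[min,max]" string encoding of value for is_between is an assumption, not a documented format:

rules = [
    {"field": "age",   "check_type": "is_positive",     "value": None,      "threshold": 1.0, "execute": True},
    {"field": "age",   "check_type": "is_between",      "value": "[18,65]", "threshold": 1.0, "execute": True},
    {"field": "price", "check_type": "is_greater_than", "value": 0,         "threshold": 1.0, "execute": True},
]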

Membership & Pattern

| Test | Description |
|------|-------------|
| is_contained_in | Filters rows where the column value is not in the provided list. |
| not_contained_in | Filters rows where the column value is in the provided list. |
| has_pattern | Filters rows where the column value does not match the specified regex. |
| is_legit | Filters rows where the column value is null or contains whitespace (i.e., does not match \S+). |

Aggregate checks

| Test | Description |
|------|-------------|
| has_min | Returns all rows if the column's minimum fails the check (min < threshold); otherwise returns empty. |
| has_max | Returns all rows if the column's maximum fails the check (max > threshold); otherwise returns empty. |
| has_sum | Returns all rows if the column's sum fails the check (sum > threshold); otherwise returns empty. |
| has_mean | Returns all rows if the column's mean fails the check (mean > threshold); otherwise returns empty. |
| has_std | Returns all rows if the column's standard deviation fails the check (std > threshold); otherwise returns empty. |
| has_cardinality | Returns all rows if the number of distinct values fails the check (count > threshold); otherwise returns empty. |
| has_infogain | Same logic as has_cardinality (proxy for information gain). |
| has_entropy | Same logic as has_cardinality (proxy for entropy). |
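
The all-or-nothing behavior of these aggregate checks can be sketched in plain Pandas; this illustrates the has_max semantics described above and is not Sumeh's actual implementation:

import pandas as pd

df = pd.DataFrame({"amount": [10, 20, 30]})
bound = 25

# If the aggregate fails the check, every row is returned; otherwise none are.
violations = df if df["amount"].max() > bound else df.iloc[0:0]
print(len(violations))  # 3: the max (30) exceeds the bound, so all rows are flagged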

SQL & Schema

| Test | Description |
|------|-------------|
| satisfies | Filters rows where the SQL expression (taken from rule["value"]) is not satisfied. |
| validate_schema | Compares the DataFrame's actual schema against the expected one and returns a match flag plus an error list. |
| validate | Executes a list of named rules and returns two DataFrames: one with aggregated status and one with raw violations. |
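
A hypothetical satisfies rule, carrying the SQL expression in value as described above; using "*" as the field for a row-level expression is an assumption:

{
  "field": "*",
  "check_type": "satisfies",
  "value": "amount > 0 AND status = 'active'",
  "threshold": 1.0,
  "execute": true
}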

Date-related checks

| Test | Description |
|------|-------------|
| is_t_minus_1 | Retains rows where the date in the column is not equal to yesterday (T–1). |
| is_t_minus_2 | Retains rows where the date in the column is not equal to two days ago (T–2). |
| is_t_minus_3 | Retains rows where the date in the column is not equal to three days ago (T–3). |
| is_today | Retains rows where the date in the column is not equal to today. |
| is_yesterday | Retains rows where the date in the column is not equal to yesterday. |
| is_on_weekday | Retains rows where the date in the column falls on a weekend (fails "weekday"). |
| is_on_weekend | Retains rows where the date in the column falls on a weekday (fails "weekend"). |
| is_on_monday | Retains rows where the date in the column is not a Monday. |
| is_on_tuesday | Retains rows where the date in the column is not a Tuesday. |
| is_on_wednesday | Retains rows where the date in the column is not a Wednesday. |
| is_on_thursday | Retains rows where the date in the column is not a Thursday. |
| is_on_friday | Retains rows where the date in the column is not a Friday. |
| is_on_saturday | Retains rows where the date in the column is not a Saturday. |
| is_on_sunday | Retains rows where the date in the column is not a Sunday. |
| validate_date_format | Filters rows where the date doesn't match the expected format or is null. |
| is_future_date | Filters rows where the date in the column is not after today. |
| is_past_date | Filters rows where the date in the column is not before today. |
| is_date_after | Filters rows where the date in the column is not after the date provided in the rule. |
| is_date_before | Filters rows where the date in the column is not before the date provided in the rule. |
| is_date_between | Filters rows where the date in the column is outside the range [start, end]. |
| all_date_checks | Alias for is_past_date (same logic: date before today). |
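
Two hypothetical date rules; the format string for validate_date_format and the "[start,end]" encoding for is_date_between are assumptions about how value is interpreted:

{
  "field": "signup_date",
  "check_type": "validate_date_format",
  "value": "YYYY-MM-DD",
  "threshold": 1.0,
  "execute": true
}

{
  "field": "signup_date",
  "check_type": "is_date_between",
  "value": "[2024-01-01,2024-12-31]",
  "threshold": 1.0,
  "execute": true
}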

📂 Project Layout

sumeh/
├── poetry.lock
├── pyproject.toml
├── README.md
└── sumeh
    ├── __init__.py
    ├── cli.py
    ├── core.py
    ├── engine
    │   ├── __init__.py
    │   ├── bigquery_engine.py
    │   ├── dask_engine.py
    │   ├── duckdb_engine.py
    │   ├── pandas_engine.py
    │   ├── polars_engine.py
    │   └── pyspark_engine.py
    └── services
        ├── __init__.py
        ├── config.py
        ├── index.html
        └── utils.py

📈 Roadmap

  • Complete BigQuery engine implementation
  • ✅ Complete Pandas engine implementation
  • ✅ Enhanced documentation
  • ✅ More validation rule types
  • Performance optimizations

🤝 Contributing

  1. Fork & create a feature branch
  2. Implement new checks or engines, following existing signatures
  3. Add tests under tests/
  4. Open a PR and ensure CI passes

📜 License

Licensed under the Apache License 2.0.
