dataprep-ai

Why we built `dataprep-ai`

Every data project starts with excitement — and then comes the messy part:
missing values, inconsistent categories, outliers, and duplicates.

We found ourselves writing the same boilerplate pandas/Polars code again and again, just to get to the real work: analysis, modeling, insight. It was frustrating, repetitive, and error-prone.

So we asked: What if data cleaning could be a one-liner?
That’s how dataprep-ai was born.

What it does

🧹 Cleans your dataset in one line
🏷 Normalizes messy categories (NY → New York, etc.)
🔍 Handles missing values & outliers with smart strategies
📊 Produces transparent logs and reproducible reports
⚡ Works with pandas and Polars out of the box
🛠 CI-tested and PyPI-ready
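
To make the category normalization item above (NY → New York) concrete, here is a minimal sketch in plain pandas. It is not dataprep-ai's implementation, which this README does not show. The alias table `CITY_ALIASES` and the helper `normalize_categories` are hypothetical names used only for illustration.

```python
import pandas as pd

# Hypothetical alias table; dataprep-ai's real mapping logic is not shown in this README.
CITY_ALIASES = {"ny": "New York", "nyc": "New York", "new york": "New York"}


def normalize_categories(series: pd.Series, aliases: dict) -> pd.Series:
    """Map known aliases to one canonical label; leave unknown and missing values alone."""
    # Strip and lowercase so "NY", " ny " and "Ny" all share one lookup key.
    keys = series.str.strip().str.lower()
    # Keys without an alias map to NaN, so fall back to the original value there.
    return keys.map(aliases).fillna(series)


cities = pd.Series(["NY", "New York", "nyc", None, "Boston"])
normalized = normalize_categories(cities, CITY_ALIASES)
print(normalized.tolist())
```

The three New York spellings collapse to a single label, while the missing entry and the unmapped "Boston" pass through unchanged.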

dataprep-ai

One-line, opinionated data cleaning for pandas/Polars.
Fix missing values, inconsistent categories, outliers, and duplicates with transparent logs and a reproducible report.

Installation

pip install dataprep-ai

For the optional explorer app:
pip install "dataprep-ai[app]"

Requirements

Python: 3.9 – 3.12

OS: Linux, macOS, Windows

Required libs (auto-installed): pandas, numpy, pyarrow, scikit-learn, pydantic, rich

Optional:

polars: enables Polars round-trip I/O (used automatically where supported)

streamlit, matplotlib — only needed for the explorer

Quickstart

import pandas as pd
from dataprep_ai import clean, CleaningConfig

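# Toy frame with missing values, extreme values, inconsistent city labels, and a repeated id
df = pd.DataFrame({
    "age": [23, None, 25, 1000],
    "income": [52000, 58000, None, 1200000],
    "city": ["NY", "New York", "nyc", None],
    "id": [1, 2, 2, 4],
})

result = clean(df, CleaningConfig(
    id_columns=["id"],
    outlier_strategy="iqr_cap",      # cap outliers using IQR-based limits
    categorical_normalization=True,  # unify labels such as "NY" and "nyc"
    drop_duplicates=False,           # keep duplicate rows
))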

print(result.summary_markdown)  # see cleaning report
df_clean = result.df            # cleaned DataFrame
result.to_json("clean_report.json")
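
For readers unfamiliar with IQR capping, the idea behind a strategy named `iqr_cap` can be sketched in plain pandas. This illustrates the general technique only. Whether dataprep-ai uses the same fences or the same multiplier is an assumption, and the helper `iqr_cap` and the sample series below are hypothetical.

```python
import pandas as pd


def iqr_cap(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Clip values to [Q1 - k*IQR, Q3 + k*IQR]; missing values are left as NaN."""
    q1 = series.quantile(0.25)  # quantile() ignores NaN
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    return series.clip(lower=lower, upper=upper)


# Hypothetical sample: one missing value and one extreme value.
ages = pd.Series([22, 23, None, 24, 25, 26, 27, 1000], dtype="float64")
capped = iqr_cap(ages)
print(capped.tolist())  # the 1000 entry is pulled down to the upper fence, 31.0
```

Capping keeps the row and replaces only the extreme value, which preserves the row count at the cost of distorting the tail of the distribution.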

Streamlit Explorer

pip install "dataprep-ai[app]"
streamlit run -m dataprep_ai.explore -- --csv your.csv

Backends

Input = pandas.DataFrame → Output = pandas.DataFrame

Input = polars.DataFrame → Output = polars.DataFrame (internally converts via pandas in v0.1)
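
The "same type in, same type out" behaviour described above can be pictured with a small dispatch wrapper. This is a sketch under assumptions, not the library's code: `run_with_matching_backend` and `drop_missing_rows` are hypothetical names, and the pandas-only cleaning step stands in for whatever dataprep-ai does internally.

```python
import pandas as pd


def run_with_matching_backend(df, clean_pandas):
    """Apply a pandas-only cleaning function and return the frame type that came in."""
    if isinstance(df, pd.DataFrame):
        return clean_pandas(df)
    # Import polars lazily so pandas-only installs never need it.
    import polars as pl
    if isinstance(df, pl.DataFrame):
        # Polars -> pandas -> clean -> Polars (to_pandas() requires pyarrow).
        return pl.from_pandas(clean_pandas(df.to_pandas()))
    raise TypeError(f"unsupported frame type: {type(df).__name__}")


def drop_missing_rows(frame: pd.DataFrame) -> pd.DataFrame:
    """Stand-in cleaning step: drop rows that contain missing values."""
    return frame.dropna().reset_index(drop=True)


pandas_in = pd.DataFrame({"age": [23.0, None, 25.0]})
pandas_out = run_with_matching_backend(pandas_in, drop_missing_rows)
print(type(pandas_out).__name__, len(pandas_out))
```

Passing a `polars.DataFrame` takes the second branch and returns a `polars.DataFrame`. Because that conversion copies the data through pandas, it costs extra memory on large frames.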

License

Apache-2.0
