Releases: NVIDIA-NeMo/Curator
NVIDIA NeMo Curator 0.9.0
Major Features and Enhancements
- New How-to Data Recipes (Tutorials)
  - Multimodal DAPT Curation w/ PDF Extraction
  - Llama Nemotron Data Curation
  - LLM NIM - PII Redaction
- Performance and Code Optimizations
  - Simplified Clustering Logic: Significantly improved semantic deduplication clustering performance
    - Removed convoluted backend switching logic that caused performance issues
    - Eliminated expensive length assertions that could cause timeouts on large datasets
    - Improved GPU utilization during KMeans clustering operations
    - Tested on a 37M-embedding dataset (80 GB) across 7 GPUs with substantial performance gains
Bug Fixes
- FastText Download URL Fix
  - Corrected the `fasttext` model download URL in the nemotron-cc tutorial
  - Changed from `dl.fbaipublicfiles.com/fastText/` to `dl.fbaipublicfiles.com/fasttext/`
  - Ensures reliable model downloads for language identification
- NeMo Retriever Tutorial Bug Fix
  - Fixed lambda function bug in `RetrieverEvalSetGenerator`
  - Corrected score assignment from `df["question"].apply(lambda: 1)` to `df["score"] = 1`
- API Usage Updates
  - Updated examples and tutorials to use the correct `DocumentDataset` API
  - Replaced deprecated `write_to_disk(result, output_dir, output_type="parquet")` with `result.to_parquet(output_dir)`
  - Updated exact deduplication workflows: `deduplicator.remove()` now returns `DocumentDataset` directly
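
The lambda bug fixed in the NeMo Retriever tutorial comes down to a calling convention: pandas `Series.apply` calls the supplied function once per element and passes that element as an argument, so a zero-argument `lambda: 1` raises `TypeError`. The sketch below reproduces the failure in plain Python without pandas. `apply_elementwise` is a hypothetical stand-in for `Series.apply`, and the sample rows are made up for illustration; neither comes from the tutorial itself.

```python
def apply_elementwise(values, func):
    """Hypothetical stand-in for pandas Series.apply: call func once per element."""
    return [func(value) for value in values]


# Made-up sample data standing in for the tutorial's DataFrame columns.
table = {"question": ["What is NeMo Curator?", "What does deduplication do?"]}

# Buggy pattern: the lambda accepts no arguments but receives one element.
try:
    apply_elementwise(table["question"], lambda: 1)
    buggy_error = None
except TypeError as exc:
    buggy_error = exc

# Fixed pattern, analogous to df["score"] = 1: assign a constant column directly.
table["score"] = [1] * len(table["question"])

print(type(buggy_error).__name__)
print(table["score"])
```

Assigning the constant directly avoids the per-element function call altogether, which is why the fix is both correct and simpler than repairing the lambda's signature.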
NVIDIA NeMo Curator 0.8.0
- Llama Based PII Redaction
- Trafilatura Text Extractor
- Chinese & Japanese Stopwords for Text Extractors
- Writing gzip-compressed JSONL datasets
- Training dataset curation for retriever customization using hard-negative mining
- Implemented memory-efficient pairwise similarity computation in Semantic Deduplication
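
For readers unfamiliar with the gzip-compressed JSONL format mentioned in the list above, here is a minimal sketch using only the Python standard library. It illustrates the file format (one JSON object per line, gzip-compressed), not Curator's own writer API. The helper names `write_jsonl_gz` and `read_jsonl_gz`, the file name, and the sample records are assumptions made for this example.

```python
import gzip
import json
import os
import tempfile


def write_jsonl_gz(records, path):
    """Write one JSON object per line to a gzip-compressed text file."""
    with gzip.open(path, "wt", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl_gz(path):
    """Read a gzip-compressed JSONL file back into a list of dicts."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


# Made-up sample records; the field names are illustrative only.
records = [
    {"id": 0, "text": "first document"},
    {"id": 1, "text": "second document"},
]

out_path = os.path.join(tempfile.mkdtemp(), "part-0.jsonl.gz")
write_jsonl_gz(records, out_path)
round_trip = read_jsonl_gz(out_path)
```

Opening the file in text mode (`"wt"` / `"rt"`) lets `gzip` handle compression transparently, so the reading side only has to parse one line at a time.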
NVIDIA NeMo Curator 0.8.0rc3.dev0
Prerelease: NVIDIA NeMo Curator 0.8.0rc3.dev0 (2025-04-15)
NVIDIA NeMo Curator 0.8.0rc2.dev0
Prerelease: NVIDIA NeMo Curator 0.8.0rc2.dev0 (2025-04-07)
NVIDIA NeMo Curator 0.7.1
- Fix Transformers + CUDA Context bug
- Fix rate limit in SDG Retriever Eval Tutorial
NVIDIA NeMo Curator 0.7.0
- Python 3.12 Support
- Curator on Blackwell
- Nemotron-CC Dataset Recipe
- Performant S3 for Fuzzy Deduplication
NVIDIA NeMo Curator 0.7.0rc2.dev0
Prerelease: NVIDIA NeMo Curator 0.7.0rc2.dev0 (2025-02-25)
NVIDIA NeMo Curator 0.7.0rc1.dev1
Prerelease: NVIDIA NeMo Curator 0.7.0rc1.dev1 (2025-02-19)
NVIDIA NeMo Curator 0.7.0rc0.dev1
Prerelease: NVIDIA NeMo Curator 0.7.0rc0.dev1 (2025-02-04)
NVIDIA NeMo Curator 0.6.0
What's changed
- Synthetic Data Generation for Text Retrieval
  - LLM-based Filters
    - Easiness
    - Answerability
  - Q&A Retrieval Generation Pipeline
- Parallel Dataset Curation for Machine Translation
  - Load/Write Bitext Files
  - Heuristic filtering (Histogram, Length Ratio)
  - Classifier filtering (Comet, Cometoid)
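
As a rough illustration of the length-ratio heuristic named above for parallel (bitext) data, the sketch below drops sentence pairs whose source and target lengths differ too much. This is a generic version of the technique, not Curator's implementation. The whitespace tokenization, the `max_ratio=3.0` threshold, and the function names are all assumptions chosen for the example.

```python
def length_ratio(src, tgt):
    """Ratio of the longer side to the shorter side, counted in whitespace tokens."""
    n_src, n_tgt = len(src.split()), len(tgt.split())
    if min(n_src, n_tgt) == 0:
        # An empty side can never be a valid translation pair.
        return float("inf")
    return max(n_src, n_tgt) / min(n_src, n_tgt)


def filter_by_length_ratio(pairs, max_ratio=3.0):
    """Keep only (source, target) pairs whose length ratio is within max_ratio."""
    return [(src, tgt) for src, tgt in pairs if length_ratio(src, tgt) <= max_ratio]


# Made-up sentence pairs for illustration.
pairs = [
    ("Guten Morgen", "Good morning"),
    ("Ja", "Yes , that is absolutely and completely correct"),
    ("", "An empty source side"),
]

kept = filter_by_length_ratio(pairs)
print(kept)
```

Pairs with a very lopsided length ratio are often misaligned or partially translated, so a cheap filter like this is typically run before more expensive classifier-based scoring.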