Releases: NVIDIA-NeMo/Curator

NVIDIA NeMo Curator 0.9.0

28 Jul 20:18
23da8c2

Major Features and Enhancements

  • New How-to Data Recipes (Tutorials)
    • Multimodal DAPT Curation with PDF Extraction
    • Llama Nemotron Data Curation
    • LLM NIM - PII Redaction
  • Performance and Code Optimizations
    • Simplified Clustering Logic: Significantly improved semantic deduplication clustering performance
    • Removed convoluted backend switching logic that caused performance issues
    • Eliminated expensive length assertions that could cause timeouts on large datasets
    • Improved GPU utilization during KMeans clustering operations
    • Tested on a 37M-embedding dataset (80 GB) across 7 GPUs, with substantial performance gains

Bug Fixes

  • FastText Download URL Fix
    • Corrected the fastText model download URL in the nemotron-cc tutorial
    • Changed from dl.fbaipublicfiles.com/fastText/ to dl.fbaipublicfiles.com/fasttext/
    • Ensures reliable model downloads for language identification
  • NeMo Retriever Tutorial Bug Fix
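    • Fixed a lambda function bug in RetrieverEvalSetGenerator
    • Replaced the score assignment df["question"].apply(lambda: 1), whose zero-argument lambda raises a TypeError when apply passes it a value, with the direct assignment df["score"] = 1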
  • API Usage Updates
    • Updated examples and tutorials to use correct DocumentDataset API
    • Replaced deprecated write_to_disk(result, output_dir, output_type="parquet") with result.to_parquet(output_dir)
    • Updated exact deduplication workflows: deduplicator.remove() now returns a DocumentDataset directly

NVIDIA NeMo Curator 0.8.0

09 May 01:11
cf12d34
  • Llama Based PII Redaction
  • Trafilatura Text Extractor
  • Chinese & Japanese Stopwords for Text Extractors
  • Writing gzip-compressed JSONL datasets
  • Training dataset curation for retriever customization using hard-negative mining
  • Memory-efficient pairwise similarity computation in Semantic Deduplication
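
For readers unfamiliar with the gzip-compressed JSONL format mentioned above, here is a minimal standard-library sketch of writing and reading such a file. It does not use or describe Curator's own writer; the helper names `write_jsonl_gz` and `read_jsonl_gz`, the file name, and the sample records are all assumptions made for illustration.

```python
import gzip
import json
import os
import tempfile


def write_jsonl_gz(records, path):
    """Write an iterable of dicts as one JSON object per line, gzip-compressed."""
    with gzip.open(path, "wt", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl_gz(path):
    """Read a gzip-compressed JSON Lines file back into a list of dicts."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Made-up sample records for a round-trip check.
records = [
    {"id": 0, "text": "first document"},
    {"id": 1, "text": "second document"},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "part.0.jsonl.gz")
    write_jsonl_gz(records, out_path)
    round_trip = read_jsonl_gz(out_path)
```

Opening the file in text mode (`"wt"` / `"rt"`) lets `gzip` handle encoding, so each line stays a plain JSON document once decompressed.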

NVIDIA NeMo Curator 0.8.0rc3.dev0

15 Apr 19:44
cff3cb6
Pre-release

Prerelease: NVIDIA NeMo Curator 0.8.0rc3.dev0 (2025-04-15)

NVIDIA NeMo Curator 0.8.0rc2.dev0

07 Apr 20:15
8cbd68f
Pre-release

Prerelease: NVIDIA NeMo Curator 0.8.0rc2.dev0 (2025-04-07)

NVIDIA NeMo Curator 0.7.1

31 Mar 22:52
d0cc62d
  • Fix Transformers + CUDA Context bug
  • Fix rate limit in SDG Retriever Eval Tutorial

NVIDIA NeMo Curator 0.7.0

12 Mar 21:22
f207c99
  • Python 3.12 Support
  • Curator on Blackwell
  • Nemotron-CC Dataset Recipe
  • Performant S3 for Fuzzy Deduplication

NVIDIA NeMo Curator 0.7.0rc2.dev0

25 Feb 13:12
6a05d29
Pre-release

Prerelease: NVIDIA NeMo Curator 0.7.0rc2.dev0 (2025-02-25)

NVIDIA NeMo Curator 0.7.0rc1.dev1

19 Feb 18:21
c3ebcb5
Pre-release

Prerelease: NVIDIA NeMo Curator 0.7.0rc1.dev1 (2025-02-19)

NVIDIA NeMo Curator 0.7.0rc0.dev1

04 Feb 21:41
7ab04ce
Pre-release

Prerelease: NVIDIA NeMo Curator 0.7.0rc0.dev1 (2025-02-04)

NVIDIA NeMo Curator 0.6.0

07 Jan 15:41
4f25a91

What's changed

  • Synthetic Data Generation for Text Retrieval
    • LLM-based Filters
      • Easiness
      • Answerability
    • Q&A Retrieval Generation Pipeline
  • Parallel Dataset Curation for Machine Translation
    • Load/Write Bitext Files
    • Heuristic filtering (Histogram, Length Ratio)
    • Classifier filtering (Comet, Cometoid)
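
As a rough illustration of the length-ratio heuristic named above, the sketch below drops sentence pairs whose source and target lengths differ too much. It is not Curator's implementation: the function names, the whitespace tokenization, and the `max_ratio=3.0` threshold are assumptions chosen for the example.

```python
def length_ratio(src, tgt):
    """Ratio of the longer side to the shorter side, counted in whitespace tokens."""
    src_len = len(src.split())
    tgt_len = len(tgt.split())
    if min(src_len, tgt_len) == 0:
        # An empty side can never be a valid translation pair.
        return float("inf")
    return max(src_len, tgt_len) / min(src_len, tgt_len)


def filter_by_length_ratio(pairs, max_ratio=3.0):
    """Keep only (source, target) pairs whose length ratio is at most max_ratio."""
    return [(src, tgt) for src, tgt in pairs if length_ratio(src, tgt) <= max_ratio]


# Made-up bitext samples.
pairs = [
    ("Good morning", "Guten Morgen"),              # 2 vs 2 tokens
    ("Hello", "Hallo , wie geht es dir heute ?"),  # 1 vs 8 tokens
    ("", "leer"),                                  # empty source side
]

kept = filter_by_length_ratio(pairs)
```

Only the first pair survives here; the second exceeds the assumed threshold and the third has an empty side. A production filter would likely use a language-aware tokenizer rather than whitespace splitting.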