add features parameter when loading from text/json/pandas/csv or when using the map transform
add support for nested features for json
add DatasetDict object with map/filter/sort/shuffle, that is useful when loading several splits of a dataset
add support for post processing Dataset objects in dataset scripts. This is used in Wiki DPR to attach a faiss index to the dataset, in order to be able to query passages for Open Domain QA for example
add indexing using FAISS or ElasticSearch:
add add_faiss_index and add_elasticsearch_index methods
add get_nearest_examples and get_nearest_examples_batch to query the index and return examples
add search and search_batch to query the index and return examples ids
add save_faiss_index/load_faiss_index to save/load a serialized faiss index
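The difference between the two query families above (search returns example ids, get_nearest_examples returns the examples themselves) can be illustrated with a conceptual sketch in plain Python. This is not the library's implementation: a brute-force inner-product scan stands in for a FAISS or ElasticSearch index, and the names toy_dataset, toy_search and toy_get_nearest_examples, as well as the (scores, ...) return shape, are assumptions made for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical toy dataset in columnar form, with one embedding per row.
toy_dataset: Dict[str, list] = {
    "text": ["cats purr", "dogs bark", "birds sing"],
    "embeddings": [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]],
}


def toy_search(query: List[float], k: int = 2) -> Tuple[List[float], List[int]]:
    """Return (scores, ids) of the k rows with the highest inner product."""
    scores = [
        sum(q * e for q, e in zip(query, emb))
        for emb in toy_dataset["embeddings"]
    ]
    # Rank row ids by descending score and keep the top k.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return [scores[i] for i in ranked], ranked


def toy_get_nearest_examples(query: List[float], k: int = 2) -> Tuple[List[float], Dict[str, list]]:
    """Return (scores, examples): same ranking, but rows are looked up by id."""
    scores, ids = toy_search(query, k)
    examples = {col: [values[i] for i in ids] for col, values in toy_dataset.items()}
    return scores, examples


scores, ids = toy_search([1.0, 0.2])
_, examples = toy_get_nearest_examples([1.0, 0.2])
print(ids)
print(examples["text"])
```

The ids-only variant is enough when the caller already holds the dataset and wants to index into it; the examples variant saves that extra lookup.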
Datasets changes
new: PG19
new: ANLI
new: WikiSQL
new: qa_zre
new: MWSC
new: AG news
new: SQuADShifts
new: doc red
new: Wiki DPR
new: fever
new: hyperpartisan news detection
new: pandas
new: text
new: emotion
new: quora
new: BioMRC
new: web questions
new: search QA
new: LinCE
new: TREC
new: Style Change Detection
new: 20newsgroup
new: social bias frames
new: Emo
new: web of science
new: sogou news
new: crd3
update: xtreme - PAN-X features changed format. Previously each sample was a word/tag pair, and now each sample is a sentence with word/tag pairs.
update: xtreme - add PAWS-X.es
update: xsum - manual download is no longer required.
new processed: Natural Questions
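The xtreme PAN-X format change listed above (one word/tag pair per sample before, one sentence per sample now) can be pictured with the plain-Python sketch below. The field names word, ner_tag, words and ner_tags and the helper to_sentence_sample are hypothetical, chosen only for illustration; the actual feature names in the dataset may differ.

```python
from typing import Dict, List

# Old layout (assumed shape): every sample holds a single word/tag pair.
old_samples: List[Dict[str, str]] = [
    {"word": "Paris", "ner_tag": "B-LOC"},
    {"word": "is", "ner_tag": "O"},
    {"word": "nice", "ner_tag": "O"},
]


def to_sentence_sample(pairs: List[Dict[str, str]]) -> Dict[str, List[str]]:
    """Collect the word/tag pairs of one sentence into a single sample."""
    return {
        "words": [pair["word"] for pair in pairs],
        "ner_tags": [pair["ner_tag"] for pair in pairs],
    }


# New layout (assumed shape): one sample per sentence, with aligned lists.
new_sample = to_sentence_sample(old_samples)
print(new_sample)
```

Code that iterated over samples and expected a single word per example would need to iterate over the aligned lists instead.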
Metrics Features
add seed parameter for metrics that do sampling, like rouge
better installation messages
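Why a seed matters for sampling-based metrics can be shown with a generic sketch. The function below is not the rouge implementation and not the library's API: bootstrap_mean is a hypothetical helper that aggregates per-example scores by bootstrap resampling, with the seed controlling the random draws.

```python
import random
from typing import List, Optional


def bootstrap_mean(scores: List[float], n_resamples: int = 100,
                   seed: Optional[int] = None) -> float:
    """Average the means of n_resamples bootstrap resamples of scores."""
    rng = random.Random(seed)  # a fixed seed makes the draws repeatable
    means = []
    for _ in range(n_resamples):
        # Draw len(scores) values with replacement.
        resample = [rng.choice(scores) for _ in scores]
        means.append(sum(resample) / len(resample))
    return sum(means) / len(means)


per_example_scores = [0.1, 0.4, 0.5, 0.9]
first = bootstrap_mean(per_example_scores, seed=42)
second = bootstrap_mean(per_example_scores, seed=42)
print(first == second)
```

With the same seed, two runs resample identically and report the same aggregate; without a seed, the reported value can vary slightly from run to run.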
Metrics changes
new: bleurt
update seqeval: fix entities extraction (more info here)
Bug fixes
fix bug in map and select that was causing memory issues
fix pyarrow version check
fix text/json/pandas/csv caching when loading different files in a row
fix metrics caching when they have different config names
fix cache that was not discarded when there's a KeyboardInterrupt during .map
fix sacrebleu tokenizer's parameter
fix docstrings of metrics when multiple instances are created
More Tests
add tests for features handling in dataset transforms
add tests for dataset builders
add tests for metrics loading
Backward compatibility
because there are changes in the dataset_info.json file format, old versions of the lib (<0.4.0) won't be able to load datasets with a post processing field in dataset_info.json