
Commit 7963c7e

Merge branch 'master' into arjunsuresh-patch-2
2 parents 9b2d0cc + e064826 commit 7963c7e

3 files changed: +86 -75 lines changed


language/deepseek-r1/README.md

Lines changed: 37 additions & 30 deletions
@@ -1,6 +1,6 @@
-# Mlperf Inference DeepSeek Reference Implementation
+# MLPerf Inference DeepSeek Reference Implementation
 
-## Automated command to run the benchmark via MLFlow
+## Automated command to run the benchmark via MLCFlow
 
 
 Please see the [new docs site](https://docs.mlcommons.org/inference/benchmarks/language/deepseek-r1/) for an automated way to run this benchmark across different available implementations and do an end-to-end submission with or without docker.

@@ -13,6 +13,22 @@ You can also do pip install mlc-scripts and then use `mlcr` commands for downloa
 - DeepSeek-R1 model is automatically downloaded as part of setup
 - Checkpoint conversion is done transparently when needed.
 
+**Using the MLC R2 Downloader**
+
+Download the model using the MLCommons R2 Downloader:
+
+```bash
+bash <(curl -s https://raw.githubusercontent.com/mlcommons/r2-downloader/refs/heads/main/mlc-r2-downloader.sh) \
+  https://inference.mlcommons-storage.org/metadata/deepseek-r1-0528.uri
+```
+
+To specify a custom download directory, use the `-d` flag:
+```bash
+bash <(curl -s https://raw.githubusercontent.com/mlcommons/r2-downloader/refs/heads/main/mlc-r2-downloader.sh) \
+  -d /path/to/download/directory \
+  https://inference.mlcommons-storage.org/metadata/deepseek-r1-0528.uri
+```
+
 ## Dataset Download
 
 The dataset is an ensemble of the datasets: AIME, MATH500, gpqa, MMLU-Pro, livecodebench(code_generation_lite). They are covered by the following licenses:
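Aside: every downloader invocation in this commit has the same shape — the downloader script URL, an optional `-d` target directory, and a metadata URI. A small wrapper makes that pattern explicit. This is a sketch for illustration only: the `r2_fetch` helper and `METADATA_BASE` are assumptions, not part of the reference implementation, and the real download call is left commented out so the sketch has no side effects.

```shell
# Illustrative helper only; `r2_fetch` and METADATA_BASE are assumptions,
# not part of the MLPerf reference scripts.
METADATA_BASE="https://inference.mlcommons-storage.org/metadata"
DOWNLOADER="https://raw.githubusercontent.com/mlcommons/r2-downloader/refs/heads/main/mlc-r2-downloader.sh"

r2_fetch() {
  # $1: metadata file name; $2: optional target directory (defaults to .)
  local uri="${METADATA_BASE}/$1"
  local dest="${2:-.}"
  echo "fetching ${uri} into ${dest}"
  # Real invocation (commented out so this sketch stays side-effect free):
  # bash <(curl -s "${DOWNLOADER}") -d "${dest}" "${uri}"
}

r2_fetch deepseek-r1-0528.uri /path/to/download/directory
```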
@@ -23,49 +39,40 @@ The dataset is an ensemble of the datasets: AIME, MATH500, gpqa, MMLU-Pro, livec
 - MMLU-Pro: [MIT](https://opensource.org/license/mit)
 - livecodebench(code_generation_lite): [CC](https://creativecommons.org/share-your-work/cclicenses/)
 
-### Preprocessed
-
-**Using MLCFlow Automation**
-
-```
-mlcr get,dataset,whisper,_preprocessed,_mlc,_rclone --outdirname=<path to download> -j
-```
+### Preprocessed & Calibration
 
-**Using Native method**
+**Using the MLC R2 Downloader**
 
-You can use Rclone to download the preprocessed dataset from a Cloudflare R2 bucket.
+Download the full preprocessed dataset and calibration dataset using the MLCommons R2 Downloader:
 
-To run Rclone on Windows, you can download the executable [here](https://rclone.org/install/#windows).
-To install Rclone on Linux/macOS/BSD systems, run:
-```
-sudo -v ; curl https://rclone.org/install.sh | sudo bash
-```
-Once Rclone is installed, run the following command to authenticate with the bucket:
-```
-rclone config create mlc-inference s3 provider=Cloudflare access_key_id=f65ba5eef400db161ea49967de89f47b secret_access_key=fbea333914c292b854f14d3fe232bad6c5407bf0ab1bebf78833c2b359bdfd2b endpoint=https://c2686074cb2caf5cbaf6d134bdba8b47.r2.cloudflarestorage.com
+```bash
+bash <(curl -s https://raw.githubusercontent.com/mlcommons/r2-downloader/refs/heads/main/mlc-r2-downloader.sh) \
+  -d ./ https://inference.mlcommons-storage.org/metadata/deepseek-r1-datasets-fp8-eval.uri
 ```
-You can then navigate in the terminal to your desired download directory and run the following command to download the dataset:
 
-```
-rclone copy mlc-inference:mlcommons-inference-wg-public/deepseek_r1/datasets/mlperf_deepseek_r1_dataset_4388_fp8_eval.pkl ./ -P
+This will download the full preprocessed dataset file (`mlperf_deepseek_r1_dataset_4388_fp8_eval.pkl`) and the calibration dataset file (`mlperf_deepseek_r1_calibration_dataset_500_fp8_eval.pkl`).
+
+To specify a custom download directory, use the `-d` flag:
+```bash
+bash <(curl -s https://raw.githubusercontent.com/mlcommons/r2-downloader/refs/heads/main/mlc-r2-downloader.sh) \
+  -d /path/to/download/directory \
+  https://inference.mlcommons-storage.org/metadata/deepseek-r1-datasets-fp8-eval.uri
 ```
 
-### Calibration
+### Preprocessed
 
 **Using MLCFlow Automation**
 
 ```
-mlcr get,preprocessed,dataset,deepseek-r1,_calibration,_mlc,_rclone --outdirname=<path to download> -j
+mlcr get,preprocessed,dataset,deepseek-r1,_validation,_mlc,_r2-downloader --outdirname=<path to download> -j
 ```
 
-**Using Native method**
-
-Download and install Rclone as described in the previous section.
+### Calibration
 
-Then navigate in the terminal to your desired download directory and run the following command to download the dataset:
+**Using MLCFlow Automation**
 
 ```
-rclone copy mlc-inference:mlcommons-inference-wg-public/deepseek_r1/datasets/mlperf_deepseek_r1_calibration_dataset_500_fp8_eval.pkl ./ -P
+mlcr get,preprocessed,dataset,deepseek-r1,_calibration,_mlc,_r2-downloader --outdirname=<path to download> -j
 ```
 
 ## Docker
@@ -204,7 +211,7 @@ The following table shows which backends support different evaluation and MLPerf
 **Using MLCFlow Automation**
 
 ```
-TBD
+mlcr run,accuracy,mlperf,_dataset_deepseek-r1 --result_dir=<Path to directory where files are generated after the benchmark run>
 ```
 
 **Using Native method**

language/llama3.1-8b/README.md

Lines changed: 24 additions & 26 deletions
@@ -104,7 +104,7 @@ You need to request for access to [MLCommons](http://llama3-1.mlcommons.org/) an
 **Official Model download using MLCFlow Automation**
 You can download the model automatically via the below command
 ```
-TBD
+mlcr get,ml-model,llama3,_mlc,_8b,_r2-downloader --outdirname=<path to download> -j
 ```
 
 
@@ -137,59 +137,57 @@ Downloading llama3.1-8b model from Hugging Face will require an [**access token*
 
 ### Preprocessed
 
-You can use Rclone to download the preprocessed dataset from a Cloudflare R2 bucket.
-
-To run Rclone on Windows, you can download the executable [here](https://rclone.org/install/#windows).
-To install Rclone on Linux/macOS/BSD systems, run:
-```
-sudo -v ; curl https://rclone.org/install.sh | sudo bash
-```
-Once Rclone is installed, run the following command to authenticate with the bucket:
-```
-rclone config create mlc-inference s3 provider=Cloudflare access_key_id=f65ba5eef400db161ea49967de89f47b secret_access_key=fbea333914c292b854f14d3fe232bad6c5407bf0ab1bebf78833c2b359bdfd2b endpoint=https://c2686074cb2caf5cbaf6d134bdba8b47.r2.cloudflarestorage.com
-```
-You can then navigate in the terminal to your desired download directory and run the following command to download the dataset:
+Download the preprocessed datasets using the MLCommons downloader:
 
 #### Full dataset (datacenter)
 
 **Using MLCFlow Automation**
 ```
-mlcr get,dataset,cnndm,_validation,_datacenter,_llama3,_mlc,_rclone --outdirname=<path to download> -j
+mlcr get,dataset,cnndm,_validation,_datacenter,_llama3,_mlc,_r2-downloader --outdirname=<path to download> -j
 ```
 
 **Native method**
+```bash
+bash <(curl -s https://raw.githubusercontent.com/mlcommons/r2-downloader/refs/heads/main/mlc-r2-downloader.sh) \
+  https://inference.mlcommons-storage.org/metadata/llama3-1-8b-cnn-eval.uri
 ```
-rclone copy mlc-inference:mlcommons-inference-wg-public/llama3.1_8b/datasets/cnn_eval.json ./ -P
-```
+This will download `cnn_eval.json`.
 
 #### 5000 samples (edge)
 
 **Using MLCFlow Automation**
 ```
-mlcr get,dataset,cnndm,_validation,_edge,_llama3,_mlc,_rclone --outdirname=<path to download> -j
+mlcr get,dataset,cnndm,_validation,_edge,_llama3,_mlc,_r2-downloader --outdirname=<path to download> -j
 ```
 
 **Native method**
+```bash
+bash <(curl -s https://raw.githubusercontent.com/mlcommons/r2-downloader/refs/heads/main/mlc-r2-downloader.sh) \
+  https://inference.mlcommons-storage.org/metadata/llama3-1-8b-sample-cnn-eval-5000.uri
 ```
-rclone copy mlc-inference:mlcommons-inference-wg-public/llama3.1_8b/datasets/cnn_eval_5000.json ./ -P
-```
+
+This will download `sample_cnn_eval_5000.json`.
+
 
 #### Calibration
 
 **Using MLCFlow Automation**
 ```
-mlcr get,dataset,cnndm,_calibration,_llama3,_mlc,_rclone --outdirname=<path to download> -j
+mlcr get,dataset,cnndm,_calibration,_llama3,_mlc,_r2-downloader --outdirname=<path to download> -j
 ```
 
 **Native method**
+```bash
+bash <(curl -s https://raw.githubusercontent.com/mlcommons/r2-downloader/refs/heads/main/mlc-r2-downloader.sh) \
+  https://inference.mlcommons-storage.org/metadata/llama3-1-8b-cnn-dailymail-calibration.uri
 ```
-rclone copy mlc-inference:mlcommons-inference-wg-public/llama3.1_8b/datasets/cnn_dailymail_calibration.json ./ -P
-```
-
-You can also download the calibration dataset from the Cloudflare R2 bucket by running the following command:
+This will download `cnn_dailymail_calibration.json`.
 
-```
-rclone copy mlc-inference:mlcommons-inference-wg-public/llama3.1_8b/cnn_eval.json ./ -P
+To specify a custom download directory for any of these, use the `-d` flag:
+```bash
+bash <(curl -s https://raw.githubusercontent.com/mlcommons/r2-downloader/refs/heads/main/mlc-r2-downloader.sh) \
+  -d /path/to/download/directory \
+  <URI>
 ```
 

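The three llama3.1-8b dataset downloads in the hunk above differ only in the metadata file name. A sketch that prints the full command for each (the loop and variable names are illustrative assumptions; the URIs are the ones from this commit; printing instead of executing keeps the sketch side-effect free):

```shell
# Print the three llama3.1-8b dataset download commands from this commit.
DOWNLOADER="https://raw.githubusercontent.com/mlcommons/r2-downloader/refs/heads/main/mlc-r2-downloader.sh"
METADATA_BASE="https://inference.mlcommons-storage.org/metadata"

for metadata in \
  llama3-1-8b-cnn-eval.uri \
  llama3-1-8b-sample-cnn-eval-5000.uri \
  llama3-1-8b-cnn-dailymail-calibration.uri
do
  # Echo rather than run, so nothing is downloaded by the sketch itself.
  echo "bash <(curl -s ${DOWNLOADER}) ${METADATA_BASE}/${metadata}"
done
```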
speech2text/README.md

Lines changed: 25 additions & 19 deletions
@@ -102,26 +102,24 @@ VLLM_TARGET_DEVICE=cpu pip install --break-system-packages . --no-build-isolatio
 
 You can download the model automatically via the below command
 ```
-mlcr get,ml-model,whisper,_rclone,_mlc --outdirname=<path_to_download> -j
+mlcr get,ml-model,whisper,_r2-downloader,_mlc --outdirname=<path_to_download> -j
 ```
 
-**Official Model download using native method**
+**Official Model download using MLC R2 Downloader**
 
-You can use Rclone to download the preprocessed dataset from a Cloudflare R2 bucket.
+Download the Whisper model using the MLCommons downloader:
 
-To run Rclone on Windows, you can download the executable [here](https://rclone.org/install/#windows).
-To install Rclone on Linux/macOS/BSD systems, run:
-```
-sudo -v ; curl https://rclone.org/install.sh | sudo bash
-```
-Once Rclone is installed, run the following command to authenticate with the bucket:
-```
-rclone config create mlc-inference s3 provider=Cloudflare access_key_id=f65ba5eef400db161ea49967de89f47b secret_access_key=fbea333914c292b854f14d3fe232bad6c5407bf0ab1bebf78833c2b359bdfd2b endpoint=https://c2686074cb2caf5cbaf6d134bdba8b47.r2.cloudflarestorage.com
+```bash
+bash <(curl -s https://raw.githubusercontent.com/mlcommons/r2-downloader/refs/heads/main/mlc-r2-downloader.sh) -d whisper/model https://inference.mlcommons-storage.org/metadata/whisper-model.uri
 ```
-You can then navigate in the terminal to your desired download directory and run the following command to download the model:
 
-```
-rclone copy mlc-inference:mlcommons-inference-wg-public/Whisper/model/ ./ -P
+This will download the Whisper model files.
+
+To specify a custom download directory, use the `-d` flag:
+```bash
+bash <(curl -s https://raw.githubusercontent.com/mlcommons/r2-downloader/refs/heads/main/mlc-r2-downloader.sh) \
+  -d /path/to/download/directory \
+  https://inference.mlcommons-storage.org/metadata/whisper-model.uri
 ```
 
 ### External Download (Not recommended for official submission)
@@ -153,16 +151,24 @@ We use dev-clean and dev-other splits, which are approximately 10 hours.
 
 **Using MLCFlow Automation**
 ```
-mlcr get,dataset,whisper,_preprocessed,_mlc,_rclone --outdirname=<path to download> -j
+mlcr get,dataset,whisper,_preprocessed,_mlc,_r2-downloader --outdirname=<path to download> -j
 ```
 
-**Native method**
+**Using MLC R2 Downloader**
 
-Download and install rclone as decribed in the [MLCommons Download section](#mlcommons-download)
+Download the preprocessed dataset using the MLCommons R2 Downloader:
 
-You can then navigate in the terminal to your desired download directory and run the following command to download the dataset:
+```bash
+bash <(curl -s https://raw.githubusercontent.com/mlcommons/r2-downloader/refs/heads/main/mlc-r2-downloader.sh) -d whisper/dataset https://inference.mlcommons-storage.org/metadata/whisper-dataset.uri
 ```
-rclone copy mlc-inference:mlcommons-inference-wg-public/Whisper/dataset/ ./ -P
+
+This will download the LibriSpeech dataset files.
+
+To specify a custom download directory, use the `-d` flag:
+```bash
+bash <(curl -s https://raw.githubusercontent.com/mlcommons/r2-downloader/refs/heads/main/mlc-r2-downloader.sh) \
+  -d /path/to/download/directory \
+  https://inference.mlcommons-storage.org/metadata/whisper-dataset.uri
 ```
 
 ### Unprocessed
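A download that silently failed or landed in the wrong place only surfaces once a benchmark run aborts; checking the target directories from the commands above first is cheap. A sanity-check sketch — the `check_download` function and its messages are illustrative assumptions, not part of the reference scripts:

```shell
# Sketch: verify a download directory exists and is non-empty before
# launching a benchmark run. Function name and messages are illustrative.
check_download() {
  local dir="$1"
  if [ -d "$dir" ] && [ -n "$(ls -A "$dir" 2>/dev/null)" ]; then
    echo "ok: $dir"
  else
    echo "missing or empty: $dir (re-run the downloader)"
    return 1
  fi
}

# Directories used by the whisper commands in this commit:
check_download whisper/model || true
check_download whisper/dataset || true
```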
