README.md: 2 additions & 2 deletions

@@ -324,7 +324,7 @@ torchtune focuses on integrating with popular tools and libraries from the ecosystem
 - [EleutherAI's LM Eval Harness](https://github.com/EleutherAI/lm-evaluation-harness) for [evaluating](recipes/eleuther_eval.py) trained models
 - [Hugging Face Datasets](https://huggingface.co/docs/datasets/en/index) for [access](torchtune/datasets/_instruct.py) to training and evaluation datasets
 - [PyTorch FSDP2](https://github.com/pytorch/torchtitan/blob/main/docs/fsdp.md) for distributed training
-- [torchao](https://github.com/pytorch-labs/ao) for lower precision dtypes and [post-training quantization](recipes/quantize.py) techniques
+- [torchao](https://github.com/pytorch/ao) for lower precision dtypes and [post-training quantization](recipes/quantize.py) techniques
 - [Weights & Biases](https://wandb.ai/site) for [logging](https://pytorch.org/torchtune/main/deep_dives/wandb_logging.html) metrics and checkpoints, and tracking training progress
 - [Comet](https://www.comet.com/site/) as another option for [logging](https://pytorch.org/torchtune/main/deep_dives/comet_logging.html)
 - [ExecuTorch](https://pytorch.org/executorch-overview) for [on-device inference](https://github.com/pytorch/executorch/tree/main/examples/models/llama2#optional-finetuning) using finetuned models

@@ -351,7 +351,7 @@ We really value our community and the contributions made by our wonderful users.
 The transformer code in this repository is inspired by the original [Llama2 code](https://github.com/meta-llama/llama/blob/main/llama/model.py). We also want to give a huge shout-out to EleutherAI, Hugging Face and
 Weights & Biases for being wonderful collaborators and for working with us on some of these integrations within torchtune. In addition, we want to acknowledge some other awesome libraries and tools from the ecosystem:
 
-- [gpt-fast](https://github.com/pytorch-labs/gpt-fast) for performant LLM inference techniques which we've adopted out-of-the-box
+- [gpt-fast](https://github.com/meta-pytorch/gpt-fast) for performant LLM inference techniques which we've adopted out-of-the-box
 - [llama recipes](https://github.com/meta-llama/llama-recipes) for spring-boarding the llama2 community
 - [bitsandbytes](https://github.com/TimDettmers/bitsandbytes) for bringing several memory and performance based techniques to the PyTorch ecosystem
 - [@winglian](https://github.com/winglian/) and [axolotl](https://github.com/OpenAccess-AI-Collective/axolotl) for early feedback and brainstorming on torchtune's design and feature set.

docs/source/tutorials/e2e_flow.rst: 1 addition & 1 deletion

@@ -341,7 +341,7 @@ these parameters.
 Introduce some quantization
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-We rely on `torchao <https://github.com/pytorch-labs/ao>`_ for `post-training quantization <https://github.com/pytorch/ao/tree/main/torchao/quantization#quantization>`_.
+We rely on `torchao <https://github.com/pytorch/ao>`_ for `post-training quantization <https://github.com/pytorch/ao/tree/main/torchao/quantization#quantization>`_.
 To quantize the fine-tuned model after installing torchao we can run the following command::
 
    # we also support `int8_weight_only()` and `int8_dynamic_activation_int8_weight()`, see
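
For context on the API this hunk points at, here is a minimal, hedged sketch of post-training quantization with torchao applied directly in Python. The toy module and shapes are stand-ins (not the torchtune recipe or its config); `quantize_` and `int8_weight_only` are the torchao entry points the snippet above refers to.

```python
# Hedged sketch: int8 weight-only post-training quantization with torchao,
# applied to a toy module rather than a finetuned torchtune checkpoint.
import torch
import torch.nn as nn
from torchao.quantization import quantize_, int8_weight_only

model = nn.Sequential(
    nn.Linear(4096, 11008),
    nn.Linear(11008, 4096),
).to(torch.bfloat16)

# Replace each Linear's weight with an int8 weight-only quantized version, in place.
quantize_(model, int8_weight_only())

with torch.no_grad():
    out = model(torch.randn(2, 4096, dtype=torch.bfloat16))
print(out.shape)  # torch.Size([2, 4096])
```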

docs/source/tutorials/llama3.rst: 1 addition & 1 deletion

@@ -241,7 +241,7 @@ Running generation with our LoRA-finetuned model, we see the following output:
 Faster generation via quantization
 ----------------------------------
 
-We rely on `torchao <https://github.com/pytorch-labs/ao>`_ for `post-training quantization <https://github.com/pytorch/ao/tree/main/torchao/quantization#quantization>`_.
+We rely on `torchao <https://github.com/pytorch/ao>`_ for `post-training quantization <https://github.com/pytorch/ao/tree/main/torchao/quantization#quantization>`_.
 To quantize the fine-tuned model after installing torchao we can run the following command::
 
    # we also support `int8_weight_only()` and `int8_dynamic_activation_int8_weight()`, see
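
The second technique named in that comment goes through the same entry point; a minimal sketch, again on a stand-in layer rather than the finetuned Llama3 checkpoint:

```python
# Hedged sketch: dynamic int8 activation + int8 weight quantization via torchao,
# applied to a stand-in linear layer (not the finetuned Llama3 model).
import torch
import torch.nn as nn
from torchao.quantization import quantize_, int8_dynamic_activation_int8_weight

layer = nn.Linear(4096, 4096).to(torch.bfloat16)
quantize_(layer, int8_dynamic_activation_int8_weight())

with torch.no_grad():
    print(layer(torch.randn(1, 4096, dtype=torch.bfloat16)).shape)
```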

docs/source/tutorials/qlora_finetune.rst: 3 additions & 3 deletions

@@ -42,7 +42,7 @@ accuracy.
 
 The QLoRA authors introduce two key abstractions to decrease memory usage and avoid accuracy degradation: the bespoke 4-bit NormalFloat
 type, and a double quantization method that quantizes the quantization parameters themselves to save even more memory. torchtune uses
-the `NF4Tensor <https://github.com/pytorch-labs/ao/blob/b9beaf351e27133d189b57d6fa725b1a7824a457/torchao/dtypes/nf4tensor.py#L153>`_ abstraction from the `torchao library <https://github.com/pytorch-labs/ao>`_ to build QLoRA components as specified in the paper.
+the `NF4Tensor <https://github.com/pytorch/ao/blob/b9beaf351e27133d189b57d6fa725b1a7824a457/torchao/dtypes/nf4tensor.py#L153>`_ abstraction from the `torchao library <https://github.com/pytorch/ao>`_ to build QLoRA components as specified in the paper.
 torchao is a PyTorch-native library that allows you to quantize and prune your models.
 
 
@@ -275,7 +275,7 @@ As mentioned above, torchtune takes a dependency on torchao for some of the core
 
 The key changes on top of the LoRA layer are the usage of the ``to_nf4`` and ``linear_nf4`` APIs.
 
-``to_nf4`` accepts an unquantized (bf16 or fp32) tensor and produces an ``NF4`` representation of the weight. See the `implementation <https://github.com/pytorch-labs/ao/blob/c40358072f99b50cd7e58ec11e0e8d90440e3e25/torchao/dtypes/nf4tensor.py#L587>`_ of ``to_nf4`` for more details.
+``to_nf4`` accepts an unquantized (bf16 or fp32) tensor and produces an ``NF4`` representation of the weight. See the `implementation <https://github.com/pytorch/ao/blob/c40358072f99b50cd7e58ec11e0e8d90440e3e25/torchao/dtypes/nf4tensor.py#L587>`_ of ``to_nf4`` for more details.
 ``linear_nf4`` handles the forward pass and autograd when running with quantized base model weights. It computes the forward pass as a regular
 ``F.linear`` with the incoming activation and unquantized weight. The quantized weight is saved for backward, as opposed to the unquantized version of the weight, to avoid extra
-memory usage due to storing higher precision variables to compute gradients in the backward pass. See `linear_nf4 <https://github.com/pytorch-labs/ao/blob/main/torchao/dtypes/nf4tensor.py#L577>`_ for more details.
+memory usage due to storing higher precision variables to compute gradients in the backward pass. See `linear_nf4 <https://github.com/pytorch/ao/blob/main/torchao/dtypes/nf4tensor.py#L577>`_ for more details.
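
To make the ``to_nf4``/``linear_nf4`` description in this hunk concrete, here is a small self-contained sketch; the shapes and dtypes are arbitrary illustrations, not torchtune defaults.

```python
# Hedged sketch of the NF4 round trip described above; shapes are arbitrary.
import torch
from torchao.dtypes.nf4tensor import to_nf4, linear_nf4

weight = torch.randn(512, 1024, dtype=torch.bfloat16)  # unquantized (bf16) base weight
nf4_weight = to_nf4(weight)                            # 4-bit NormalFloat representation

x = torch.randn(2, 1024, dtype=torch.bfloat16, requires_grad=True)
out = linear_nf4(x, nf4_weight)  # forward pass as a regular F.linear
out.sum().backward()             # the quantized (NF4) weight is saved for backward
print(out.shape, x.grad.shape)
```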