Hi,
In standard tensor parallelism, we typically have:
Attention → Output Projection → All-Reduce → LayerNorm → FFN → All-Reduce → LayerNorm.
However, in the paper you use GEMV.AG and O.AG. I didn't fully understand this part. Could you briefly explain the math behind it, and why you use two all-gathers plus one all-reduce (2 AG + 1 AR) instead of the conventional two all-reduces (2 AR)?
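For reference, the "conventional 2 AR" pattern I mean is the Megatron-style one, where each sharded block computes a partial output locally and a single all-reduce (sum) recovers the full result. A minimal single-process sketch of that math, simulating the shards with numpy (all names and shapes here are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
tp = 2                   # simulated tensor-parallel world size
d, h = 4, 8              # model dim, FFN hidden dim

x = rng.standard_normal((3, d))     # activations, replicated on every rank
W1 = rng.standard_normal((d, h))    # first FFN weight, column-parallel (split on h)
W2 = rng.standard_normal((h, d))    # second FFN weight, row-parallel (split on h)

# Reference: the unsharded forward pass.
ref = np.maximum(x @ W1, 0) @ W2

# Sharded forward: rank r holds a column slice of W1 and the matching
# row slice of W2, so both GEMMs are local and need no communication.
partials = []
for r in range(tp):
    shard = slice(r * h // tp, (r + 1) * h // tp)
    y_r = np.maximum(x @ W1[:, shard], 0)   # local GEMM + elementwise ReLU
    partials.append(y_r @ W2[shard, :])     # partial output on rank r

out = sum(partials)      # this sum is exactly what the all-reduce computes

assert np.allclose(out, ref)
```

The same identity holds for the attention block (heads sharded, output projection row-parallel), which is why the standard schedule ends up with one all-reduce after attention and one after the FFN.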
Thanks