Hi,
In standard tensor parallelism, we typically have:
Attention → Output Projection → All-Reduce → LayerNorm → FFN → All-Reduce → LayerNorm.
However, in the paper you use GEMV.AG and O.AG. I didn't fully understand this part. Could you briefly explain the math behind it, and why you use two all-gathers plus one all-reduce (2 AG + 1 AR) instead of the conventional two all-reduces (2 AR)?
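For reference, the "conventional 2 AR" pattern I mean is the Megatron-style one, where each sharded block computes a partial output locally and a single all-reduce (sum) recovers the full result. A minimal single-process sketch of that math, simulating the shards with numpy (all names and shapes here are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
tp = 2                   # simulated tensor-parallel world size
d, h = 4, 8              # model dim, FFN hidden dim

x = rng.standard_normal((3, d))     # activations, replicated on every rank
W1 = rng.standard_normal((d, h))    # first FFN weight, column-parallel (split on h)
W2 = rng.standard_normal((h, d))    # second FFN weight, row-parallel (split on h)

# Reference: the unsharded forward pass.
ref = np.maximum(x @ W1, 0) @ W2

# Sharded forward: rank r holds a column slice of W1 and the matching
# row slice of W2, so both GEMMs are local and need no communication.
partials = []
for r in range(tp):
    shard = slice(r * h // tp, (r + 1) * h // tp)
    y_r = np.maximum(x @ W1[:, shard], 0)   # local GEMM + elementwise ReLU
    partials.append(y_r @ W2[shard, :])     # partial output on rank r

out = sum(partials)      # this sum is exactly what the all-reduce computes

assert np.allclose(out, ref)
```

The same identity holds for the attention block (heads sharded, output projection row-parallel), which is why the standard schedule ends up with one all-reduce after attention and one after the FFN.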
Thanks