This is a reproduction of the model used in Malach, E., 2023. Auto-regressive next-token predictors are universal learners. arXiv preprint arXiv:2309.06979, Section 4.1.
You can also find this in Colab.
The model is:
- An embedding layer from the token space to a d-dimensional vector space.
- A linear layer of size (dT)x(dT), where T is the context window size; the T token embeddings are concatenated (flattened) into a single dT-dimensional vector before this layer.
- A linear layer mapping to logits over the token space.
I also added a LayerNorm after the second layer (the (dT)x(dT) linear).
This toy model is a good exercise in understanding next-token prediction, as it does not require positional encodings, self-attention, etc.
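
Here is a minimal PyTorch sketch of the architecture described above. The class name, variable names, and default hyperparameters (`vocab_size`, `d`, `T`) are my own assumptions for illustration, not taken from the paper or the Colab.

```python
import torch
import torch.nn as nn

class NextTokenMLP(nn.Module):
    def __init__(self, vocab_size: int, d: int = 32, T: int = 16):
        super().__init__()
        self.T = T
        # Embedding layer: token ids -> d-dimensional vectors.
        self.embed = nn.Embedding(vocab_size, d)
        # Linear layer of size (dT)x(dT) acting on the flattened context.
        self.hidden = nn.Linear(d * T, d * T)
        # LayerNorm after the second layer, as described above.
        self.norm = nn.LayerNorm(d * T)
        # Linear layer mapping to logits over the token space.
        self.out = nn.Linear(d * T, vocab_size)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, T) integer token ids
        x = self.embed(tokens)         # (batch, T, d)
        x = x.flatten(start_dim=1)     # (batch, T*d) concatenated context
        x = self.norm(self.hidden(x))  # (batch, T*d)
        return self.out(x)             # (batch, vocab_size) next-token logits

# Example usage with made-up sizes:
model = NextTokenMLP(vocab_size=128, d=32, T=16)
logits = model(torch.randint(0, 128, (4, 16)))  # shape (4, 128)
```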