Transformers
Get a non-technical introduction to attention from 3Blue1Brown (3B1B) with these two videos: , .
Go through an intuitive visualization:
Progress Check:
Be able to answer all of
Understand the differences between:
encoder vs. decoder, self-attention vs. cross-attention, LayerNorm vs. BatchNorm
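For the last pair, here's a minimal PyTorch sketch (the shapes are made up purely for illustration) of which axes each norm computes its statistics over:

```python
import torch
import torch.nn as nn

# Hypothetical shapes, purely for illustration: (batch, seq_len, d_model)
x = torch.randn(4, 10, 64)

# LayerNorm normalizes over the feature dimension, separately for each token,
# so its statistics don't depend on the batch (handy for variable-length text).
ln = nn.LayerNorm(64)
y_ln = ln(x)  # mean/var computed over the last dim for each of the 4*10 tokens

# BatchNorm normalizes each feature across the batch, coupling examples
# together and requiring running statistics at inference time.
bn = nn.BatchNorm1d(64)
y_bn = bn(x.transpose(1, 2)).transpose(1, 2)  # BatchNorm1d wants (batch, features, seq)

print(y_ln.mean(dim=-1).abs().max())      # per-token mean after LayerNorm is ~0
print(y_bn.mean(dim=(0, 1)).abs().max())  # per-feature mean after BatchNorm is ~0
```

The practical upshot: LayerNorm's statistics are per-token, so it behaves the same at train and test time and doesn't care about batch size, which is one reason transformers use it.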
Read up on residual connections to build more intuition for why they actually work. You might be the kind of person who gets it right away, but for me it felt unintuitive for a while.
One somewhat correct intuition: because of the way gradients flow through addition (recall that + is a "distributor": the upstream gradient passes unchanged into both branches), the network learns to treat each new layer as an edit to the previous layer's output rather than as an entirely new representation.
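A tiny sketch of that gradient argument (assuming PyTorch; the "dead" layer is contrived just to make the point):

```python
import torch

# y = x + f(x): the "+" routes the upstream gradient to both branches
# unchanged, so x always gets a direct gradient path, even when f is "dead".
x = torch.ones(3, requires_grad=True)

def f(t):
    return 0.0 * t  # contrived layer whose gradient contribution is zero

y = x + f(x)
y.sum().backward()
print(x.grad)  # tensor([1., 1., 1.]) -- the identity path preserved the gradient
```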
Two good links to dive deeper on the actual gradient issues that skip connections help avoid:
Explain attention from the perspective of two consecutive tokens a and b, where a = "The", b = "dog", and b is attending to a.
Answer: Token b's query (which can be read as "look one position behind me") will have a high affinity score with token a's key (which can be read as "I am the position right before you"). That high attention weight means token a's value vector gets mixed into token b's representation, so b ("dog") picks up added context that can be thought of as "I follow 'The' ". In other words, "dog" now carries the information that it follows "The".
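Here's a toy numeric sketch of that story (all vectors below are invented for illustration; real models learn these directions rather than having them hand-set):

```python
import torch
import torch.nn.functional as F

# Hand-picked numbers to act out the story above. Token b's query asks
# "what's one position behind me?"; token a's key answers "I'm the position
# right before you."
q_b = torch.tensor([1.0, 0.0])   # b's query: "look one position behind"
k_a = torch.tensor([1.0, 0.0])   # a's key: matches b's query -> high affinity
k_b = torch.tensor([0.0, 1.0])   # b's own key: doesn't match -> low affinity
v_a = torch.tensor([0.9, 0.1])   # a's value: carries "I am 'The'"
v_b = torch.tensor([0.1, 0.9])   # b's value

scores = torch.stack([q_b @ k_a, q_b @ k_b])     # b's affinity for a and for itself
weights = F.softmax(scores, dim=0)               # most weight lands on a
context_b = weights[0] * v_a + weights[1] * v_b  # a's value dominates the mix
print(weights, context_b)  # "dog" now carries the context "I follow 'The'"
```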
On the left, you'll see other sections that dive deeper into mechanistic interpretability, which at a high level can be thought of as "neuroscience on neural nets." If you find that interesting I would highly recommend checking some of those chapters out, but they're not strictly necessary.
Exercises:
Do some of the Karpathy transformer exercises if you haven't already. I've implemented the GPT calculator .