someone looked at RNNs, said "what if we just didn't", and accidentally invented the transformer, the architecture that runs everything now. attention mechanisms replace recurrence entirely — no sequential processing, no vanishing gradients, no waiting for token N-1 before you can think about token N. you just... look at everything at once. it works embarrassingly well.
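to make "look at everything at once" concrete, here's a minimal sketch of single-head scaled dot-product self-attention in numpy. the shapes, the `softmax` helper, and the random toy inputs are illustrative assumptions, not anything lifted from the paper:

```python
import numpy as np

def softmax(x, axis=-1):
    # subtract the max for numerical stability before exponentiating
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention.

    X          : (seq_len, d_model) token embeddings
    Wq, Wk, Wv : (d_model, d_k) learned projection matrices
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv    # project into query / key / value spaces
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # every token scores every other token at once
    weights = softmax(scores, axis=-1)  # each row sums to 1, like the table below
    return weights @ V, weights         # weighted mix of the value vectors

# toy usage: 6 tokens, 8-dim embeddings, 4-dim head
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 8))
Wq, Wk, Wv = (rng.normal(size=(8, 4)) for _ in range(3))
out, weights = self_attention(X, Wq, Wk, Wv)
print(out.shape, weights.shape)  # (6, 4) (6, 6)
```

the 1/sqrt(d_k) scaling keeps the dot products from blowing up as the head dimension grows, which would otherwise push the softmax into near one-hot territory and kill gradients.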
each row shows how much that token attends to every other token in "The cat sat on the mat"; higher values mean higher attention weight, and each row sums to 1.
| from \ to | The | cat | sat | on | the | mat |
|---|---|---|---|---|---|---|
| The | 0.60 | 0.10 | 0.05 | 0.05 | 0.15 | 0.05 |
| cat | 0.10 | 0.50 | 0.20 | 0.05 | 0.05 | 0.10 |
| sat | 0.05 | 0.30 | 0.40 | 0.10 | 0.05 | 0.10 |
| on | 0.05 | 0.05 | 0.10 | 0.50 | 0.20 | 0.10 |
| the | 0.20 | 0.05 | 0.05 | 0.10 | 0.50 | 0.10 |
| mat | 0.05 | 0.10 | 0.10 | 0.10 | 0.15 | 0.50 |
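a tiny sanity check on that matrix, assuming (as is standard) that each row is the softmax distribution for the token doing the attending: the rows sum to 1, and you can read off which token each one looks at most. the numpy snippet below is just an illustrative sketch over the numbers in the table:

```python
import numpy as np

tokens = ["The", "cat", "sat", "on", "the", "mat"]

# attention weights from the table above (rows = attending token, columns = attended-to token)
weights = np.array([
    [0.60, 0.10, 0.05, 0.05, 0.15, 0.05],
    [0.10, 0.50, 0.20, 0.05, 0.05, 0.10],
    [0.05, 0.30, 0.40, 0.10, 0.05, 0.10],
    [0.05, 0.05, 0.10, 0.50, 0.20, 0.10],
    [0.20, 0.05, 0.05, 0.10, 0.50, 0.10],
    [0.05, 0.10, 0.10, 0.10, 0.15, 0.50],
])

assert np.allclose(weights.sum(axis=1), 1.0)  # each row is a probability distribution

# for each token, report the token it attends to most strongly besides itself
for i, tok in enumerate(tokens):
    row = weights[i].copy()
    row[i] = 0.0
    print(f"{tok:>4} attends most to {tokens[row.argmax()]!r} ({row.max():.2f})")
```

running it shows, for example, that "sat" attends most to "cat" (0.30), which is the kind of syntactic link you'd hope attention picks up.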
[REVIEWER 2 DEMANDS YOU ANSWER THESE]
what does self-attention replace in the transformer architecture?
why do transformers need positional encodings?
what does 'multi-head' in multi-head attention mean?
the paper's main claim is that attention mechanisms alone are sufficient for sequence modelling. what evidence supports this?