NeurIPS 2017

Attention Is All You Need

Vaswani et al.·2017·~5 min·arxiv ↗
main character · transformers · attention · ML · NLP
§1 brainrot tldr

someone looked at RNNs, said "what if we just didn't", and accidentally invented the architecture that runs everything now. attention mechanisms replace recurrence entirely — no sequential processing, no vanishing gradients, no waiting for token N-1 before you can think about token N. you just... look at everything at once. it works embarrassingly well.
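the "look at everything at once" bit is literally just matrix multiplies — here's a minimal NumPy sketch of scaled dot-product self-attention. the shapes, random weights, and function name are illustrative, not the paper's trained model:

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over the whole sequence at once.

    X: (seq_len, d_model) token embeddings; Wq/Wk/Wv project to queries,
    keys, and values. No loop over positions — every token attends to
    every other token in a single (seq_len, seq_len) matrix of scores.
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)           # pairwise similarities
    # numerically stable softmax over the key axis
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights               # weighted sum of values

# toy example: 6 tokens, d_model = 8, random projections
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 8))
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out, w = self_attention(X, Wq, Wk, Wv)
```

note there's nothing sequential anywhere in there — swap the 6 tokens for 6,000 and it's the same two matmuls, which is exactly why this parallelises where RNNs can't.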

§2 key findings
  • self-attention computes relationships between all positions simultaneously — no sequential dependency, no recurrence
  • multi-head attention lets the model attend to different representation subspaces at once — like having multiple perspectives on the same sentence
  • positional encodings inject sequence order since attention itself is order-agnostic (it sees a set, not a sequence)
  • the transformer outperformed RNN-based seq2seq models on WMT 2014 English-to-German and English-to-French translation
  • training was significantly faster than recurrent models because everything can be parallelised
  • the architecture generalised beyond translation — this is now the backbone of basically everything in NLP
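the positional-encoding trick from the third bullet is just fixed sinusoids added to the embeddings, so attention can tell position 2 from position 5. a small sketch (seq_len and d_model chosen arbitrarily):

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encodings, per the paper:
    PE[pos, 2i]   = sin(pos / 10000**(2i / d_model))
    PE[pos, 2i+1] = cos(pos / 10000**(2i / d_model))
    Each position gets a unique pattern of frequencies."""
    pos = np.arange(seq_len)[:, None]           # (seq_len, 1)
    i = np.arange(0, d_model, 2)[None, :]       # even dimensions
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                # even dims get sine
    pe[:, 1::2] = np.cos(angles)                # odd dims get cosine
    return pe

pe = positional_encoding(6, 8)  # one encoding row per token position
```

because the encodings are deterministic functions of position, nothing has to be learned, and in principle they extrapolate to sequence lengths never seen in training.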
§3 interactive visual

figure 1 — self-attention weights

each cell shows how much the row token attends to the column token. higher value = higher attention weight.

full attention matrix (all heads averaged):
      The   cat   sat   on    the   mat
The   0.60  0.10  0.05  0.05  0.15  0.05
cat   0.10  0.50  0.20  0.05  0.05  0.10
sat   0.05  0.30  0.40  0.10  0.05  0.10
on    0.05  0.05  0.10  0.50  0.20  0.10
the   0.20  0.05  0.05  0.10  0.50  0.10
mat   0.05  0.10  0.10  0.10  0.15  0.50
row = query token, col = key token
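a quick sketch of what "all heads averaged" means in the figure caption — per-head weights here are random and purely hypothetical, but it shows that averaging softmax rows across heads still gives rows that sum to 1 (an average of probability distributions is a distribution):

```python
import numpy as np

# hypothetical per-head attention weights for a 6-token sentence:
# shape (n_heads, seq_len, seq_len), each row a softmax over key tokens
rng = np.random.default_rng(1)
n_heads, seq_len = 8, 6
logits = rng.normal(size=(n_heads, seq_len, seq_len))
heads = np.exp(logits)
heads /= heads.sum(axis=-1, keepdims=True)

# the figure-1 style matrix: mean over the head axis
avg = heads.mean(axis=0)
```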
§4 comprehension check

peer review quiz

[REVIEWER 2 DEMANDS YOU ANSWER THESE]

question 1

what does self-attention replace in the transformer architecture?

question 2

why do transformers need positional encodings?

question 3

what does 'multi-head' in multi-head attention mean?

question 4

the paper's main claim is that attention mechanisms alone are sufficient for sequence modelling. what evidence supports this?