NeurIPS 2017

Attention Is All You Need

Vaswani et al.·2017·~5 min·arxiv ↗
main character · transformers · attention · ML · NLP
§1 brainrot tldr

someone looked at RNNs, said "what if we just didn't", and accidentally invented the architecture that runs everything now. attention mechanisms replace recurrence entirely — no sequential processing, no vanishing gradients, no waiting for token N-1 before you can think about token N. you just... look at everything at once. it works embarrassingly well.
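the "look at everything at once" bit is literally just matrix multiplies — here's a minimal NumPy sketch of scaled dot-product self-attention. the shapes, random weights, and function name are illustrative, not the paper's trained model:

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over the whole sequence at once.

    X: (seq_len, d_model) token embeddings; Wq/Wk/Wv project to queries,
    keys, and values. No loop over positions — every token attends to
    every other token in a single (seq_len, seq_len) matrix of scores.
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)           # pairwise similarities
    # numerically stable softmax over the key axis
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights               # weighted sum of values

# toy example: 6 tokens, d_model = 8, random projections
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 8))
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out, w = self_attention(X, Wq, Wk, Wv)
```

note there's nothing sequential anywhere in there — swap the 6 tokens for 6,000 and it's the same two matmuls, which is exactly why this parallelises where RNNs can't.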

§2 key findings
  • self-attention computes relationships between all positions simultaneously — no sequential dependency, no recurrence
  • multi-head attention lets the model attend to different representation subspaces at once — like having multiple perspectives on the same sentence
  • positional encodings inject sequence order since attention itself is order-agnostic (it sees a set, not a sequence)
  • the transformer outperformed RNN-based seq2seq models on WMT 2014 English-to-German and English-to-French translation
  • training was significantly faster than recurrent models because everything can be parallelised
  • the architecture generalised beyond translation — this is now the backbone of basically everything in NLP
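the positional-encoding trick from the third bullet is just fixed sinusoids added to the embeddings, so attention can tell position 2 from position 5. a small sketch (seq_len and d_model chosen arbitrarily):

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encodings, per the paper:
    PE[pos, 2i]   = sin(pos / 10000**(2i / d_model))
    PE[pos, 2i+1] = cos(pos / 10000**(2i / d_model))
    Each position gets a unique pattern of frequencies."""
    pos = np.arange(seq_len)[:, None]           # (seq_len, 1)
    i = np.arange(0, d_model, 2)[None, :]       # even dimensions
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                # even dims get sine
    pe[:, 1::2] = np.cos(angles)                # odd dims get cosine
    return pe

pe = positional_encoding(6, 8)  # one encoding row per token position
```

because the encodings are deterministic functions of position, nothing has to be learned, and in principle they extrapolate to sequence lengths never seen in training.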
§3 interactive visual

figure 1 — self-attention weights

each cell shows how much the row token attends to the column token. higher value = higher attention weight.

full attention matrix (all heads averaged):
      The   cat   sat   on    the   mat
The   0.60  0.10  0.05  0.05  0.15  0.05
cat   0.10  0.50  0.20  0.05  0.05  0.10
sat   0.05  0.30  0.40  0.10  0.05  0.10
on    0.05  0.05  0.10  0.50  0.20  0.10
the   0.20  0.05  0.05  0.10  0.50  0.10
mat   0.05  0.10  0.10  0.10  0.15  0.50
row = query token, col = key token
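a quick sketch of what "all heads averaged" means in the figure caption — per-head weights here are random and purely hypothetical, but it shows that averaging softmax rows across heads still gives rows that sum to 1 (an average of probability distributions is a distribution):

```python
import numpy as np

# hypothetical per-head attention weights for a 6-token sentence:
# shape (n_heads, seq_len, seq_len), each row a softmax over key tokens
rng = np.random.default_rng(1)
n_heads, seq_len = 8, 6
logits = rng.normal(size=(n_heads, seq_len, seq_len))
heads = np.exp(logits)
heads /= heads.sum(axis=-1, keepdims=True)

# the figure-1 style matrix: mean over the head axis
avg = heads.mean(axis=0)
```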
§4 comprehension check

peer review quiz

[REVIEWER 2 DEMANDS YOU ANSWER THESE]

question 1

what does self-attention replace in the transformer architecture?

question 2

why do transformers need positional encodings?

question 3

what does 'multi-head' in multi-head attention mean?

question 4

the paper's main claim is that attention mechanisms alone are sufficient for sequence modelling. what evidence supports this?