Attention

The metaphor You're looking for mentions of your new product WhirlyGigs in a review. Your eyes quickly skim the page, resting only when they find the familiar pattern of letters. When you hit a match, you process the surrounding context and adjust your judgement of the review.

Transformer multi-head self-attention defines multiple heads. Each head constructs a query vector for each position in the sequence, scores it against a key vector for every position, and normalises these scores to give importance weights over positions. The importance weights are then used in a weighted sum of value vectors, also generated for each position.
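To make the mechanics concrete, here is a minimal NumPy sketch of one multi-head self-attention forward pass. The function name and shapes are illustrative assumptions, and the output projection is omitted for brevity:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(x, wq, wk, wv, n_heads):
    """x: (seq, d_model); wq/wk/wv: (d_model, d_model) projection weights."""
    seq, d_model = x.shape
    d_head = d_model // n_heads
    # Project the input to queries, keys and values, one set per head.
    def project(w):
        return (x @ w).reshape(seq, n_heads, d_head).transpose(1, 0, 2)
    q, k, v = project(wq), project(wk), project(wv)  # each (heads, seq, d_head)
    # Score every query against every key, then normalise per query.
    # This (heads, seq, seq) tensor is the attention map.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)
    attn_map = softmax(scores, axis=-1)
    # Weighted sum of values, then merge the heads back together.
    out = attn_map @ v                               # (heads, seq, d_head)
    return out.transpose(1, 0, 2).reshape(seq, d_model), attn_map

# Usage with random weights, just to show the shapes:
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
wq, wk, wv = (rng.normal(size=(8, 8)) for _ in range(3))
out, attn = multi_head_self_attention(x, wq, wk, wv, n_heads=2)
print(out.shape, attn.shape)  # (4, 8) (2, 4, 4)
```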

Image generated by Stable Diffusion 2.1, prompt: "A large flock of colorful sheep. Photo. HDR."

The good The heart of the connection to the metaphor is the attention map, a normalised weighting function over tokens in a sequence or blocks in an image. It is easy to visualise, amenable to tweaking, and generally behaves as we'd expect. Helpfully, both axes on an attention map are grounded in the input/output of the model. Most importantly, it is consistent with the metaphor—high-weight regions are typically the relevant bits, and they do tell us what the model is looking at.
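As an illustration of how easy these maps are to get at, here is a hedged sketch using the Hugging Face transformers library to pull one attention map out of GPT-2 and plot it. The layer and head indices are an arbitrary choice, and the prompt is made up:

```python
import matplotlib.pyplot as plt
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2")

inputs = tok("You're looking for mentions of WhirlyGigs", return_tensors="pt")
out = model(**inputs, output_attentions=True)

layer, head = 5, 3  # arbitrary pick among GPT-2's 12 layers x 12 heads
attn = out.attentions[layer][0, head].detach().numpy()  # (seq, seq)

# Both axes are grounded in the input: rows are the positions doing
# the looking (queries), columns the positions being looked at (keys).
tokens = tok.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
plt.imshow(attn, cmap="viridis")
plt.xticks(range(len(tokens)), tokens, rotation=90)
plt.yticks(range(len(tokens)), tokens)
plt.colorbar(label="attention weight")
plt.show()
```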

The bad First, there are a lot of attention maps to choose from. GPT-3 has 96 layers $\times$ 96 heads per layer, a whopping 9216 attention heads in total, each defining their own attention map for any given input. This puts attention-based interpretation in a risky place—if we look hard enough there's probably an attention map to support a new theory about what is happening for a given input.

Another limitation of attention maps is that they tell us nothing about the nature of the interaction between multiple heads, or about the values being mixed under the attention weights (see the sketch below). Transformer models often look at the same position multiple times, either in parallel heads in the same layer or across layers. In metaphor-land, it's not clear why you'd need to look multiple times.
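A tiny sketch of that second point, with made-up numbers: two different value matrices mixed under the identical attention map give very different outputs, so the map alone doesn't tell you what was retrieved:

```python
import numpy as np

# The same "where to look" weights in both cases.
attn_map = np.array([[0.9, 0.1],
                     [0.5, 0.5]])

values_a = np.array([[1.0, 0.0],
                     [0.0, 1.0]])
values_b = np.array([[5.0, 5.0],
                     [-5.0, 5.0]])

print(attn_map @ values_a)  # one mixture...
print(attn_map @ values_b)  # ...a very different one, same map
```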

Conclusion Attention is a good name. Attention maps are useful tools. But they are certainly at risk of over-interpretation, and invite confirmation bias. As ever—enjoy the metaphor, use it with care.