Mixture of Experts

The metaphor You're managing an editorial team at a newspaper. Each afternoon a large pile of stories is dumped on your desk (each a joint effort between LLM and prompter). You could have every editor read every single one and balance their opinions to process each manuscript. But that'd be slow, so instead you quickly skim each to categorise it, then pop it in the inbox of the sports/political/technology editor. Thanks to their expertise, you know they can do an accurate job quickly.

Mixture-of-Experts (MoE) layers classify incoming tokens, assigning each to an expert layer. Tokens are routed to their chosen layer, which processes them independently of the others. The outputs from each expert layer are routed back for further processing.
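That classify-route-process-return loop can be sketched in a few lines. This is a minimal top-1 routing sketch with random weights and made-up shapes, not a production MoE layer; names like `W_router` and `moe_layer` are my own.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, d_hidden = 8, 4, 16

# The "managing editor": a linear layer scoring each token against each expert.
W_router = rng.normal(size=(d_model, n_experts))
# Each "expert": a small two-layer MLP with its own weights.
experts = [
    (rng.normal(size=(d_model, d_hidden)), rng.normal(size=(d_hidden, d_model)))
    for _ in range(n_experts)
]

def moe_layer(tokens):
    """Top-1 MoE: each token is processed only by the expert its router score picks."""
    scores = tokens @ W_router          # (n_tokens, n_experts)
    choices = scores.argmax(axis=1)     # hard top-1 assignment per token
    out = np.empty_like(tokens)
    for e, (W1, W2) in enumerate(experts):
        mask = choices == e             # the tokens in this expert's "inbox"
        if mask.any():
            out[mask] = np.maximum(tokens[mask] @ W1, 0) @ W2  # ReLU MLP
    return out                          # outputs routed back, in original order

tokens = rng.normal(size=(5, d_model))
print(moe_layer(tokens).shape)  # (5, 8)
```

Real implementations soft-weight the top-k experts by the router's softmax scores and add load-balancing losses, but the skim-and-sort structure is the same.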

"The editors", a poster for an upcoming movie, showing the editorial team. Image generated by Stable Diffusion 2.1, prompt: ""The editors", a poster for an upcoming movie, showing the editorial team."

The good The metaphor has a good correspondence: managing editor & classifier, internal mail & routing, domain-editor & expert layer, domain experience & expert training data. It gives some intuition for what the token-expert classifier layer might be doing. And it encourages us to think about the subset of the training data that each expert sees.

The bad Thinking of each routed layer as an expert is risky. In the metaphor, an expert's subject area is grounded and identifiable, but in a standard MoE layer it is neither. In other words, experts define clusters, not classes.

So we'll probably get tied up in knots if we worry about what a given layer's "expertise" really is. It might not be anything we can identify as a legitimate speciality. Perhaps it's not a MoE layer at all; it's just a MaRoS (mapping a region of space) layer. OK, I just made that up (and it's not very good). But you get the idea—it's possible to move away from the metaphor into something more abstract.

If it's wrong, fine, but what's the danger? As with all shaky metaphors, it could mean we waste effort on bad ideas and dismiss good ones. Perhaps we come to think the assignment layer must be guided by identifiable classes to do its job properly (I'm sceptical). Or perhaps we assume a stable random assignment could never work—see Hash Layers.
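The Hash Layers idea is striking precisely because it drops the "editor" entirely: the expert index is a fixed deterministic function of the token, with no learned router at all. A minimal sketch of that assignment (the function name and MD5 choice are my own illustration, not the paper's exact scheme):

```python
import hashlib

def hash_route(token_id: int, n_experts: int) -> int:
    """Stable, learning-free routing: expert index is a hash of the token id.

    Every occurrence of the same token, in any context, goes to the same
    expert—no classifier, no training signal needed for assignment.
    """
    digest = hashlib.md5(str(token_id).encode()).digest()
    return digest[0] % n_experts

# The assignment is fixed for all time, so load balancing can be checked offline.
print(hash_route(42, 4) == hash_route(42, 4))  # True
```

If routed layers were genuinely "experts" in identifiable subjects, a scheme this blunt shouldn't be competitive—which is part of the argument against taking the metaphor too seriously.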

Conclusion It's a tempting metaphor, and seems quite good at first glance. But "experts" is a dangerously powerful term, and can bring along lots of baggage. I'll come down quite strongly on this one—I'd rather leave the layers abstract; so maybe I'll still say "MoE" for now, but I'll quietly try to forget what the "E" stands for.