Opening Thought: If this were to dominate CV, does that mean the evolution of the CV field is MLP -> CNN -> Transformer -> MLP? Back to square one?
MLP-Mixer was introduced by the same team that introduced the Vision Transformer (ViT).
Model Overview
MLP-Mixer is a pure MLP architecture (duh?). First, we split the image into patches. Then, we convert each patch to a feature embedding through an FC layer. Following that, we send the embeddings through N Mixer layers. Finally, we classify the output through another FC layer. Simple, right?
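As a rough sketch (not the authors' code), the "split into patches + shared per-patch FC layer" step can be written in PyTorch as a single strided convolution. The sizes below (224x224 image, 16x16 patches, 512-dim embedding) are just illustrative choices:

```python
import torch
import torch.nn as nn

# Illustrative sizes: 224x224 image, 16x16 patches, 512-dim patch embedding.
patch_size, dim = 16, 512
img = torch.randn(1, 3, 224, 224)

# "Split into patches, then embed each patch with a shared FC layer" is
# equivalent to one convolution with kernel size = stride = patch size.
to_tokens = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
tokens = to_tokens(img).flatten(2).transpose(1, 2)
print(tokens.shape)  # torch.Size([1, 196, 512]) -> (batch, patches, channels)
```

These patch tokens are what the stack of Mixer layers (sketched in the next section) operates on.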
Mixer Architecture
A Mixer layer can be divided into channel-mixing MLPs (green box) and token-mixing MLPs (orange box).
- Channel-mixing MLPs allow communication between different channels.
- Token-mixing MLPs allow communication between different spatial locations (tokens).
These two types of layers are interleaved to enable the interaction of both input dimensions.
Each MLP is made up of two FC layers with a GELU nonlinearity in between them.
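Here is a minimal PyTorch sketch of one Mixer layer in the spirit of the paper; class names like `Mlp` and `MixerLayer` and the hidden sizes are my own illustrative choices, not the official implementation. The paper also wraps each MLP with LayerNorm and a skip connection, which is included below:

```python
import torch
import torch.nn as nn


class Mlp(nn.Module):
    """Two FC layers with a GELU in between, as described above."""
    def __init__(self, dim, hidden_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, dim),
        )

    def forward(self, x):
        return self.net(x)


class MixerLayer(nn.Module):
    """One Mixer layer: token-mixing MLP then channel-mixing MLP,
    each preceded by LayerNorm and wrapped in a skip connection."""
    def __init__(self, num_patches, dim, token_hidden=256, channel_hidden=2048):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.token_mlp = Mlp(num_patches, token_hidden)  # mixes across patches (tokens)
        self.norm2 = nn.LayerNorm(dim)
        self.channel_mlp = Mlp(dim, channel_hidden)      # mixes across channels

    def forward(self, x):  # x: (batch, patches, channels)
        # Token mixing acts on the patch dimension, so transpose in and out.
        x = x + self.token_mlp(self.norm1(x).transpose(1, 2)).transpose(1, 2)
        # Channel mixing acts on the channel dimension directly.
        x = x + self.channel_mlp(self.norm2(x))
        return x


# Example: stack N Mixer layers on the patch tokens from the earlier snippet.
tokens = torch.randn(1, 196, 512)
mixer = nn.Sequential(*[MixerLayer(num_patches=196, dim=512) for _ in range(8)])
out = mixer(tokens)  # shape stays (1, 196, 512)
```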
Variants of MLP-Mixer
Results
The authors took the two largest configurations of MLP-Mixer and compared them with the SOTA. They achieve roughly comparable performance. However, MLP-Mixer's performance drops when the training dataset is smaller.
It is also observed that MLP-Mixer and ViT have similar transfer accuracy and throughput, both of which are better than ResNet's.
Code
The official implementation is available in google-research/vision_transformer, and timm also provides one. Other PyTorch implementations include lucidrains/mlp-mixer-pytorch and rishikksh20/MLP-Mixer-pytorch.
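If you just want to experiment, timm exposes pretrained Mixer models. A minimal usage sketch (the model name "mixer_b16_224" is an assumption and may differ across timm versions):

```python
import timm
import torch

# Load a pretrained MLP-Mixer from timm and run a dummy image through it.
model = timm.create_model("mixer_b16_224", pretrained=True)
model.eval()
with torch.no_grad():
    logits = model(torch.randn(1, 3, 224, 224))
print(logits.shape)  # torch.Size([1, 1000])
```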
References
- “MLP-Mixer: An all-MLP Architecture for Vision” https://arxiv.org/pdf/2105.01601.pdf
- “An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale” https://arxiv.org/pdf/2010.11929
- https://github.com/google-research/vision_transformer
- https://github.com/rwightman/pytorch-image-models/blob/master/timm/models/mlp_mixer.py
- https://github.com/lucidrains/mlp-mixer-pytorch
- https://github.com/rishikksh20/MLP-Mixer-pytorch
Make MLP great again?