Opening Thought: If this were to dominate CV, does that mean the evolution of the CV field is MLP -> CNN -> Transformer -> MLP? Back to square one?
MLP-Mixer was introduced by the same team that introduced the Vision Transformer (ViT).
Model Overview
MLP-Mixer is a pure MLP architecture (duh?). First, we split the image into patches. Then, we convert each patch to a feature embedding through an FC layer. Following that, we send the embeddings through N Mixer layers. Finally, we classify the output through another FC layer. Simple, right?
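As a rough sketch (not the authors' code), the "split into patches + shared per-patch FC layer" step can be written in PyTorch as a single strided convolution. The sizes below (224x224 image, 16x16 patches, 512-dim embedding) are just illustrative choices:

```python
import torch
import torch.nn as nn

# Illustrative sizes: 224x224 image, 16x16 patches, 512-dim patch embedding.
patch_size, dim = 16, 512
img = torch.randn(1, 3, 224, 224)

# "Split into patches, then embed each patch with a shared FC layer" is
# equivalent to one convolution with kernel size = stride = patch size.
to_tokens = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
tokens = to_tokens(img).flatten(2).transpose(1, 2)
print(tokens.shape)  # torch.Size([1, 196, 512]) -> (batch, patches, channels)
```

These patch tokens are what the stack of Mixer layers (sketched in the next section) operates on.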
Mixer Architecture
A Mixer layer can be divided into channel-mixing MLPs (green box) and token-mixing MLPs (orange box).
- Channel-mixing MLPs allow communication between different channels.
- Token-mixing MLPs allow communication between different spatial locations (tokens).
These two types of layers are interleaved to enable the interaction of both input dimensions.
Each MLP is made up of two FC layers with a GELU nonlinearity in between them.
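Here is a minimal PyTorch sketch of one Mixer layer in the spirit of the paper; class names like `Mlp` and `MixerLayer` and the hidden sizes are my own illustrative choices, not the official implementation. The paper also wraps each MLP with LayerNorm and a skip connection, which is included below:

```python
import torch
import torch.nn as nn


class Mlp(nn.Module):
    """Two FC layers with a GELU in between, as described above."""
    def __init__(self, dim, hidden_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, dim),
        )

    def forward(self, x):
        return self.net(x)


class MixerLayer(nn.Module):
    """One Mixer layer: token-mixing MLP then channel-mixing MLP,
    each preceded by LayerNorm and wrapped in a skip connection."""
    def __init__(self, num_patches, dim, token_hidden=256, channel_hidden=2048):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.token_mlp = Mlp(num_patches, token_hidden)  # mixes across patches (tokens)
        self.norm2 = nn.LayerNorm(dim)
        self.channel_mlp = Mlp(dim, channel_hidden)      # mixes across channels

    def forward(self, x):  # x: (batch, patches, channels)
        # Token mixing acts on the patch dimension, so transpose in and out.
        x = x + self.token_mlp(self.norm1(x).transpose(1, 2)).transpose(1, 2)
        # Channel mixing acts on the channel dimension directly.
        x = x + self.channel_mlp(self.norm2(x))
        return x


# Example: stack N Mixer layers on the patch tokens from the earlier snippet.
tokens = torch.randn(1, 196, 512)
mixer = nn.Sequential(*[MixerLayer(num_patches=196, dim=512) for _ in range(8)])
out = mixer(tokens)  # shape stays (1, 196, 512)
```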
Variants of MLP-Mixer
Results
The authors took the two largest configurations of MLP-Mixer and compared them with the SOTA. They achieve roughly comparable performance. However, MLP-Mixer's performance drops when the training dataset is smaller.
It is also observed that MLP-Mixer and ViT have similar transfer accuracy and throughput, both of which are better than ResNet's.
Code
The official implementation is available in google-research/vision_transformer, and timm also provides one. Other PyTorch implementations include lucidrains/mlp-mixer-pytorch and rishikksh20/MLP-Mixer-pytorch.
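If you just want to experiment, timm exposes pretrained Mixer models. A minimal usage sketch (the model name "mixer_b16_224" is an assumption and may differ across timm versions):

```python
import timm
import torch

# Load a pretrained MLP-Mixer from timm and run a dummy image through it.
model = timm.create_model("mixer_b16_224", pretrained=True)
model.eval()
with torch.no_grad():
    logits = model(torch.randn(1, 3, 224, 224))
print(logits.shape)  # torch.Size([1, 1000])
```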
References
- “MLP-Mixer: An all-MLP Architecture for Vision” https://arxiv.org/pdf/2105.01601.pdf
- “An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale” https://arxiv.org/pdf/2010.11929
- https://github.com/google-research/vision_transformer
- https://github.com/rwightman/pytorch-image-models/blob/master/timm/models/mlp_mixer.py
- https://github.com/lucidrains/mlp-mixer-pytorch
- https://github.com/rishikksh20/MLP-Mixer-pytorch
Make MLP great again?