Vision Transformer — Understanding the underlying concept!

hongvin
6 min read · May 3, 2021

Vision Transformer at a glance

Transformers are widely used in the natural language processing (NLP) field. Because of their success in NLP, researchers have tried to bring them to the computer vision world, aiming to remove convolution-based networks altogether.

However, images are not the same as words, so some pre-processing is needed to make an image look like a sequence of words, hence the title of the paper: "An Image Is Worth 16x16 Words: Transformers for Image Recognition at Scale". The figure above should give you an overview of how it works. We will look at the details in the following sections.

Take an image of size 224 × 224. To feed this image to a transformer naively, every pixel would need to attend to every other pixel; that is why it is called attention! Mathematically, that would cost on the order of 224⁴ (about 2.5 billion) attention weights per image. How is that possible with a decent computer? Attention over individual pixels is therefore not feasible; the image must first be turned into a much shorter sequence of patches. Let's look at how the architecture works!

Vision Transformer Architecture

Patch Embedding

A standard Transformer takes a 1D sequence of token embeddings as input. Therefore, for a 2D image, we need to reshape the image into a sequence of flattened 2D patches.

Let's take an image of size 48 × 48 and assume we split it into fixed-size patches of 16 × 16; we will then have 9 patches. Just take (48/16) × (48/16) and you get 9 patches. Let's illustrate that in the picture below.

The image dimensions must be divisible by the patch size; padding can be used to make up for any missing pixels.

Reshaping of image into patches
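As a concrete illustration, here is a minimal PyTorch sketch of the patch split above (the 48 × 48 shape and the random data are assumptions chosen to match the example):

import torch

# split a 48x48 RGB image into (48/16) x (48/16) = 9 patches of 16x16
img = torch.randn(3, 48, 48)                        # (C, H, W)
patches = img.unfold(1, 16, 16).unfold(2, 16, 16)   # (C, 3, 3, 16, 16)
print(patches.shape[1] * patches.shape[2])          # 9 patches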

Up to now, the image patches are still in 2D form (each patch has size 16×16×3, assuming an RGB image), so each patch must be flattened into a 1D vector. The flattened patches are then linearly projected into d-dimensional vectors using a learned embedding matrix E.

Next, to retain positional information, a positional embedding is added to each element of the sequence before it is passed to the transformer encoder. Each positional embedding is just a learnable vector, one per position in the sequence, and together they form a learnable positional embedding table.

Similar to BERT's [cls] token, a learnable classification token is prepended to the sequence of embedded patches. It is used to aggregate the representations of the patches: the state of this token at the output of the transformer encoder serves as the image representation y.

To summarize, before the sequence is passed to the transformer encoder, we perform the following steps.

  1. An image of size (H×W×C) is split into N patches of size (P×P×C), where N = HW/P².
  2. Each patch is flattened into a vector of shape (1×P²·C).
  3. Each flattened patch is multiplied by the embedding matrix E of shape (P²·C×d), giving an embedded patch of shape (1×d), where d is the model dimension.
  4. A learnable [cls] token is prepended to the sequence of patch embeddings.
  5. A positional embedding is added to the sequence; it learns the positional information for each patch.

All of the steps above correspond to the following equation (equation (1) in the ViT paper [2]).
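$$\mathbf{z}_0 = [\mathbf{x}_{\text{class}};\ \mathbf{x}_p^1\mathbf{E};\ \mathbf{x}_p^2\mathbf{E};\ \cdots;\ \mathbf{x}_p^N\mathbf{E}] + \mathbf{E}_{pos}, \qquad \mathbf{E} \in \mathbb{R}^{(P^2 \cdot C)\times d},\ \mathbf{E}_{pos} \in \mathbb{R}^{(N+1)\times d}$$

Putting steps 1 to 5 together, here is a minimal PyTorch sketch of this patch-embedding stage. The shapes follow the 48 × 48 example with d = 768, and all variable names are illustrative rather than taken from any library:

import torch
from torch import nn

C, H, W, P, d = 3, 48, 48, 16, 768
N = (H // P) * (W // P)                        # 9 patches

img = torch.randn(C, H, W)
# steps 1-2: split into N patches and flatten each into a P*P*C vector
patches = img.unfold(1, P, P).unfold(2, P, P)
patches = patches.permute(1, 2, 0, 3, 4).reshape(N, P * P * C)   # (N, P^2*C)

E = nn.Linear(P * P * C, d, bias=False)        # step 3: learned embedding matrix E
cls_token = nn.Parameter(torch.zeros(1, d))    # step 4: learnable [cls] token
E_pos = nn.Parameter(torch.zeros(N + 1, d))    # step 5: learnable positional embedding table

z0 = torch.cat([cls_token, E(patches)], dim=0) + E_pos
print(z0.shape)                                # torch.Size([10, 768]) -> the input sequence z_0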

Transformer Encoder

After the embedded patch sequence is obtained, it is passed to the transformer encoder. The transformer encoder is made up of L identical layers, and each layer has two main components: a multi-head self-attention (MSA) block and a multilayer perceptron (MLP) block. Layer normalization (LN) is applied before each block, and a residual connection is added after each block. The MLP consists of two fully-connected (FC) layers with a GELU non-linearity in between.
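Here is a minimal sketch of one such encoder layer in PyTorch. The hyperparameters (d = 768, 12 heads, MLP size 3072) match ViT-Base; the class itself is illustrative, not the paper's reference implementation:

import torch
from torch import nn

class EncoderLayer(nn.Module):
    def __init__(self, d=768, heads=12, mlp_dim=3072):
        super().__init__()
        self.ln1 = nn.LayerNorm(d)
        self.msa = nn.MultiheadAttention(d, heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d)
        self.mlp = nn.Sequential(nn.Linear(d, mlp_dim), nn.GELU(), nn.Linear(mlp_dim, d))

    def forward(self, z):
        # MSA block: layer norm before, residual connection after
        h = self.ln1(z)
        z = z + self.msa(h, h, h, need_weights=False)[0]
        # MLP block: layer norm before, residual connection after
        return z + self.mlp(self.ln2(z))

x = torch.randn(1, 10, 768)           # (batch, 1 cls token + 9 patches, d)
print(EncoderLayer()(x).shape)        # torch.Size([1, 10, 768])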

Multihead Self-Attention (MSA) Block

An attention function maps a query and a set of key-value pairs to an output. Self-attention (SA) is such an attention mechanism: it takes a query Q, a key K and a value V as input. The output is a weighted sum of the value vectors, where the weights are computed by taking the dot product of Q and K, dividing by the square root of the key dimension, and applying a softmax. This calculation is based on the equation below.
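This is the scaled dot-product attention from "Attention Is All You Need" [3]:

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$$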

Multi-head self-attention (MSA) splits the input into h smaller parts and computes the scaled dot-product attention for each of them in parallel. Projecting Q, K and V multiple times benefits the model as a whole: MSA can attend to information from different representation subspaces at different positions. A minimal sketch is given below.
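The following sketch shows multi-head scaled dot-product attention with shapes assumed for illustration (batch 1, sequence length 10, d = 768, h = 12 heads). The learned projections of Q, K, V and the final output projection are omitted for brevity:

import torch
import torch.nn.functional as F

def multihead_self_attention(q, k, v, h=12):
    b, n, d = q.shape
    d_k = d // h
    # reshape (b, n, d) -> (b, h, n, d_k) so each head attends independently
    q, k, v = (t.view(b, n, h, d_k).transpose(1, 2) for t in (q, k, v))
    weights = F.softmax(q @ k.transpose(-2, -1) / d_k ** 0.5, dim=-1)   # softmax(QK^T / sqrt(d_k))
    out = weights @ v                                                   # weighted sum of the values
    return out.transpose(1, 2).reshape(b, n, d)                         # concatenate the h heads

z = torch.randn(1, 10, 768)
print(multihead_self_attention(z, z, z).shape)   # torch.Size([1, 10, 768])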

Variants of Vision Transformer

ViT comes in three variants, namely Base, Large and Huge. ViT-Base has 12 layers, hidden size 768 and about 86M parameters; ViT-Large has 24 layers, hidden size 1024 and about 307M parameters; ViT-Huge has 32 layers, hidden size 1280 and about 632M parameters.
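In timm these roughly correspond to model names such as vit_base_patch16_224 and vit_large_patch16_224 (the exact names available depend on your timm version); you can list them with:

import timm
print(timm.list_models('vit_*'))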

The Results

[Update on 05 May]

Train with FastAI + timm

1. Install the required libraries. (I used fastai 2.2.3 because of the stability of accuracy, which drops with 2.2.7.)

pip install timm fastai==2.2.3

2. Import the libraries

from fastai.vision.all import *
import timm

3. Load the images. (Load from a folder structured so that every subfolder is a class name; every image is resized to 224.)

dls = ImageDataLoaders.from_folder('/home/ubuntu/v2', train='train', valid='valid', bs=32, item_tfms=Resize(224, ResizeMethod.Squish), seed=42)

4. Build our ViT. (Remove the last classifier head and add a new one with the desired number of output classes, for example 10 in this case. More models can be found here.)

model = timm.create_model('vit_base_patch32_224', pretrained=True)

# freeze the pretrained backbone
for param in model.parameters():
    param.requires_grad = False

# replace the classifier head with a new 10-class output layer
outputs_attrs = 10
num_inputs = model.head.in_features
last_layer = nn.Linear(num_inputs, outputs_attrs)
model.head = last_layer

5. Start training

learn = Learner(dls, model, metrics=accuracy)  # wrap the data and model in a fastai Learner
learn.fit_one_cycle(30, lr_max=0.003)

6. Add the test dataset (optional) and show the classification report.

import os

# collect test image paths and ground-truth labels from folders c0..c9
filename, gt = [], []
for j in range(10):
    for i in os.listdir(f'/home/ubuntu/v2/test/c{j}/'):
        if i.endswith('jpg'):
            filename.append(f'/home/ubuntu/v2/test/c{j}/{i}')
            gt.append(j)

# run inference on the test set and take the argmax class per image
test_dl = dls.test_dl(filename)
preds, y = learn.get_preds(dl=test_dl)
predd = [torch.argmax(preds[i]).item() for i in range(len(preds))]

from sklearn.metrics import confusion_matrix, classification_report
print(classification_report(gt, predd, digits=4))

References

[1] Devlin J, Chang MW, Lee K, Toutanova K (2018) Bert: Pre-training of
deep bidirectional transformers for language understanding. arXiv preprint
arXiv:181004805

[2] Dosovitskiy A, Beyer L, Kolesnikov A, Weissenborn D, Zhai X, Unterthiner
T, Dehghani M, Minderer M, Heigold G, Gelly S, et al. (2020) An image
is worth 16x16 words: Transformers for image recognition at scale. arXiv
preprint arXiv:201011929

[3] Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN,
Kaiser L, Polosukhin I (2017) Attention is all you need. arXiv preprint
arXiv:170603762
