Deep Learning - MobileNetV2 Hands-On

Inverted Skip Connection, Multiclass Classification

Posted by Rico's Nerd Cluster on August 1, 2022

MobileNet Architectures

In MobileNet V1, the architecture is quite simple: one standard conv layer, followed by 13 depthwise-separable conv blocks, and finally an average pool, a fully connected (FC) layer, and a softmax. The output covers 1000 classes (ImageNet).

In MobileNet V2 (Sandler et al., 2018), the biggest difference is the introduction of the “inverted bottleneck block”. The inverted bottleneck block adds a skip connection that carries the block’s input forward to the beginning of the next bottleneck block. Additionally, an expansion and a projection are added inside the bottleneck block.

In a bottleneck block:

  1. the channel dimension is expanded so the network can learn a richer function;
  2. a depthwise convolution is applied;
  3. the result is projected back down before moving to the next bottleneck block, which keeps memory usage and model size small.

Overall, MobileNet V2 has 155 layers (3.5M params). There are 16 inverted residual (bottleneck) blocks. Some blocks have skip connections to the block after the next one. Usually, in each bottleneck block, the expansion and the depthwise convolution are each followed by batch normalization and ReLU6. The downsizing projection is followed by batch normalization only.
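To make the expand / depthwise / project sequence concrete, here is a minimal PyTorch sketch of one bottleneck block. It is not the code used in this post: the class name `InvertedResidual`, the default `expand_ratio=6` (the value commonly used in the paper), and the rule "skip only when stride is 1 and channel counts match" are assumptions for illustration.

```python
import torch
from torch import nn


class InvertedResidual(nn.Module):
    """Sketch of one MobileNet V2 bottleneck block (expand -> depthwise -> project)."""

    def __init__(self, in_ch: int, out_ch: int, stride: int = 1, expand_ratio: int = 6):
        super().__init__()
        hidden = in_ch * expand_ratio
        # A skip connection is only possible when input and output shapes match.
        self.use_skip = stride == 1 and in_ch == out_ch
        self.block = nn.Sequential(
            # 1. Expansion: a 1x1 conv raises the channel count.
            nn.Conv2d(in_ch, hidden, kernel_size=1, bias=False),
            nn.BatchNorm2d(hidden),
            nn.ReLU6(inplace=True),
            # 2. Depthwise 3x3 conv: groups=hidden gives each channel its own filter.
            nn.Conv2d(hidden, hidden, kernel_size=3, stride=stride, padding=1,
                      groups=hidden, bias=False),
            nn.BatchNorm2d(hidden),
            nn.ReLU6(inplace=True),
            # 3. Projection: a 1x1 conv back down, batch norm only (no activation).
            nn.Conv2d(hidden, out_ch, kernel_size=1, bias=False),
            nn.BatchNorm2d(out_ch),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.block(x)
        return x + out if self.use_skip else out


block = InvertedResidual(in_ch=32, out_ch=32)
y = block(torch.randn(2, 32, 56, 56))
print(y.shape)  # torch.Size([2, 32, 56, 56])
```

With `stride=2` or a different `out_ch`, the same class drops the skip connection and simply returns the projected output.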


  • We are using the COCO dataset to train a multi-label classifier (one image can contain several classes) with MobileNetV2. The dataloading code can be found here

  • A common way to frame this problem is to convert it into one binary classification problem per class. To do that, create a multi-hot vector in which a class’s entry is 1.0 if the class is in the image’s class list, and 0.0 otherwise

(class_1_probability, class_2_probability ...)
[1.0, 0.0, 1.0, ...]

During training, we will calculate Binary Cross Entropy loss on this multi-hot vector. See the Model section for more details.
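As a rough illustration of the target encoding, the sketch below builds a multi-hot vector from a list of class indices and feeds it to `nn.BCEWithLogitsLoss`. The helper name `to_multi_hot` and the value of `NUM_CLASSES` are hypothetical; the actual dataloading code may differ.

```python
import torch
from torch import nn

NUM_CLASSES = 5  # hypothetical; the real value depends on the dataset


def to_multi_hot(class_ids, num_classes: int) -> torch.Tensor:
    """Return a float vector with 1.0 at every class index present in the image."""
    target = torch.zeros(num_classes)
    index = torch.tensor(list(class_ids), dtype=torch.long)
    target[index] = 1.0
    return target


target = to_multi_hot([0, 2], NUM_CLASSES)
print(target)  # tensor([1., 0., 1., 0., 0.])

# Stand-in for raw model outputs: all-zero logits mean sigmoid(logit) = 0.5 everywhere.
logits = torch.zeros(1, NUM_CLASSES)
loss = nn.BCEWithLogitsLoss()(logits, target.unsqueeze(0))
print(loss.item())  # about 0.6931 (ln 2)
```

Each class contributes its own binary cross entropy term, so the loss is the mean over all class entries.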


  • In this implementation, we added dilated convolution as well to increase the receptive field range.

  • Initialization method is conventional. We are doing:
    • Conv 2D layers:
      • He initialization on weight matrices
      • Zero initialization on biases
    • Linear layers:
      • Normal initialization with mean=0, std_dev=1.0
      • Zero initialization on biases
    • Batch norm layers:
      • Zero initialization on the shift (bias)
      • One initialization on the scale (weight)
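  • ReLU6 is used because it is more robust in low-precision computation (Sandler et al., 2018).
  • It’s generally a good idea to use inplace=True (e.g., in nn.ReLU6) to reduce memory usage and lower the risk of out-of-memory errors.
  • Dilated convolution (added in this implementation, as noted above) enlarges the receptive field without adding parameters.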
  • torch.nn.Conv2d(groups):
    • At groups=1, all inputs are convolved to all outputs.
    • At groups=2, the operation becomes equivalent to having two conv layers side by side, each seeing half the input channels and producing half the output channels, and both subsequently concatenated.
    • At groups=in_channels, each input channel is convolved with its own set of filters. This is how a depthwise convolution is expressed.
  • FP16 & float32 mixed precision training was also used.
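The initialization scheme listed above could be written roughly as follows. This is a sketch under assumptions: the function name `init_weights` is hypothetical, He initialization is taken to mean `kaiming_normal_` with `mode="fan_out"`, and the batch norm "mean" and "std_dev" are taken to mean the layer's learnable bias and weight.

```python
import torch
from torch import nn


def init_weights(module: nn.Module) -> None:
    """Initialize one module according to its layer type."""
    if isinstance(module, nn.Conv2d):
        # He initialization on the weights, zeros on the bias (if present).
        nn.init.kaiming_normal_(module.weight, mode="fan_out", nonlinearity="relu")
        if module.bias is not None:
            nn.init.zeros_(module.bias)
    elif isinstance(module, nn.Linear):
        # Normal(0, 1.0) on the weights, zeros on the bias (if present).
        nn.init.normal_(module.weight, mean=0.0, std=1.0)
        if module.bias is not None:
            nn.init.zeros_(module.bias)
    elif isinstance(module, nn.BatchNorm2d):
        # Scale of one, shift of zero.
        nn.init.ones_(module.weight)
        nn.init.zeros_(module.bias)


# Tiny stand-in model, only to show how the function is applied.
model = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.BatchNorm2d(8),
    nn.ReLU6(inplace=True),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(8, 4),
)
model.apply(init_weights)  # apply() visits every submodule recursively
```

`model.apply` keeps the initialization logic in one place, so new layer types only need one extra branch.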

  • At the end of the model, we have one dense layer as a “classifier”.
    • The classifier does NOT have a sigmoid at the end. Because the model is trained for multi-label classification, a sigmoid (not softmax) should follow it, but that is handled inside nn.BCEWithLogitsLoss.
    • The classifier dim is the same as the number of classes in the dataset. So if we train across multiple datasets, we have to map their class indices correctly.
  • Model Summary: 141 layers and 2.3M parameters. This is a modified version of the original model.
Total params: 2,326,352
Trainable params: 2,326,352
Non-trainable params: 0
Input size (MB): 0.75
Forward/backward pass size (MB): 1890.44
Params size (MB): 8.87
Estimated Total Size (MB): 1900.06
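To illustrate why the classifier can omit the sigmoid, the sketch below checks that `nn.BCEWithLogitsLoss` on raw logits matches `nn.BCELoss` on sigmoid outputs, and shows the explicit sigmoid that is still needed at inference time. `FEATURE_DIM`, `NUM_CLASSES`, and the 0.5 threshold are hypothetical values, not taken from this model.

```python
import torch
from torch import nn

torch.manual_seed(0)

NUM_CLASSES = 4     # hypothetical
FEATURE_DIM = 1280  # hypothetical width of the pooled feature vector

classifier = nn.Linear(FEATURE_DIM, NUM_CLASSES)  # no sigmoid at the end

features = torch.randn(2, FEATURE_DIM)
logits = classifier(features)
targets = torch.tensor([[1.0, 0.0, 1.0, 0.0],
                        [0.0, 0.0, 0.0, 1.0]])

# Training: the fused loss applies the sigmoid internally (and more stably).
loss_fused = nn.BCEWithLogitsLoss()(logits, targets)
loss_manual = nn.BCELoss()(torch.sigmoid(logits), targets)
print(torch.allclose(loss_fused, loss_manual, atol=1e-6))  # True

# Inference: apply the sigmoid explicitly, then threshold each class on its own.
with torch.no_grad():
    probs = torch.sigmoid(logits)
    predicted = probs > 0.5  # assumed threshold
```

Unlike softmax, the per-class sigmoid lets several classes be predicted for the same image.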

Training Adjustments & Iterations

[1] Training with the nn.BCEWithLogitsLoss() loss

When counting both positives and negatives, the accuracies are: training set 97.97%, validation set 97.73%. But I realized that we have way too many negatives, which inflates accuracy. The F1 scores were: training set 0.654, validation set 0.617.
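A toy example shows how element-wise accuracy can look excellent while F1 stays modest when negatives dominate. The helper `multilabel_metrics` and the numbers below are made up for illustration; they are not the evaluation code or data from this experiment.

```python
import torch


def multilabel_metrics(pred: torch.Tensor, target: torch.Tensor):
    """Element-wise accuracy, precision, recall, and F1 for boolean multi-hot tensors."""
    tp = (pred & target).sum().item()
    fp = (pred & ~target).sum().item()
    fn = (~pred & target).sum().item()
    tn = (~pred & ~target).sum().item()
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1


# One image, 100 classes, 4 true labels; the model finds only 2 of them.
target = torch.zeros(1, 100, dtype=torch.bool)
target[0, :4] = True
pred = torch.zeros(1, 100, dtype=torch.bool)
pred[0, :2] = True

acc, precision, recall, f1 = multilabel_metrics(pred, target)
print(acc, precision, recall, f1)  # accuracy 0.98, precision 1.0, recall 0.5, F1 about 0.667
```

The 96 true negatives dominate the accuracy, while F1 ignores them and exposes the missed positives.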

  • Observations: I don’t see overfitting or underfitting.
  • Actions

[2] Training Model From [1] With Focal Loss

  • Observations: there is clear overfitting on the training set.
    • training set: F1 score: 60.78%, precision: 91.18%, recall: 45.59%
    • validation set: F1 score: 59.83%, precision: 86.82%, recall: 45.64%
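For reference, a binary focal loss on logits can be sketched as below, following the usual formulation from Lin et al. (2017). The function name `focal_loss_with_logits` and the defaults `gamma=2.0` and `alpha=0.25` are assumptions; the settings actually used in this run are not stated.

```python
import torch
import torch.nn.functional as F


def focal_loss_with_logits(logits, targets, gamma: float = 2.0, alpha: float = 0.25):
    """Binary focal loss averaged over all elements (a sketch, not the post's code)."""
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    # p_t is the predicted probability of the true label for each element.
    p_t = p * targets + (1 - p) * (1 - targets)
    # alpha_t weights positives by alpha and negatives by (1 - alpha).
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    # (1 - p_t) ** gamma shrinks the loss of elements that are already easy.
    return (alpha_t * (1 - p_t) ** gamma * bce).mean()


logits = torch.tensor([[4.0, -4.0, 0.0]])
targets = torch.tensor([[1.0, 0.0, 1.0]])

plain = F.binary_cross_entropy_with_logits(logits, targets)
focal = focal_loss_with_logits(logits, targets)
print(plain.item(), focal.item())  # the focal loss is much smaller than plain BCE here
```

With `gamma=0` and `alpha=0.5` the expression reduces to half of the plain binary cross entropy, which is a handy sanity check.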

[3] Retrain From Scratch with F1 Metric

  • F1 score: 55.69%, precision: 89.31%, recall: 40.46%
  • F1 score: 52.83%, precision: 90.05%, recall: 37.38%

  • Actions:
    • Retrain without Normalization
    • Or continue training but with a different loss function that focuses on recall?