MobileNet Architectures
In MobileNet V1, the architecture is quite simple: a standard conv layer, followed by 13 depthwise-separable blocks, and finally an average pool, a fully-connected layer, and a softmax layer. In total there are 1000 classes.
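For reference, a depthwise-separable layer factors a standard convolution into a per-channel (depthwise) 3x3 convolution followed by a 1x1 pointwise convolution. A minimal PyTorch sketch of one such block (the channel sizes are whatever the caller passes in; the BN/ReLU placement follows the usual MobileNet V1 pattern):

```python
import torch.nn as nn

def depthwise_separable(in_ch, out_ch, stride=1):
    # Depthwise: one 3x3 filter per input channel (groups=in_ch),
    # then a 1x1 pointwise conv that mixes channels and changes their count.
    return nn.Sequential(
        nn.Conv2d(in_ch, in_ch, kernel_size=3, stride=stride,
                  padding=1, groups=in_ch, bias=False),
        nn.BatchNorm2d(in_ch),
        nn.ReLU(inplace=True),
        nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )
```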
In MobileNet V2 (Sandler et al., 2018), the biggest difference is the introduction of the “inverted bottleneck block”. The inverted bottleneck block adds a skip connection that feeds into the beginning of the next bottleneck block. Additionally, an expansion and a projection step are added inside the bottleneck block.
In a bottleneck block,
- dimensions are jacked up so the network can learn a richer function,
- depthwise convolutions are performed,
- and the result is projected back down before moving to the next bottleneck block, so memory usage won’t be too big and the model size can be smaller.
Overall, MobileNet V2 has 155 layers (3.5M params). There are 16 inverted residual (bottleneck) blocks. Some blocks have skip connections to the block after the next one. Usually, in each bottleneck block, after the expansion and the depthwise convolution there is a batch normalization and a ReLU6; after the downsizing projection there is only a batch normalization.
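Putting the pieces together, here is a minimal sketch of one inverted residual block, assuming an expansion factor of 6 and applying the skip connection only when the stride is 1 and the input/output channel counts match (the only case where the residual add is shape-compatible):

```python
import torch.nn as nn

class InvertedResidual(nn.Module):
    def __init__(self, in_ch, out_ch, stride=1, expand=6):
        super().__init__()
        hidden = in_ch * expand
        self.use_skip = (stride == 1 and in_ch == out_ch)
        self.block = nn.Sequential(
            # Expansion: 1x1 conv "jacks up" the channel dimension
            nn.Conv2d(in_ch, hidden, kernel_size=1, bias=False),
            nn.BatchNorm2d(hidden),
            nn.ReLU6(inplace=True),
            # Depthwise 3x3 conv: one filter set per channel
            nn.Conv2d(hidden, hidden, kernel_size=3, stride=stride,
                      padding=1, groups=hidden, bias=False),
            nn.BatchNorm2d(hidden),
            nn.ReLU6(inplace=True),
            # Projection: 1x1 conv back down; batch norm, no activation
            nn.Conv2d(hidden, out_ch, kernel_size=1, bias=False),
            nn.BatchNorm2d(out_ch),
        )

    def forward(self, x):
        out = self.block(x)
        return x + out if self.use_skip else out
```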
Data
- We are using the COCO dataset to train a multi-label classifier with a MobileNetV2. The dataloading code can be found here.
- A common way to frame this problem is to convert it into a set of binary classification problems, one per class. The way to do this is to create a multi-hot vector in which a class’s entry is 1.0 if that class is in the image’s class list:
```
(class_1_probability, class_2_probability ...)
[1.0, 0.5, 1.0, ...]
```
During training, we will calculate Binary Cross Entropy loss on this multi-hot vector. See the Model section for more details.
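As a concrete sketch of that setup (the 80-class count matches COCO's object categories; the class indices here are made up):

```python
import torch
import torch.nn as nn

num_classes = 80                      # COCO object categories
classes_in_image = [0, 23, 56]        # illustrative: classes present in one image

# Multi-hot target: 1.0 for every class present, 0.0 elsewhere
target = torch.zeros(num_classes)
target[classes_in_image] = 1.0

logits = torch.randn(1, num_classes)  # raw model outputs, no sigmoid applied
loss = nn.BCEWithLogitsLoss()(logits, target.unsqueeze(0))
```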
Model
- In this implementation, we added dilated convolution as well to increase the receptive field range.
- Initialization method is conventional. We are doing:
  - Conv2D layers: He initialization on weight matrices, zero initialization on biases
  - Linear layers: normal initialization with mean=0, std_dev=1.0, zero initialization on biases
  - Batch norm layers: zero initialization on the mean, one initialization on the std_dev
- ReLU6 is said to be more robust in low-precision training.
- It’s generally a good idea to use inplace=True to avoid out-of-memory errors.
- MobileNet V2 uses dilated convolution to increase receptive fields.
- torch.nn.Conv2d(groups=...):
  - At groups=1, all inputs are convolved to all outputs.
  - At groups=2, the operation becomes equivalent to having two conv layers side by side, each seeing half the input channels and producing half the output channels, with both outputs concatenated.
  - At groups=in_channels, each input channel is convolved with its own set of filters (i.e., a depthwise convolution). A small parameter-count demo is shown after the model summary below.
- FP16 & float32 mixed precision training was also used.
- At the end of the model, we have one dense layer as a “classifier”.
- The classifier does NOT have a sigmoid at the end. The model is trained for multi-label classification, so a sigmoid (not softmax) layer should follow it, but this is handled inside nn.BCEWithLogitsLoss.
- The classifier dim is the same as the number of classes in the dataset. So if we train across multiple datasets, we have to map the class indices across datasets correctly.
- Model Summary: 141 layers and 2.3M parameters. This is a modified version of the original model.
```
Total params: 2,326,352
Trainable params: 2,326,352
Non-trainable params: 0
----------------------------------------------------------------
Input size (MB): 0.75
Forward/backward pass size (MB): 1890.44
Params size (MB): 8.87
Estimated Total Size (MB): 1900.06
```
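To make the groups behavior described above concrete, here is a small sketch comparing parameter counts for the three cases (the 16-in / 32-out channel sizes are arbitrary):

```python
import torch.nn as nn

def n_params(m):
    return sum(p.numel() for p in m.parameters())

# groups=1: every output channel sees all 16 input channels
full = nn.Conv2d(16, 32, kernel_size=3, groups=1, bias=False)
# groups=2: two side-by-side convs (8 in -> 16 out each), outputs concatenated
grouped = nn.Conv2d(16, 32, kernel_size=3, groups=2, bias=False)
# groups=in_channels: depthwise conv, each input channel has its own filters
depthwise = nn.Conv2d(16, 32, kernel_size=3, groups=16, bias=False)

print(n_params(full))       # 32 * 16 * 3 * 3 = 4608
print(n_params(grouped))    # 32 *  8 * 3 * 3 = 2304
print(n_params(depthwise))  # 32 *  1 * 3 * 3 =  288
```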
Training Adjustments & Iterations
[1] Training with the nn.BCEWithLogitsLoss() loss

When considering accuracy on both positives and negatives, the accuracies are: training set 97.97%, validation set 97.73%. But I realized that we have way too many negatives. The F1 scores were: training set 0.654, validation set 0.617.
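To illustrate why plain accuracy looks so much better than F1 here, a toy sketch with made-up numbers: when the multi-hot targets are mostly zeros, a model that misses most positives still scores high element-wise accuracy, while micro-averaged F1 exposes the problem.

```python
import numpy as np
from sklearn.metrics import f1_score

# 4 images x 10 classes, one positive label per image (made-up data)
y_true = np.zeros((4, 10), dtype=int)
y_true[0, 0] = 1
y_true[1, 3] = 1
y_true[2, 5] = 1
y_true[3, 9] = 1

# A model that predicts almost everything negative and hits only one positive
y_pred = np.zeros((4, 10), dtype=int)
y_pred[0, 0] = 1

print((y_true == y_pred).mean())                  # 0.925 element-wise accuracy
print(f1_score(y_true, y_pred, average="micro"))  # 0.4   micro F1
```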
- Observations: I don’t see overfitting or underfitting.
- Actions
  - Check for better loss options
    - F1 score (not suitable because it’s not differentiable) TODO: embed link here
      - When calculating the F1 score, we need to count true positives, and that count (and the F1 score itself) is not differentiable unless we track soft, probability-based counts instead of hard ones (as in the Soft F1 Loss below)
    - Hamming loss (not suitable because it’s not differentiable) TODO: embed link here
    - Focal loss: embed link here
    - Soft F1 Loss: embed link here (a minimal sketch follows at the end of this section)
    - F1 Metric Variant
  - Check for training images. Some images are dark after normalization with ImageNet’s mean and std_dev
  - TODO
    - Fine tuning:
      - Freeze early layers. Early layers capture local features; freezing the late layers might be beneficial
    - Check for class imbalances
    - Check for better loss options
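Since the Soft F1 Loss above is still a TODO, here is a minimal, untested sketch of one common formulation: replace the hard true-positive / false-positive / false-negative counts with sums of sigmoid probabilities, so the whole expression stays differentiable and can be minimized directly.

```python
import torch

def soft_f1_loss(logits, targets, eps=1e-8):
    """Differentiable surrogate for (1 - F1) on multi-hot targets."""
    probs = torch.sigmoid(logits)
    # "Soft" counts: probabilities stand in for hard 0/1 predictions
    tp = (probs * targets).sum(dim=0)
    fp = (probs * (1 - targets)).sum(dim=0)
    fn = ((1 - probs) * targets).sum(dim=0)
    soft_f1 = 2 * tp / (2 * tp + fp + fn + eps)
    return 1 - soft_f1.mean()  # average over classes, minimize 1 - F1
```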