SPP (Spatial Pyramid Pooling) Layer
In deep learning, “spatial dimensions” means height and width. Some networks need a fixed input size because of their fully connected layers (see the sketch after the list below). When does that happen?
- Models like VGGNet, AlexNet, and early versions of ResNet were designed with fixed input sizes in mind
- So previously, images had to be cropped or warped to that fixed size.
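To make the constraint concrete, here is a minimal PyTorch sketch (the layer sizes are made up, loosely AlexNet-like, not taken from any particular paper): convolutions accept any height and width, but a fully connected layer's weight matrix fixes the flattened input length when the network is built.

```python
# Illustrative sketch: convolutions work for any height/width, but the fully
# connected layer's weight matrix fixes the flattened input length at build time.
import torch
import torch.nn as nn

conv = nn.Conv2d(3, 256, kernel_size=3, padding=1)   # spatial size passes through
fc = nn.Linear(256 * 6 * 6, 4096)                     # expects exactly 256*6*6 inputs

print(conv(torch.randn(1, 3, 224, 224)).shape)  # torch.Size([1, 256, 224, 224])
print(conv(torch.randn(1, 3, 300, 300)).shape)  # torch.Size([1, 256, 300, 300])

fc(torch.randn(1, 256 * 6 * 6))     # matches the expected flattened length -> fine
# fc(torch.randn(1, 256 * 8 * 8))   # any other spatial size -> RuntimeError (shape mismatch)
```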
So, to be able to use the dense layers with varying image sizes, we want to “downsize” feature maps to a fixed size. The Spatial Pyramid Pooling layer builds a pyramid of pooling results on the feature maps (see the sketch after this list). E.g.,
- Adjust the pooling window size so we get a 4x4 pooled output from each of the 256 input feature maps
- Adjust the pooling window size so we get a 2x2 pooled output from each of the 256 input feature maps
- Adjust the pooling window size so we get a 1x1 pooled output from each of the 256 input feature maps (global pooling)
- Flatten all pooled outputs, then concatenate them into a single 1-d vector
- The 1-d vector goes into the FC network as usual.
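A minimal SPP layer sketch in PyTorch (my own illustration of the steps above, not the authors' code): each pyramid level uses adaptive max pooling so the output grid is 4x4, 2x2, or 1x1 regardless of the feature-map size, and the flattened levels are concatenated into one fixed-length vector.

```python
# Minimal Spatial Pyramid Pooling sketch (illustrative, not the original implementation).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialPyramidPooling(nn.Module):
    def __init__(self, levels=(4, 2, 1)):
        super().__init__()
        self.levels = levels  # output grids: 4x4, 2x2, 1x1 (global pooling)

    def forward(self, x):
        # x: (batch, channels, H, W) with arbitrary H and W
        pooled = [
            F.adaptive_max_pool2d(x, output_size=n).flatten(start_dim=1)
            for n in self.levels
        ]
        return torch.cat(pooled, dim=1)  # (batch, channels * (16 + 4 + 1))

spp = SpatialPyramidPooling()
for h, w in [(13, 13), (18, 24)]:
    feats = torch.randn(2, 256, h, w)   # conv5-style feature maps of different sizes
    print(spp(feats).shape)             # always torch.Size([2, 5376])
```

Here adaptive pooling is a shortcut for the window/stride arithmetic in the paper (window ≈ ceil(a/n), stride ≈ floor(a/n) for an a×a feature map and an n×n output grid); either way, the output length is 256 × (16 + 4 + 1) = 5376 no matter what the input resolution is.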
The greatest advantage of SPP is that it preserves features at different granularity levels. It also makes it possible to train with images of different sizes.
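The paper implements this with a simple schedule: one set of weights is trained while the input resolution alternates from epoch to epoch (224x224 and 180x180 in the paper). A hedged sketch of that idea follows; `make_loader` and `train_one_epoch` are hypothetical placeholders, not a real API.

```python
# Multi-size training sketch: same weights, alternating input resolution per epoch.
sizes = [224, 180]
num_epochs = 10

for epoch in range(num_epochs):
    size = sizes[epoch % len(sizes)]        # switch resolution each epoch
    # loader = make_loader(train_set, resize_to=size)   # hypothetical data pipeline
    # train_one_epoch(model, loader)                    # same weights for both sizes
```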
SPP as a concept is an extension of Bag-of-Words; it was brought to CNNs in 2014 by He et al. (Microsoft) in “Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition”.
Key Insights During Training
Here is an illustration of error rates for multiple models with SPP and multi-size training. As can be seen, SPP and multi-size training additively improve model accuracy.
Background Information
- Authors: Kaiming He et al. (Microsoft)
- Competitions: 2nd place in ILSVRC 2014 object detection, 3rd place in ILSVRC 2014 classification
- Datasets:
Dataset | Task | Size | Partitioning |
---|---|---|---|
ImageNet 2012 | Classification | 1.2M+ | Train: 1.2M, Val: 50K, Test: 100K |
Pascal VOC 2007 | Detection | ~10K | Train/Val/Test |
Caltech101 | Classification | ~9K | Train: 30/category, Rest: Test |
- Comparisons with SOTA models (2014)
Model | Year | Classification Accuracy (ImageNet) | Detection mAP (Pascal VOC 2007) | Key Innovations |
---|---|---|---|---|
SPP-net | 2014 | 84.5% | 59.2% | Spatial pyramid pooling, No fixed-size input |
AlexNet | 2012 | 81.8% | 58.0% | First deep CNN on ImageNet |
VGG | 2014 | 86.8% | 66.3% | Deep, uniform 3x3 convolutional layers |
GoogLeNet (Inception) | 2014 | 89.2% | 62.1% | Inception modules, deeper network |
R-CNN | 2014 | - | 53.3% | Region proposals, CNN for detection |
Overfeat | 2014 | 81.9% | 53.9% | Multi-scale sliding windows |
Ablation Study
An ablation study in deep learning selectively “ablates” (removes) a part of a model, e.g., a layer or an activation function, to study its impact on overall performance.
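A hedged sketch of how such a study might be set up (reusing the `SpatialPyramidPooling` class from the sketch above; `build_model` and `train_and_evaluate` are hypothetical, and the layer sizes are made up): build two otherwise identical models, one with the component under study and one without, train both the same way, and compare validation metrics.

```python
# Ablation sketch: identical models except for the pooling component under study.
import torch.nn as nn

def build_model(use_spp: bool) -> nn.Module:
    backbone = nn.Sequential(nn.Conv2d(3, 256, kernel_size=3, padding=1), nn.ReLU())
    if use_spp:
        pool = SpatialPyramidPooling()                 # component under study (defined above)
        head = nn.Linear(256 * (16 + 4 + 1), 10)
    else:
        pool = nn.Sequential(nn.AdaptiveMaxPool2d(7), nn.Flatten())
        head = nn.Linear(256 * 7 * 7, 10)
    return nn.Sequential(backbone, pool, head)

# for use_spp in (True, False):
#     model = build_model(use_spp)
#     acc = train_and_evaluate(model)    # hypothetical training/evaluation loop
#     print(f"SPP={use_spp}: val accuracy {acc:.3f}")
```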