SPP (Spatial Pyramid Pooling) Layer
In deep learning, “spatial dimensions” means height and width. Some networks need a fixed input size because of their fully connected layers (see the sketch after the list below). When does that happen?
- Models like VGGNet, AlexNet, and early versions of ResNet were designed with fixed input sizes in mind
- So previously, images had to be cropped or warped to that fixed size.
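To make the constraint concrete, here is a minimal PyTorch sketch (the layer sizes are made up, loosely AlexNet-like, not taken from any particular paper): convolutions accept any height and width, but a fully connected layer's weight matrix fixes the flattened input length when the network is built.

```python
# Illustrative sketch: convolutions work for any height/width, but the fully
# connected layer's weight matrix fixes the flattened input length at build time.
import torch
import torch.nn as nn

conv = nn.Conv2d(3, 256, kernel_size=3, padding=1)   # spatial size passes through
fc = nn.Linear(256 * 6 * 6, 4096)                     # expects exactly 256*6*6 inputs

print(conv(torch.randn(1, 3, 224, 224)).shape)  # torch.Size([1, 256, 224, 224])
print(conv(torch.randn(1, 3, 300, 300)).shape)  # torch.Size([1, 256, 300, 300])

fc(torch.randn(1, 256 * 6 * 6))     # matches the expected flattened length -> fine
# fc(torch.randn(1, 256 * 8 * 8))   # any other spatial size -> RuntimeError (shape mismatch)
```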
So, to be able to use the dense layers with varying image sizes, we want to “downsize” feature maps to a fixed size. The Spatial Pyramid Pooling layer builds a pyramid of pooling results on the feature maps (see the sketch after this list). E.g.,
- Adjust the pooling window size so we get a 4x4 pooled output from each of the 256 input feature maps
- Adjust the pooling window size so we get a 2x2 pooled output from each of the 256 input feature maps
- Adjust the pooling window size so we get a 1x1 pooled output from each of the 256 input feature maps (global pooling)
- Flatten all pooled outputs, then concatenate them into a single 1-d vector
- The 1-d vector goes into the FC network as usual.
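A minimal SPP layer sketch in PyTorch (my own illustration of the steps above, not the authors' code): each pyramid level uses adaptive max pooling so the output grid is 4x4, 2x2, or 1x1 regardless of the feature-map size, and the flattened levels are concatenated into one fixed-length vector.

```python
# Minimal Spatial Pyramid Pooling sketch (illustrative, not the original implementation).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialPyramidPooling(nn.Module):
    def __init__(self, levels=(4, 2, 1)):
        super().__init__()
        self.levels = levels  # output grids: 4x4, 2x2, 1x1 (global pooling)

    def forward(self, x):
        # x: (batch, channels, H, W) with arbitrary H and W
        pooled = [
            F.adaptive_max_pool2d(x, output_size=n).flatten(start_dim=1)
            for n in self.levels
        ]
        return torch.cat(pooled, dim=1)  # (batch, channels * (16 + 4 + 1))

spp = SpatialPyramidPooling()
for h, w in [(13, 13), (18, 24)]:
    feats = torch.randn(2, 256, h, w)   # conv5-style feature maps of different sizes
    print(spp(feats).shape)             # always torch.Size([2, 5376])
```

Here adaptive pooling is a shortcut for the window/stride arithmetic in the paper (window ≈ ceil(a/n), stride ≈ floor(a/n) for an a×a feature map and an n×n output grid); either way, the output length is 256 × (16 + 4 + 1) = 5376 no matter what the input resolution is.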
The greatest advantage of SPP is that it preserves features at different granularity levels. It also makes it possible to train with images of different sizes.
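The paper implements this with a simple schedule: one set of weights is trained while the input resolution alternates from epoch to epoch (224x224 and 180x180 in the paper). A hedged sketch of that idea follows; `make_loader` and `train_one_epoch` are hypothetical placeholders, not a real API.

```python
# Multi-size training sketch: same weights, alternating input resolution per epoch.
sizes = [224, 180]
num_epochs = 10

for epoch in range(num_epochs):
    size = sizes[epoch % len(sizes)]        # switch resolution each epoch
    # loader = make_loader(train_set, resize_to=size)   # hypothetical data pipeline
    # train_one_epoch(model, loader)                    # same weights for both sizes
```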
SPP as a concept is an extension of Bag-of-Words; it was brought to CNNs in 2014 by He et al. (Microsoft) in “Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition”.
Key Insights During Training
Here is an illustration of error rates for multiple models with SPP and multi-size training. As can be seen, SPP and multi-size training additively improve model accuracy.
Background Information
- Authors: Kaiming He et al. (Microsoft)
- Competitions: 2nd place in ILSVRC 2014 object detection, 3rd place in ILSVRC 2014 classification
- Datasets:
Dataset | Task | Size | Partitioning |
---|---|---|---|
ImageNet 2012 | Classification | 1.2M+ | Train: 1.2M, Val: 50K, Test: 100K |
Pascal VOC 2007 | Detection | ~10K | Train/Val/Test |
Caltech101 | Classification | ~9K | Train: 30/category, Rest: Test |
- Comparisons with SOTA models (2014)
Model | Year | Classification Accuracy (ImageNet) | Detection mAP (Pascal VOC 2007) | Key Innovations |
---|---|---|---|---|
SPP-net | 2014 | 84.5% | 59.2% | Spatial pyramid pooling, No fixed-size input |
AlexNet | 2012 | 81.8% | 58.0% | First deep CNN on ImageNet |
VGG | 2014 | 86.8% | 66.3% | Deep, uniform 3x3 convolutional layers |
GoogLeNet (Inception) | 2014 | 89.2% | 62.1% | Inception modules, deeper network |
R-CNN | 2014 | - | 53.3% | Region proposals, CNN for detection |
Overfeat | 2014 | 81.9% | 53.9% | Multi-scale sliding windows |
Ablation Study
An ablation study in deep learning selectively “ablates” (removes) a part of a model, e.g., a layer or an activation function, to study its impact on overall performance.
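A hedged sketch of how such a study might be set up (reusing the `SpatialPyramidPooling` class from the sketch above; `build_model` and `train_and_evaluate` are hypothetical, and the layer sizes are made up): build two otherwise identical models, one with the component under study and one without, train both the same way, and compare validation metrics.

```python
# Ablation sketch: identical models except for the pooling component under study.
import torch.nn as nn

def build_model(use_spp: bool) -> nn.Module:
    backbone = nn.Sequential(nn.Conv2d(3, 256, kernel_size=3, padding=1), nn.ReLU())
    if use_spp:
        pool = SpatialPyramidPooling()                 # component under study (defined above)
        head = nn.Linear(256 * (16 + 4 + 1), 10)
    else:
        pool = nn.Sequential(nn.AdaptiveMaxPool2d(7), nn.Flatten())
        head = nn.Linear(256 * 7 * 7, 10)
    return nn.Sequential(backbone, pool, head)

# for use_spp in (True, False):
#     model = build_model(use_spp)
#     acc = train_and_evaluate(model)    # hypothetical training/evaluation loop
#     print(f"SPP={use_spp}: val accuracy {acc:.3f}")
```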