Motivation
UNet has a restricted receptive field. That is sufficient for identifying local regions such as tumors in medical images, but it is not sufficient when the network has to learn from larger image patches.
- DeepLab V1 & V2 (2016) [1]: reviewed together because both use atrous convolution, also called "dilated convolution", and a fully connected conditional random field (CRF). V1 used only VGG16, while V2 used both VGG16 and ResNet.
- DeepLab V3 (2017) [2]
- DeepLab V3+ (2018): achieved SOTA on Pascal VOC 2012.
The Use of Dilated Convolution
- Some notes about dilated convolution are here
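- An interesting question is: why don't we use a larger kernel? Since the VGG networks (2014), using `3x3` kernels has become the trend. That's because:
  - A `7x7` kernel is more than 5 times larger than a `3x3` kernel (49 weights versus 9).
  - Stacked `3x3` layers introduce more non-linearity than a single `7x7` layer, because every layer is followed by its own activation. Three stacked `3x3` layers cover the same `7x7` receptive field, which allows the network to learn more complex landscapes.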
- Fun fact: after 2014, `3x3` seems to have become the most popular kernel size.
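
The kernel-size arithmetic above can be checked with a few lines of plain Python. This is only a rough sketch: the helper names (`conv_weights`, `effective_kernel_size`, `stacked_receptive_field`) are made up for illustration, it assumes square kernels, stride 1, a single input and output channel, and it ignores biases.

```python
def conv_weights(kernel_size, in_channels=1, out_channels=1):
    """Number of weights (biases ignored) in a square 2D convolution."""
    return kernel_size * kernel_size * in_channels * out_channels


def effective_kernel_size(kernel_size, dilation=1):
    """Span covered by a dilated kernel: k + (k - 1) * (d - 1)."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


def stacked_receptive_field(kernel_size, num_layers, dilation=1):
    """Receptive field of identical stride-1 conv layers stacked in sequence."""
    k_eff = effective_kernel_size(kernel_size, dilation)
    return 1 + num_layers * (k_eff - 1)


# One 7x7 kernel versus one 3x3 kernel: 49 weights versus 9.
print(conv_weights(7), conv_weights(3))

# Three stacked 3x3 layers: 27 weights in total, 7-pixel-wide receptive field.
print(3 * conv_weights(3), stacked_receptive_field(3, num_layers=3))

# A single 3x3 kernel with dilation 3 also spans 7 pixels, with only 9 weights.
print(conv_weights(3), effective_kernel_size(3, dilation=3))
```

Under these assumptions, a dilated `3x3` kernel reaches the span of a `7x7` kernel without adding weights, which is one way to read the motivation for dilated convolution in this section.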
ASPP
TODO