Continuing my journey through The 9 Deep Learning Papers You Need to Know About, I read “Very Deep Convolutional Networks for Large-Scale Image Recognition” by Simonyan and Zisserman, 2015. In this paper, the authors experiment with very simple, very deep neural networks. The AlexNet and ZFNet architectures use a larger filter in the first convolutional layer than in the rest of the network; the authors of this paper instead opt to use only 3x3 filters with same padding and a stride of 1 throughout.
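One nice property of that 3x3 / same padding / stride 1 recipe is that every convolutional layer preserves the spatial size of its input, so layers can be stacked freely without shrinking the feature maps. A minimal sketch of that arithmetic (the 224 input size matches the paper's ImageNet setup; the function name is my own):

```python
def conv_output_size(in_size, kernel=3, stride=1, pad=1):
    """Spatial output size of a convolution.

    With kernel=3, stride=1, pad=1 ("same" padding for a 3x3 filter),
    the output size equals the input size.
    """
    return (in_size + 2 * pad - kernel) // stride + 1


size = 224  # ImageNet-style input resolution
for _ in range(5):
    size = conv_output_size(size)  # stays 224 at every layer
print(size)
```

Because depth no longer costs spatial resolution, pooling layers alone control when the feature maps shrink, which is what lets the VGG networks go so deep.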
From what I can gather, the authors were able to achieve very competitive results on the ImageNet dataset with this simple, deep architecture for two main reasons. The first is that using small filters and lots of depth acts as a form of regularization: a stack of small filters covers the same receptive field as one large filter but with fewer parameters. The second is that every extra layer adds another nonlinearity, which makes the decision function more discriminative. It’s important to use 3x3 filters rather than 1x1 filters, though, because 3x3 is the smallest size with a “non-trivial receptive field,” meaning each filter actually captures what its neighbors are doing; a 1x1 filter sees only a single pixel, although the nonlinearity it introduces does still help somewhat.
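The regularization argument can be checked with a little arithmetic: stacking 3x3 convolutions grows the receptive field by 2 per layer, so two stacked 3x3 layers see the same 5x5 region as a single 5x5 filter, and three see a 7x7 region, yet with noticeably fewer weights. A quick sketch (helper names are mine; the parameter count ignores biases and assumes equal input and output channels, as in the paper's comparison):

```python
def stacked_receptive_field(n_layers, kernel=3):
    """Receptive field of n stacked kxk convs with stride 1:
    it grows by (k - 1) with each additional layer."""
    rf = 1
    for _ in range(n_layers):
        rf += kernel - 1
    return rf


def conv_params(kernel, channels):
    """Weight count of one kxk conv with `channels` in and out (biases ignored)."""
    return kernel * kernel * channels * channels


C = 64
print(stacked_receptive_field(2))   # same coverage as a single 5x5 filter
print(2 * conv_params(3, C))        # two 3x3 layers: 18 * C^2 weights
print(conv_params(5, C))            # one 5x5 layer:  25 * C^2 weights
```

So the stacked version gets the same coverage with fewer parameters and an extra nonlinearity in between, which is exactly the trade the paper is exploiting.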
One thing I found useful about reading this paper was seeing how the simplicity of the network made it easier to see the results of their depth experiment. Other papers I have read also conclude that depth is very important, but by holding everything else fixed and focusing squarely on depth, and by making great strides in accuracy over the benchmark, this paper gives you a real sense of just how important depth is in CNNs. And simplicity.