Don't Use Dropout To Regularize Convolutional Layers

After getting terrible results when I added dropout regularization to my model for classifying heart attacks from ECGs, I found this blog post, which has a title similar to this article's. It argues that dropout should be used for fully connected layers only. That matches the models used in VGGNet, ZFNet, and GoogLeNet, which only apply dropout in the final layers of the model. The blog post says batch normalization is a better bet for convolutional layers. Batch Norm normalizes each layer's outputs over the batch to mean 0 and standard deviation 1, and then applies a learned scale and shift. Here's a good article that talks about Batch Normalization.
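To make that concrete, here's a minimal Keras sketch of the dropout placement the post recommends. The input shape, filter counts, and dense width are placeholders, not my actual ECG model.

# Minimal sketch: dropout only on the fully connected head, none in the conv blocks.
# Input shape, filter counts, and dense width are placeholders.
from tensorflow.keras import layers, models

def build_model(input_shape=(128, 128, 1), dropout_rate=0.5):
    inputs = layers.Input(shape=input_shape)

    # Convolutional blocks: no dropout here
    x = layers.Conv2D(16, 3, padding="same", activation="relu")(inputs)
    x = layers.MaxPooling2D()(x)
    x = layers.Conv2D(32, 3, padding="same", activation="relu")(x)
    x = layers.MaxPooling2D()(x)

    # Fully connected head: this is the only place dropout goes
    x = layers.Flatten()(x)
    x = layers.Dense(64, activation="relu")(x)
    x = layers.Dropout(dropout_rate)(x)
    outputs = layers.Dense(1, activation="sigmoid")(x)

    return models.Model(inputs, outputs)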

I wonder if using dropout was part of what caused the strange learning curves I got during my dropout experiment, where I tried dropout fractions of 0.01, 0.33, and 0.67 for inception networks with 2 and 3 layers and with 16 and 32 filters.  Some of the models just overfit horribly:

-----------------
hyperparameters: 
	batch_size: 50
	epochs: 25
	dropout_rate: 0.01
	learning_rate: 0.01
	num_layers: 4
	num_filters: 4
[learning curve plots for this run]


Others had this strange shape, which I suspect is due to a vanishing gradient.  Basically, there were some big activations in the first epoch, which saturated the final sigmoid activation, so almost no gradient flowed back through it and the network stopped learning.

hyperparameters: 
	batch_size: 50
	epochs: 25
	dropout_rate: 0.33
	learning_rate: 0.01
	num_layers: 2
	num_filters: 4
[learning curve plots for this run]
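To illustrate why a saturated sigmoid would stall learning like this, here's a quick numerical check (illustrative only, not taken from my training runs): the sigmoid's gradient is sigmoid(z) * (1 - sigmoid(z)), which collapses toward zero once the pre-activation z gets large.

# Quick check: sigmoid's gradient is essentially zero for large pre-activations,
# so big early activations leave the final layer with almost no gradient signal.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for z in [0.0, 2.0, 10.0, 30.0]:
    s = sigmoid(z)
    print(f"z={z:5.1f}  sigmoid(z)={s:.6f}  gradient={s * (1 - s):.2e}")
# At z = 30 the gradient is on the order of 1e-13, so almost nothing flows back.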

I am currently working on a network that uses BatchNormalization before the activation in each of the convolutional layers of the model.  I chose to normalize before the activation after reading this Stack Overflow post.
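Here's a sketch of the ordering I mean (Conv, then BatchNorm, then the activation). The filter count is a placeholder, and I drop the conv bias because Batch Norm's learned shift makes it redundant.

# Conv -> BatchNorm -> Activation ordering; filter count is a placeholder.
# use_bias=False because Batch Norm's learned shift (beta) makes the conv bias redundant.
from tensorflow.keras import layers

def conv_bn_relu(x, filters, kernel_size=3):
    x = layers.Conv2D(filters, kernel_size, padding="same", use_bias=False)(x)
    x = layers.BatchNormalization()(x)   # normalize the pre-activations
    x = layers.Activation("relu")(x)     # non-linearity comes after the normalization
    return x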

I know that I shouldn’t change multiple things at once, but after reading the GoogLeNet paper, I saw that the authors use MaxPooling layers between inception modules, which reduces the spatial dimensions of the feature maps.  This should let me train deeper networks and may help with overfitting.
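Here's a rough sketch of what that would look like, assuming a simplified inception module (the real GoogLeNet modules have 1x1 bottlenecks and a pooling branch, which I've left out, and the filter counts are placeholders).

# Stride-2 MaxPooling between inception-style modules halves the spatial
# dimensions, as in GoogLeNet. Module internals and filter counts are simplified.
from tensorflow.keras import layers

def inception_module(x, filters):
    # Parallel 1x1, 3x3, and 5x5 convolutions, concatenated along the channel axis
    b1 = layers.Conv2D(filters, 1, padding="same", activation="relu")(x)
    b3 = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    b5 = layers.Conv2D(filters, 5, padding="same", activation="relu")(x)
    return layers.Concatenate()([b1, b3, b5])

def stack_modules(x, num_modules=3, filters=16):
    for _ in range(num_modules):
        x = inception_module(x, filters)
        # Halve height and width before the next module
        x = layers.MaxPooling2D(pool_size=3, strides=2, padding="same")(x)
    return x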