Don't Use Dropout To Regularize Convolutional Layers

After getting terrible results when I added dropout regularization to my model for classifying heart attacks from ECGs, I found this blog post, which has a similar title to this article. It says dropout should be used for fully connected layers only. That conforms with the models used in VGGNet, ZFNet, and GoogLeNet, which only use dropout in the final layers of the model. The blog post says batch normalization is a better bet. Batch Norm normalizes each layer's outputs to zero mean and unit standard deviation. Here's a good article that talks about Batch Normalization.
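To make that concrete, here is a minimal, standard-library-only sketch of the core normalization step. This is just my own illustration (the `batch_norm` helper is a name I made up), and it ignores Batch Norm's learned scale and shift parameters and the running statistics used at inference time:

```python
def batch_norm(values, eps=1e-5):
    """Normalize a batch of activations to roughly mean 0, std 1.

    eps guards against division by zero when the batch variance is tiny.
    """
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    return [(v - mean) / (var + eps) ** 0.5 for v in values]

normalized = batch_norm([1.0, 2.0, 3.0, 4.0])
# `normalized` has mean ~0 and standard deviation ~1
```

The real layer then applies a learned scale (gamma) and shift (beta) so the network can undo the normalization if that helps.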

I wonder if using dropout contributed to the strange learning curves I got during my dropout experiment, where I tried dropout fractions of 0.01, 0.33, and 0.67 for inception networks with 2 and 3 layers and with 16 and 32 filters. Some of the models just overfit horribly:

	batch_size: 50
	epochs: 25
	dropout_rate: 0.01
	learning_rate: 0.01
	num_layers: 4
	num_filters: 4
[learning curve plots]

Others had this strange shape, which I suspect is due to a vanishing gradient. Basically, there were some big activations in the first epoch, and the network was not able to find a gradient through the final sigmoid activation layer.

	batch_size: 50
	epochs: 25
	dropout_rate: 0.33
	learning_rate: 0.01
	num_layers: 2
	num_filters: 4
[learning curve plots]

I am currently working on a network where I use BatchNormalization before the activation in each of the convolutional layers of the model. I chose to do the normalization before the activations after reading this Stack Overflow post.

I know that I shouldn’t change multiple things at once, but after reading the GoogLeNet paper, I saw that the authors use a MaxPooling layer between each inception module, and that this reduces the width of their convolution volumes.  This will allow me to train deeper networks and may help with overfitting.

Reading Classic Neural Network Papers: GoogLeNet Paper

“Going Deeper With Convolutions,” by Szegedy et al. (2015) is a great read, because it describes a very thoughtful architecture. Reading and thinking about it gives you a new perspective on neural network architectures, because it provides a new way to combine layers and their outputs. The main idea is that, in deeper layers of a neural network, it might be helpful to use filters bigger than 3x3, because the neurons in those layers cover more of the image in the abstractions they are making.

There were a couple other notable observations in this paper for me.  One was that the authors use 1x1 convolutions for dimensionality reduction (and increasing nonlinearity as a side effect).  Here is a great blog post about 1x1 convolutional networks.
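To see why a 1x1 convolution can reduce dimensionality, it helps to notice that it is just a per-pixel linear map across channels: spatial size is untouched, only depth changes. Here is a toy, plain-Python sketch of my own (the `conv1x1` helper and the tiny feature map are made up for illustration):

```python
def conv1x1(feature_map, filters):
    """1x1 convolution: mixes channels at each pixel independently.

    feature_map: H x W x C_in nested lists; filters: C_out x C_in weights.
    The output keeps the H x W spatial size but has C_out channels.
    """
    return [[[sum(w * px[c] for c, w in enumerate(filt)) for filt in filters]
             for px in row]
            for row in feature_map]

# A 2x2 "image" with 4 input channels, reduced to 2 output channels
fmap = [[[1, 2, 3, 4], [0, 1, 0, 1]],
        [[2, 2, 2, 2], [1, 0, 0, 1]]]
filters = [[0.25, 0.25, 0.25, 0.25],   # averages the input channels
           [1, 0, 0, 0]]               # picks out the first channel
out = conv1x1(fmap, filters)
# out[0][0] == [2.5, 1]; spatial size is still 2x2, depth went 4 -> 2
```

In GoogLeNet this trick is used before the expensive 3x3 and 5x5 convolutions to shrink the channel count, and each 1x1 conv is followed by a ReLU, which is where the extra nonlinearity comes from.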

Another was that the researchers used a single 7x7 average pooling layer instead of a stack of fully connected layers before generating the output. They cite the “Network in Network” paper, by M. Lin, Q. Chen, and S. Yan (2013), to justify this replacement; I have only read the abstract, but it looks interesting. Supposedly average pooling makes the model more interpretable and less prone to overfitting.

While there are several other observations I could make, I want to note the one most relevant to my work on the heart attack detection from ECG project. After each inception module, the researchers use a max-pooling layer that cuts the spatial dimensions in half. This is very important, because the inception module requires same padding for each convolution inside it, so you can end up with way too many parameters if you aren't cutting the width of your data at each stage. (1x1 convolutions would allow you to modify the depth.)
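The halving is easy to see in isolation. Here is a stdlib-only sketch of 2x2 max pooling with stride 2 (my own toy helper, not the paper's code), which is the operation that keeps the spatial width under control between modules:

```python
def max_pool_2x2(grid):
    """2x2 max pooling with stride 2: halves each spatial dimension."""
    return [[max(grid[i][j], grid[i][j + 1],
                 grid[i + 1][j], grid[i + 1][j + 1])
             for j in range(0, len(grid[0]), 2)]
            for i in range(0, len(grid), 2)]

g = [[1, 3, 2, 0],
     [4, 2, 1, 1],
     [0, 0, 5, 6],
     [1, 2, 7, 8]]
pooled = max_pool_2x2(g)
# pooled == [[4, 2], [2, 8]]: a 4x4 map becomes 2x2
```

Since same padding keeps each inception module's output the same width as its input, this pooling step is the only thing shrinking the feature maps as the network gets deeper.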

My overall conclusion is that this paper flings open the door to novel ways of combining layers in parallel as well as in stacks.  While I am experimenting with inception modules in my own work, it’s exciting to look forward to things like LSTMs and ResNets, which also combine layers and outputs of layers using novel logics with novel hypotheses.

Reading Classic Deep Learning Papers: VGGNet

Continuing my journey through The 9 Deep Learning Papers You Need to Know About, I read “Very Deep Convolutional Networks for Large-Scale Image Recognition,” by Simonyan and Zisserman (2015). In this paper, the authors conduct an experiment with very simple, deep neural networks. The AlexNet and ZFNet architectures use a larger filter in the first convolutional layer than in the rest of the network. The authors of this paper instead opt to use only 3x3 filters with same padding and a stride of 1.

From what I can gather, the authors were able to achieve very competitive results on the ImageNet dataset with this simple, deep architecture largely because using small filter sizes and lots of depth imposes regularization. However, it's important to use 3x3 filters as opposed to 1x1 filters, because the slightly larger fields have a “non-trivial receptive field,” meaning that they actually capture what their neighbors are doing. The nonlinearity introduced by a 1x1 filter does help somewhat, though.
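The receptive-field arithmetic behind this is worth checking: a stack of stride-1 3x3 convolutions sees a patch that grows by 2 pixels per layer, so two of them cover a 5x5 patch and three cover 7x7, with fewer parameters and more nonlinearities than one big filter. A quick sketch (the `receptive_field` helper is my own, assuming stride 1 throughout):

```python
def receptive_field(num_layers, kernel=3):
    """Receptive field of num_layers stacked stride-1 convolutions.

    Each layer adds (kernel - 1) pixels to the patch an output neuron sees.
    """
    return 1 + num_layers * (kernel - 1)

# Two stacked 3x3 convs stand in for a 5x5 filter; three for a 7x7.
assert receptive_field(2) == 5
assert receptive_field(3) == 7
```

For comparison on parameter count: three 3x3 layers with C channels cost 3 * (3 * 3 * C * C) = 27C^2 weights, versus 49C^2 for a single 7x7 layer covering the same patch.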

One thing I found useful about reading this paper was seeing how the authors used simplicity in their network to make the results of their depth experiment easier to interpret. While other papers I have read also conclude that depth is very important, by reducing other factors, really focusing on depth, and making great strides in accuracy over the benchmark, they give you an idea of just how important depth is in CNNs. And simplicity.

Reading Classic Deep Learning Papers: ZFNet Paper

After reading all three articles in this fantastic introduction to convolutional neural networks, I decided to read all of the papers mentioned in the third article of the series, “The 9 Deep Learning Papers You Need To Know About.”

The paper I read today is called: “Visualizing and Understanding Convolutional Networks.”

Yesterday I read the AlexNet paper, “ImageNet Classification with Deep Convolutional Neural Networks,” which this paper builds upon. The authors of the ZFNet paper dive into the inner workings of the convolutional neural network that the AlexNet paper describes, and by understanding those inner workings, they were able to tease out some improvements to it.

Overall, this paper presents several inquiries into the fundamental inner workings of a convolutional neural network. For example, the authors visualize what's going on inside the filters of a hidden layer. They found that each layer of the convnet makes increasingly higher-level observations about the image. Using the same tool, they visualize the way different layers change during training. They also experiment with rotating, scaling, and translating the same image to see how the features produced at different layers are affected, and with blocking out parts of the image to see how that affects hidden-layer activations and output probabilities. This paper is a good read for understanding and visualizing the way layers in a convolutional neural network build correspondences between object parts in the layers below.

Establishing A Baseline For Classifying Heart Attacks From ECG Data Using Convolutional Neural Networks

I recently read a paper in which the researchers were able to achieve up to ~90% sensitivity and specificity using a convolutional neural network. I tried to replicate their research, and I did not do as well. Despite using a similar architecture to the one presented in the paper, my models overfit big time. The one thing I am aware of from their paper that I did not do is downsample my data, largely because I could not parse the method they used to smooth and reduce the size of their data. They cite another paper for directions on downsampling the ECG data, but I am not a member of an academic institution, so I could not get access.

In the model that I am working on, I use layered inception modules like those used in the GoogLeNet Paper. I grid searched through models with every combination of the following:

  • Number of layers: 1, 2, 4

  • Number of filters in the tower of each layer: 30, 15

I trained each of these 6 models using a batch size of 50 and a learning rate of 0.001.  
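My grid search just enumerates the cross product of those two hyperparameter lists. A minimal sketch of how I build the six configurations (names like `num_layers` are from my own code; the model-building function itself is omitted):

```python
from itertools import product

# The two hyperparameter axes from the grid search above
layers_options = [1, 2, 4]
filter_options = [30, 15]

configs = [{"num_layers": n,
            "num_filters": f,
            "batch_size": 50,
            "learning_rate": 0.001}
           for n, f in product(layers_options, filter_options)]
# len(configs) == 6; each dict would be handed to a model-building function
```

Each config dict then gets passed to a function that builds and trains one model, so adding a new axis to the search is just another list in the `product` call.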

The models with more than one layer all overfit to the training set. For example, the model with the following parameters reached around 90% accuracy on the training set, but performed barely better than random on the validation set.

  • number of filters: 15

  • number of layers: 2


The single-layer models did not overfit as much, though they still overfit. Here are the parameters and results for the best of these single-layer models.

  • number of filters: 30

  • number of layers: 1


To improve on this work, I need to work hard on getting the overfitting under control.  Here are some options for doing that:

  • Augment my training dataset. I could do that by slicing all of my samples in half again so that they are each 5 seconds long, and I could probably do that twice if necessary. As long as I am only using a small portion (0.15) of that data for validation, each round of augmentation would more than double the size of my training set. I could also horizontally translate my data.

  • Simplify the model. Right now I am using a several-layer-deep inception-module network. I am doing this because I'm not sure what size filters to use or when to use pooling layers. I admit that that is not the most advanced thinking about model architecture decisions.

  • Add dropout regularization in between the inception modules. According to the AlexNet paper, this reduces “complex co-adaptations” between neurons, “since a neuron cannot rely on the presence of particular other neurons.”
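The slicing idea in the first option is simple enough to sketch. Here is a hypothetical, stdlib-only version (`slice_in_half` is my own name; real ECG windows would be arrays of floats, and the labels would need to be duplicated alongside the samples):

```python
def slice_in_half(samples):
    """Split each fixed-length sample into two half-length samples.

    Doubles the dataset size, e.g. 10-second ECG windows -> 5-second windows.
    The corresponding label list must be doubled the same way.
    """
    halved = []
    for s in samples:
        mid = len(s) // 2
        halved.append(s[:mid])
        halved.append(s[mid:])
    return halved

data = [[0.1] * 10, [0.2] * 10]   # two toy 10-sample "recordings"
augmented = slice_in_half(data)
# len(augmented) == 4, and every slice has length 5
```

Applying it twice would quadruple the training set, at the cost of each sample carrying less temporal context.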

Framework For Detecting Heart Attacks From ECG Data Using Neural Networks

I built a framework for training neural networks to detect heart attacks from ECGs.  

Here’s a link:

This tool allows you to build models easily and configure a hyperparameter search strategy. I mostly created this tool for my own explorations, but I have designed it to be easy for a curious AI beginner like me to use.

Discussion on the Process of Building This Tool

Building this framework allowed me to merge many of my skills and interests. First and foremost, this project lets me practice building neural networks using TensorFlow and Keras. However, in order to train Keras models on this dataset, I had to first get the ECG data, which I got from a public database of digitized ECG signals. Unfortunately, these digitized ECGs are very different from those you might have taken in an ambulance or a hospital. In addition to the fact that those are usually printed on a piece of paper, this dataset has 1000 samples per second, whereas the typical EKG machine samples at 40 Hz. But this is the only publicly available data I could find, so I went with it.

As soon as I started training on that data, I ran into time and memory issues on my Mac. The models took too long to train and required too much memory, so I learned how to configure a Google Cloud Compute instance with a GPU. This last task required re-learning how to use Docker. The Keras FAQ says that Keras automatically works if you have a GPU attached to your instance. I took this to mean that all I had to do was start a Google Cloud Compute instance with a GPU attached as specified in the Google Cloud Compute docs, but it turns out there is some GPU software that you need to have installed on your instance in order for Keras to automatically use your GPU. I tried following this blog post, but it was taking forever to copy the Nvidia software around in the Starbucks where I was working, and I couldn't imagine doing that every time I want to start one of these instances (I turn my instance off when I am not using it). It turned out that these instructions from Nvidia on how to create a GCP instance with a GPU and all the right software for deep learning are pretty easy to follow. Then I found that TensorFlow provides a Docker image that lets you use a GPU provided you have the Nvidia software installed, and it was a cinch to get up and running.

Current Research

The most recent model that I’ve worked on is stored in `models/`.  It is based on the work in this paper, where the authors used a single-layer inception module.  I am currently experimenting with different numbers of filters and inception modules.

In Search Of Songs You Love: The Current State Of My Spotify Analytics Project

I want to make a music discovery tool that helps you find songs that you *love*, not just songs that you like.  In my quest for that tool, I decided to rely on the fact that I generally discover my favorite music through my friends--not through musical similarity to songs that I already listen to.

I went with this path because I don't have a prototypical type of music that I like.  I like some music from many different genres, and I'm not sure how to pinpoint the features of music that tend to make me like it.  One thing I do know, though, is that if a lot of people whose musical opinions I trust are listening to a track, it's worth at least giving it a listen myself.

In order to make an algorithm that replicates this process, you really need three ingredients.
1. You need to know what kind of music you like.  What is the input to this algorithm?
2. You have to know what your friends are listening to.
3. You have to know which of your friends are the best authorities on music.

Since I have been a user of Spotify since they first came out in America, and since they have a pretty good programming interface for looking up information about songs (which I've used in a different attempt at a music recommendation algorithm), I decided to base my algorithm on Spotify data.

The problem is that, while Spotify provides a programming interface that lets you look up all kinds of information about a user and about a user's playlists and saved songs (once that user has given you permission), they do not provide any interface to the "social graph"--that is, there is no easy way to find out who a user "follows."  In order to get that data, I wrote a program that visits the online Spotify web player and performs the following algorithm:

  1. Store all desired listening/playlists data for a user (call that user "user0").
  2. Visit the "following" tab for user0.
  3. For each user that user0 follows, store a link between those two users.
  4. Add the user id of each user that user0 follows to a queue.
  5. For each user in the queue, perform steps 1-4.
  6. Stop adding users to the queue once they reach a specified distance (i.e., number of links) from user0.  In my case, I did not collect any users that were further than 2 links away from the first user, because it would have taken thousands of years or hundreds of dollars to deal with that much data.
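The steps above amount to a breadth-first search with a depth cutoff. Here is a sketch of the crawl loop (the `get_follows` callback stands in for the actual scraping of a user's "following" tab, and all the names here are my own, not from the real program):

```python
from collections import deque

def crawl_follows(start_user, get_follows, max_depth=2):
    """Breadth-first crawl of the 'following' graph up to max_depth links.

    get_follows(user) is a stand-in for scraping that user's following tab.
    Returns (follower, followed) edges discovered within the cutoff.
    """
    links = []
    depth = {start_user: 0}
    queue = deque([start_user])
    while queue:
        user = queue.popleft()
        if depth[user] >= max_depth:
            continue  # don't expand users at the distance cutoff
        for friend in get_follows(user):
            links.append((user, friend))
            if friend not in depth:
                depth[friend] = depth[user] + 1
                queue.append(friend)
    return links

# A toy follow graph in place of the scraper
follows = {"user0": ["a", "b"], "a": ["c"], "b": [], "c": ["d"]}
links = crawl_follows("user0", lambda u: follows.get(u, []))
# "c" sits exactly 2 links out, so it is stored but never expanded,
# and the (c, d) edge is never collected
```

A real crawler would also persist each user's playlist data at step 1 before moving on, but the queue-and-depth bookkeeping is the core of it.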

Once I had the social graph stored, the next step was to specify the criteria by which I would say that a user "likes" a track.  I ultimately decided to use a pretty poor indicator: if a track appears in one of the user's "public playlists," then they like it.  The reason this is not a great metric is that there's no way to know why somebody put a song on a public playlist.  It could be a playlist of songs that they hate, for example, or a playlist of songs that they've generated using a crappy music recommendation algorithm.  It would be much better to know the top tracks that a user listens to.  That is going to be the goal of a future project.  I decided to go with public playlists because you don't need any permissions to see those public playlists, so the logistics of gathering that data are much easier.  (To get private playlists or user listening data, I would have to build an entire website and then convince all of my friends to log into it.)

The final question for this iteration of my music recommendation algorithm is, given the data set acquired above, how do I get a machine to make recommendations?  The algorithm should basically look at what your friends are listening to, look at what you listen to, and find the songs you don't yet listen to that co-occur most often, in the libraries of your friends, with songs you do listen to.  In a later version of this algorithm, I will take into account the "authority" of recommendations by different users in the network, but for this simple implementation I did not bother.

The recommendation algorithm used a strategy called "bipartite projection."  A network where there are two types of entities (in this case, "users" and "tracks") and the only links in the network are between entities of different types is called a "bipartite graph."  The idea of bipartite projection is to compress the link information of the whole network onto one of the sets of entities.  In this case, I projected the network onto the track set by making a stronger link between tracks that are liked by many of the same users and a weaker link between tracks that do not have many listeners in common.

The algorithm:
For every track i:
    For every other track j in the network:
        For every user in the network:
            If the user likes both track i and track j, increase the score linking i to j in inverse proportion to the number of tracks that the user likes.  So if a user doesn't like very many tracks, his vote is worth slightly more than that of somebody who likes a lot of tracks.

What this calculation ultimately creates is a matrix where each entry tells you: if every track starts with some amount of "recommendation capacity," the entry tells you the fraction of that capacity allocated to the other track.  In order to take into account a user's authority, I will eventually modify the recommendation capacity, but for this implementation, I simply set it to 1 for everybody.
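Here is a compact sketch of that projection step. This is my own toy re-implementation of the idea, not the actual project code (names like `project_on_tracks` are made up), and it stores the matrix as a sparse dict rather than a full array:

```python
def project_on_tracks(likes):
    """Bipartite projection onto the track set.

    likes maps user -> set of tracks that user likes. Each user spreads
    1 unit of "recommendation capacity" evenly over their liked tracks,
    so weight[(i, j)] = sum over users liking both i and j of 1/len(likes[u]).
    """
    weight = {}
    for user, tracks in likes.items():
        share = 1.0 / len(tracks)   # a prolific liker's vote counts for less
        for i in tracks:
            for j in tracks:
                if i != j:
                    weight[(i, j)] = weight.get((i, j), 0.0) + share
    return weight

likes = {"u1": {"t1", "t2"}, "u2": {"t1", "t2", "t3"}}
w = project_on_tracks(likes)
# w[("t1", "t2")] == 1/2 + 1/3: u1 contributes 1/2, u2 contributes 1/3
```

Iterating over users rather than over track pairs avoids the all-pairs-times-all-users loop of the pseudocode, since only tracks sharing a listener ever get a nonzero entry.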

The similarity matrix lets you generate a ranked list of recommended tracks from a list of input tracks.  To make a recommendation based on a list of tracks:

For each track in the network:
    Initialize a score of 0.
    For each track l in your list of tracks:
        Increase the score by the recommendation capacity allocated from l, looked up at the coordinates of the two tracks in the similarity matrix.
Return the top n tracks with the highest scores.
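As a sketch, the scoring step looks something like this (again my own hypothetical names, using the sparse dict-of-pairs matrix from the projection step):

```python
def recommend(seed_tracks, weight, top_n=3):
    """Rank tracks by capacity allocated to them from the seed list.

    weight maps (track_i, track_j) -> projected similarity. Tracks already
    in seed_tracks are excluded from the results.
    """
    scores = {}
    for (i, j), w in weight.items():
        if i in seed_tracks and j not in seed_tracks:
            scores[j] = scores.get(j, 0.0) + w
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# A hand-made similarity matrix for illustration
weight = {("t1", "t2"): 0.8, ("t1", "t3"): 0.5, ("t2", "t3"): 0.2}
recs = recommend({"t1"}, weight)
# recs == ["t2", "t3"]: t2 gets the most capacity from the seed track
```

With the sparse representation, scoring only touches entries whose first coordinate is a seed track, which sidesteps building the full matrix in memory.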

In my initial tests, this algorithm works pretty well, but it's a little slow, and you can't build the whole matrix in memory using the implementation I wrote.  The recommendations are good, though.  The character of the tracks is pretty well preserved: a list of rock songs will yield mostly rock, and a list of rap songs will return mostly rap.  And the songs that are recommended are new.

The next thing I would like to do is run this algorithm based on top 100 songs for each user.  Spotify created a "Top 100 songs of 2016" for every user, but in order to get those songs, I will have to make a website and convince people to let me scrape their spotify libraries.

Give Consumers the Keys to Machine Learning

Just a thought that keeps occurring to me: recommendation systems for things like music and movies would be much more effective if they were more explicit in their data gathering.  Collaborative filtering is amazing and incredibly powerful, and there are incredible ways to make recommendations based on the features of an item on its own--without even considering what other similar people like.

Unfortunately consumers have absolutely no access to the parameters that control their recommendation systems.  When I use Netflix for example, I often find something that I will like eventually, but at no point does Netflix ask me to participate explicitly in my own recommendation process.  I don't rate the movies that I watch, and I think most people don't, because I don't think they realize that the machine learning algorithms in place to help them find movies that they will love depend quite a bit on their input.

As another example, when I use Spotify radio, my mood changes during the hour or so that I am listening to the radio.  The station is not designed to change with time in a way that reflects my mood, but rather to basically degrade as I start vetoing songs that I am tired of hearing.  If there were a way for me to, say, move a slider that corresponds to the "songs with higher energy" data field that no doubt is being used in the algorithm that is governing my music listening, I think I would be able to help myself and Spotify out in generating better personal recommendations for myself.

No. 1 Challenge to Automated Chemical Synthesis

At the end of this review on the outlook of automated chemical synthesis (titled “Organic synthesis: The robo-chemist”; you may have trouble accessing it, because Nature is weird about open access), Dr. Grzybowski says “the only thing that can kill [the effort to make an automated synthesis machine] is scepticism.”  I disagree.  The thing that will kill the effort is closed scientific research.

Just being able to access further research about all of this is a big enough hurdle in itself, but I have read what I can, and here is my reaction to the professor's claim.

Dr. Grzybowski and his underlings have successfully built and demonstrated the efficacy of new software that agglomerates abstracts housed on Reaxis, a closed-source chemical reaction database, and suggests potential synthetic pathways to a variety of different molecules.

This is a huge feat, no doubt, but it has taken years and years for Dr. Grzybowski’s lab to build this database, which they intend to keep closed source and license to Reaxis for use in their software.

I think that keeping this software closed-source makes great sense economically speaking.  Surely there is a huge need for this kind of software, and the developers of it stand to make loads of money from it.  

However, my prediction is that it will not catch on and will never end up being very good software, because closed source necessarily means competition with, and non-collaboration with, other people who are working to develop the very same software and banging their heads against the same exact problems.

There are thousands of people out there who would be interested in collaborating to make an automated organic synthesis machine, but the person probably best suited to manage that effort is planning to make an inferior product that will sell for much more.

Don’t mistake my scepticism for the “scepticism” that Dr. Grzybowski warns against in the linked article, though.  I have no doubt that a synthesis machine can and will be built.  My scepticism relates to the fact that automating chemical synthesis is an enormous task, and that open collaboration amongst all chemists will be the way to do it.  Dr. Grzybowski, being one of the leading chemical researchers in the world, ought to recognize this better than anybody else and lead the effort.


A Neat Little Guide to Bring you From Ideas To Things

We use this worksheet to walk ourselves from the inception of an idea to the point where we will start manufacturing or licensing it.  The key thing to note is that we do some initial market research, write a provisional patent, and then use the first few rounds of cold calls and sales to get feedback and rethink our product.  The idea is to get a product sales-ready as soon as possible.  Sometimes, after the first round of talking to manufacturers, we determine that we need to go back to the drawing board completely.

Feel free to use the worksheet above yourselves in any way you choose.  It is basically an extraction of Stephen Key's book, Sell Your Ideas With or Without A Patent.

For a detailed course on how to follow this worksheet through and through, check out our blog post, 3 Essential Books to Teach You How to License Out Your Ideas.

3 Essential Books To Teach You How to License Out Your Ideas

Sell Your Ideas With or Without a Patent, by Stephen Key

Really, I ought to recommend both of the Stephen Key books that I’ve read--the other one being One Simple Idea.  Mr. Key takes a patient and charming walk with you from the basics of intellectual property law to the things you should keep in mind while licensing out your idea.  This guy is not just knowledgeable, he is a brilliant strategist, and the advice he gives you on everything from provisional patents to optimizing attorney time flows gem after gem.  His story about defending his “Spinformation” label patent in federal court is pretty wild too.


Patent Pending in 24 Hours, by Richard Stim and David Pressman


Mr. Stim and Mr. Pressman write a clear, concise, and informative walkthrough on how to write a provisional patent.  But the really valuable information here, as far as I can tell, is about the preparation for writing one.  The authors walk you through the patent search process, the decision about whether to write a patent at all, and how to do the patent drawings before getting into the nitty-gritty of writing the patent application itself.  This book is also packed with related IP resources like NDAs and interesting little bits of invention trivia.  If you are interested in writing a provisional patent application yourself, this book is a must-have.


The 4-Hour Workweek, by Tim Ferriss


While Mr. Ferriss does not get too deep into the strategy or mechanics of licensing, he lays out a unique, scalable, and comprehensive business model to follow once you’ve convinced a manufacturer to take on your idea.  Mr. Ferriss believes that a startup ought to outsource as much of its normal operations as possible, thereby allowing the owner of the business to focus on creating new ideas, businesses, and experiences.  This manual on how to create an “outsourced lifestyle” is, in a way, what licensing your ideas out is all about.

Laser Cut Golden Gate Bridge

Spoiler: basically, I just followed this instructable about how to laser cut the Golden Gate Bridge and, after a couple little blips, came out with what you see here.  The linked article may very well be an excellent tutorial, but out of busy-ness or laziness--I’m not sure which--I’ll admit that I never actually read it.

Which turned out to be a great learning experience.  

It started yesterday afternoon, when the 3D printer was down and I noticed a gleaming red piece of acrylic in the scrap bin below the laser cutter.  I had just read the instructable linked above and, being from San Francisco, I wanted to have a go at laser cutting the Golden Gate Bridge.

So I downloaded the file provided in the instructable, thinking this would be a super quick print-and-assemble job, but I was disappointed to find that there were about 30 copies of the bridge in the file.  The file also had the author’s name and the word “instructables” printed on the bridge, neither of which I wanted sullying my little bridge.  Here is my modified file.

For my first cut, the sheet of acrylic was about 3 times thicker than the one the instructable’s author must have used, so the “road” part of the bridge did not fit through the corresponding holes in the towers.  I was also very disappointed to find that the author had designed the bridge to scale, which means it is extremely long and awkward-looking (iconic views of the bridge are almost always at a steep, greater-than-45-degree angle).

I could have easily resized the bridge in the file in about 3 clicks, but by that point the 3D printer was up and running again, so I popped in the thin clear sheet of acrylic that you see in the picture, cut it out, tied some rubber bands on, and now I have a little memento of my faraway home on my desk.  (In the process of rubber-banding, I broke the “road” in half, so the dimensions you see in the picture will be twice as wide if you cut the file as-is.)

If you want to know how to laser cut the bridge, you’re probably better off reading the instructable.  But if you want to learn about how the thickness of your material affects your assembly--a lesson I learn again and again on the laser cutter, which you always forget has a third dimension--I recommend giving it a couple of tries with the file provided above.