Last week, I embarked on a mission to build a machine learning classifier that could predict whether I would love a song fed to me at random by a recommendation system such as LastFM, Pandora, or Spotify Radio. My motivation is that I generally don't find these music recommendation services very fruitful for discovering music that I truly love. They come close, but the music is usually off in some way that I can't put my finger on: almost always in the right ballpark, but never on the mark. This matters because I know there is music out there that I would love, perhaps even songs that are likely to be recommended to me at some point, that these services are not showing me.
Basically, what I've done in the last week is look at a large variety of classifiers and see how all of them do on a dataset with a list of songs that I absolutely love versus songs that I've had recommended to me. The dataset was scraped from the Echo Nest and the Spotify API.
First, I built two playlists on Spotify: one that aggregates tracks that have been recommended to me, and another made up of about 100 of my favorite songs. I gathered data about the tracks and artists from the Spotify and Echo Nest APIs and assembled it into a Pandas dataframe. Each row of the dataframe corresponds to a track, and each is labeled either with a "1", indicating it came from the set of tracks that I "love", or a "0", indicating that it came from the playlist of songs that I just "like". The columns of my dataset included metadata such as track_artist, spotify_id, etc., but also a set of 14 numerical columns corresponding to features I scraped from Echo Nest ('track_popularity', 'discovery', 'familiarity', 'hotttnesss', 'acousticness', 'danceability', 'duration', 'energy', 'instrumentalness', 'liveness', 'loudness', 'speechiness', 'tempo', 'valence'). I don't know what all of these mean, but each track is a point in the vector space spanned by these features. In my previous blog post, I show a graph that plots each track on a set of parallel axes corresponding to each of these features. I also show a set of boxplots of the distribution of each feature for each set of songs.
Each row also had a set of aggregated artist reviews. I dropped any tracks that were missing data from any column.
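To make the shape of that data concrete, here is a minimal sketch of the labeling and cleaning step. The rows, the handful of columns, and the "love" and "reviews" column names are made up for illustration; the real dataframe had all 14 Echo Nest features and came from the APIs.

```python
import numpy as np
import pandas as pd

# Hypothetical rows standing in for the scraped tracks (values are invented).
tracks = pd.DataFrame(
    {
        "track_artist": ["Artist A", "Artist B", "Artist C", "Artist D"],
        "energy": [0.81, 0.42, np.nan, 0.65],
        "tempo": [120.0, 98.5, 140.2, 110.0],
        "reviews": ["lush and warm", "a bit flat", "driving beat", None],
        "love": [1, 0, 1, 0],  # 1 = "love" playlist, 0 = recommended ("like")
    }
)

# Drop any track that is missing a value in any column.
complete = tracks.dropna(axis=0, how="any").reset_index(drop=True)

# Fraction of the remaining tracks in each class, to check the balance.
class_balance = complete["love"].value_counts(normalize=True)
print(len(complete), class_balance.to_dict())
```

Checking the class balance right after dropping rows is worthwhile, since dropping incomplete tracks can shift it.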
The dataset was imbalanced, with about two thirds of the 95 remaining tracks coming from the set of songs that I "love." This is a big problem, I think, because it does not reflect the reality of the situation: I am recommended a ton of songs, but I only "love" a handful of them. However, I had to make do with this imbalanced dataset, because I only started aggregating the songs I am being recommended last week.
I am currently building up a playlist of the recommended songs that I really like. Repeating my study on that dataset will be much more true to the reality of the situation and should give more sensible results. As you will see in the next section, recall for songs that I "like" is extremely low and precision for identifying songs I "love" is also very low, which may be an artifact of my having comparatively few songs in that class while training my classifiers to identify songs that I "love".
I used three different classification techniques: learning from Echo Nest features, learning from analysis of aggregated reviews, and stacked ensembles.
Learning from Echo Nest Numerical Features
Pretty simple. I fed the Echo Nest numerical features into a variety of binary classifiers, cross-validated, and plotted the results. The only classifiers I tried on their own, outside of an ensemble, were logistic regression and k-nearest neighbors. Neither obtained higher than 60% accuracy on its own. The f1-score for identifying songs I love with logistic regression stayed right around .6 as well, which (I think) is what it would be if randomly classifying the data.
I did not plot precision/recall for the k-nearest neighbors or k-means models that I ran, because I was mainly interested in seeing whether there were any clear groupings with very high concentrations of songs that I love. There were, of course, but it is not clear whether that is just due to the fact that my dataset was imbalanced.
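A rough sketch of this kind of cross-validated comparison is below. The data is random noise with the same shape as my dataset (95 tracks, 14 features, about two thirds labeled "love"), and the feature scaling, the 5 folds, and k=5 are assumptions for the example rather than the settings I actually used.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real features and labels (pure noise).
rng = np.random.default_rng(0)
n_tracks, n_features = 95, 14
X = rng.normal(size=(n_tracks, n_features))
y = (rng.random(n_tracks) < 2 / 3).astype(int)  # roughly two thirds "love"

# Stratified folds keep the class ratio similar in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Scaling is an assumption here; features like tempo and loudness live on
# very different scales, which matters for both models.
models = {
    "logistic regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "k-nearest neighbors": make_pipeline(
        StandardScaler(), KNeighborsClassifier(n_neighbors=5)
    ),
}

scores = {}
for name, model in models.items():
    accuracy = cross_val_score(model, X, y, cv=cv, scoring="accuracy").mean()
    f1_love = cross_val_score(model, X, y, cv=cv, scoring="f1").mean()
    scores[name] = (accuracy, f1_love)
    print(f"{name}: accuracy={accuracy:.2f}, f1 (love)={f1_love:.2f}")
```

Running the same loop on noise like this gives a baseline to compare the real scores against.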
Learning from Echo Nest Artist Reviews With Naive Bayes Classifier
I tried two different versions of Multinomial Naive Bayes on the reviews for each of the track_artists. The first was a "naive" approach, where I used a TF-IDF vectorizer and scikit-learn's "GridSearchCV" to make the best model I could using just these two tools. The grid search took all night to run, though...
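For reference, the simple approach looks roughly like the sketch below. The toy reviews, the labels, and the parameter grid are invented for the example; my actual grid was much larger.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Invented review snippets; 1 = "love", 0 = "like".
reviews = [
    "warm analog synths and a haunting vocal",
    "intricate guitar work with a haunting melody",
    "lush harmonies and warm production",
    "a haunting, intricate record",
    "warm vocal layered over lush strings",
    "intricate rhythms and lush synths",
    "generic radio pop with a flat chorus",
    "flat production and predictable hooks",
    "predictable pop with a generic beat",
    "a generic, forgettable chorus",
    "forgettable hooks and flat vocals",
    "predictable radio rock",
]
labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

pipeline = Pipeline([("tfidf", TfidfVectorizer()), ("nb", MultinomialNB())])

# Illustrative grid: 2 * 2 * 2 = 8 candidates, each fit once per fold.
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "tfidf__sublinear_tf": [False, True],
    "nb__alpha": [0.1, 1.0],
}

search = GridSearchCV(pipeline, param_grid, scoring="f1", cv=3)
search.fit(reviews, labels)
print(search.best_params_, round(search.best_score_, 2))
```

The number of fits is the product of the option counts times the number of folds, which is why a generous grid over long review texts can easily run overnight.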
The second approach was more complex. First I loaded a dictionary of word sentiments from SentiWordNet. I then used scikit-learn's "FeatureUnion" to incorporate the sentiment and part of speech of each word (both given in the SentiWordNet data) into my Naive Bayes classifier.
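The general shape of a union model is sketched below, under heavy assumptions. TOY_LEXICON and its scores are made up and only mimic the positive/negative score pairs in SentiWordNet, the SentimentTotals transformer is a hypothetical helper, and part of speech is left out entirely, so this is not the model I actually ran.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import FeatureUnion, Pipeline

# Made-up (positive, negative) scores; NOT real SentiWordNet values.
TOY_LEXICON = {
    "warm": (0.6, 0.0),
    "lush": (0.5, 0.0),
    "haunting": (0.3, 0.3),
    "flat": (0.0, 0.5),
    "generic": (0.0, 0.4),
    "forgettable": (0.0, 0.6),
}


class SentimentTotals(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: total positive and negative score per text."""

    def __init__(self, lexicon=None):
        self.lexicon = lexicon

    def fit(self, X, y=None):
        return self  # nothing to learn; the lexicon is fixed

    def transform(self, X):
        lexicon = self.lexicon or {}
        rows = []
        for text in X:
            positive = negative = 0.0
            for word in text.lower().split():
                p, n = lexicon.get(word.strip(".,!?"), (0.0, 0.0))
                positive += p
                negative += n
            # Both totals are non-negative, which MultinomialNB requires.
            rows.append([positive, negative])
        return np.array(rows)


reviews = [
    "warm and lush production",
    "a haunting, warm vocal",
    "lush strings throughout",
    "flat and generic chorus",
    "forgettable, generic hooks",
    "a flat mix",
]
labels = [1, 1, 1, 0, 0, 0]

# FeatureUnion places the TF-IDF columns and the sentiment columns side by side.
features = FeatureUnion(
    [("tfidf", TfidfVectorizer()), ("sentiment", SentimentTotals(TOY_LEXICON))]
)
combined = features.fit_transform(reviews)

model = Pipeline([("features", features), ("nb", MultinomialNB())])
model.fit(reviews, labels)
print(combined.shape, model.predict(["a warm, lush record"]))
```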
The simple model actually performed much better than the complex model, as it turned out. This may be because all reviews are likely to be strongly opinionated. The simple approach may take a more "objective stance." The discrepancy also may be due to the fact that I was not able to perform a grid search on the union model (it takes too long). The simple model was able to reach an f-score of 80 percent consistently, but we should take this with a grain of salt, because the recall for "love" was 100%, and for "like" 0%, indicating that my classifier may have just classified everything as "love."
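A quick bit of arithmetic supports that suspicion. If a classifier labels every track as "love", recall for "love" is 1 and precision is simply the fraction of tracks that really are "love". Assuming that fraction is exactly two thirds (my dataset was only approximately that), the f-score works out as follows:

```python
def f1_if_everything_is_positive(positive_fraction):
    """F1 for the positive class when every sample is predicted positive."""
    precision = positive_fraction  # every prediction is positive
    recall = 1.0                   # so no true positive is ever missed
    return 2 * precision * recall / (precision + recall)


print(round(f1_if_everything_is_positive(2 / 3), 3))  # 0.8
```

So an f-score of 80 percent is what a model that always answers "love" would get on data with this class split.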
I performed two different ensemble approaches. Both used a stacked logistic regression to make a weighted average for a set of predicted probabilities given by several classifiers.
The first approach used a logistic regression and a simple naive Bayes classifier on the bottom level. Using an ensemble allowed me to boost my f1 score up to 0.85, which seems like a meaningful gain, even given the imbalance in the data.
The second approach was basically a free-for-all, where I threw every ensemble classifier I could find at the problem. On the surface, this approach did not work well at all, because the classifiers had very strong disagreements with each other, as you can see in this graph showing the coefficient weights of each classifier's predictions in the upper level of the stack.
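Here is a minimal sketch of a two-level stack like this one. The data is synthetic, and getting the bottom-level probabilities out-of-fold with cross_val_predict is one reasonable way to set it up, not necessarily exactly what I did.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Synthetic stand-ins: 20 tracks, 14 numeric features, one review text each.
rng = np.random.default_rng(0)
labels = np.array([1, 0] * 10)
numeric = rng.normal(size=(20, 14)) + labels[:, None] * 0.5
love_words = ["warm", "lush", "haunting"]
like_words = ["flat", "generic", "predictable"]
reviews = [
    " ".join(rng.choice(love_words if y else like_words, size=4)) for y in labels
]

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Bottom level: one model per kind of input.
numeric_model = LogisticRegression(max_iter=1000)
text_model = make_pipeline(TfidfVectorizer(), MultinomialNB())

# Out-of-fold probabilities of "love", so the top level never sees a
# prediction made by a model that was trained on that same track.
p_numeric = cross_val_predict(
    numeric_model, numeric, labels, cv=cv, method="predict_proba"
)[:, 1]
p_text = cross_val_predict(
    text_model, reviews, labels, cv=cv, method="predict_proba"
)[:, 1]

# Top level: logistic regression over the two probability columns.
stacked = np.column_stack([p_numeric, p_text])
meta = LogisticRegression()
meta.fit(stacked, labels)

# The coefficients show how much weight each bottom-level model receives.
weights = dict(zip(["numeric features", "reviews"], meta.coef_[0]))
print(weights)
```

To score a stack like this honestly, the top-level model should itself be evaluated on held-out tracks rather than on the rows it was fit to.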