Explaining the Goal By Way of Explaining This Parallel Axes Chart
It's not going to be easy. The yellow lines in the graph above represent tracks from a playlist of songs that I "love" (personal classics), and the green/teal lines represent data about songs that I just "like". Maybe it's just because it's such a rare and beautiful occasion to find a song that you truly love, but I'm pretty sure that I've never discovered a track, album, or artist that I truly "love" (as opposed to "like") through a music recommendation service such as Spotify's "Discover Weekly" playlist, which gets algorithmically created for me every week. It's not that I don't like the songs that Spotify's algorithms recommend to me. Rather, it's as though there's some kind of regression toward the mean going on: I'm basically cool with all the tracks they recommend, but never excited.
What I am starting to investigate right now is how to go that extra step towards a recommendation system that will allow me to discover music that I truly love. The rather messy parallel coordinates graph displayed above is the product of one of my first forays into this question.
What can we tell from that graph?
Two things. First, Echo Nest is incredible at predicting what music I am going to like based on their criteria (or their criteria are too generic; but since Spotify's recommendations are always in the right ballpark, I'll give them the benefit of the doubt and assume the former). Second, this is going to be a very difficult challenge. The fact that those lines are in a state of complete chaos indicates that no clear attribute, or combination of attributes, among those I've collected will easily separate songs that I "love" from songs that have been recommended to me. This may mean that I need to gather more features than those available through the Echo Nest API. (One feature that I would really love to have, but which Spotify does not provide, is my personal play count for each song. That would allow me to run a linear regression across all of my music, building an estimator of play count based on these features.)
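If that play-count feature ever became available, the regression I have in mind could be sketched like this. Everything here is hypothetical: the feature values and play counts are invented, and a plain least-squares fit via NumPy stands in for whatever estimator I'd actually use.

```python
import numpy as np

# Hypothetical sketch: if per-track play counts were available, fit a linear
# model predicting play count from Echo Nest-style audio features.
# All feature values and play counts below are made up for illustration.
features = np.array([
    [0.62, 0.71],   # [energy, valence] for track 1
    [0.81, 0.40],
    [0.35, 0.55],
    [0.90, 0.80],
])
play_counts = np.array([120.0, 45.0, 60.0, 200.0])

# Add an intercept column, then solve the least-squares problem.
X = np.hstack([features, np.ones((features.shape[0], 1))])
coef, *_ = np.linalg.lstsq(X, play_counts, rcond=None)

# Predicted play counts for the training tracks.
predicted = X @ coef
```

A song whose predicted play count far undershoots its actual play count would be exactly the kind of "love, not just like" signal these Echo Nest features currently fail to capture.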
Below is another way of representing the same conclusion. For each feature, the median (the horizontal line in the middle of each box in the boxplots below) of one group falls within one standard deviation of the other's, indicating that there is very little overall difference between the two sets. (Sorry for the sloppy labeling and color coding, by the way. Green/"Yes" in this figure represents tracks that I love; blue/"No" encodes songs that I just like.) In future calculations, I'm going to omit time signature and mode, because I don't think they explain any variance. (Caveat: I want to run a query that finds the most popular songs in time signatures other than 3 or 4, and in modes that are neither major nor minor.)
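The numbers behind those boxplots are just per-group medians and spreads, which pandas computes in a couple of lines. The toy dataframe below uses invented values; the real one holds the Echo Nest features for every track.

```python
import pandas as pd

# Toy data standing in for the real feature table; all values are invented.
df = pd.DataFrame({
    "loved":   ["Yes", "Yes", "No", "No", "No", "Yes"],
    "energy":  [0.70, 0.55, 0.60, 0.65, 0.58, 0.62],
    "valence": [0.40, 0.80, 0.50, 0.45, 0.60, 0.75],
})

# Per-group medians and standard deviations: the numbers the boxplots encode.
medians = df.groupby("loved")[["energy", "valence"]].median()
spreads = df.groupby("loved")[["energy", "valence"]].std()
```

Comparing `medians` row-to-row against `spreads` is the same "are the medians within one standard deviation of each other?" check described above, just in table form.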
The grand conclusion that I have come to in this first stage of analysis is that there are no obvious group-wide characteristics that will allow me to predict whether or not I will love a song. K-nearest neighbors is probably not a great classifier for this problem (but I'm going to try it anyway), and I may need to come up with some other features for these songs in order to get good separation.
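For reference, here is what the k-nearest neighbors attempt would look like. This is a from-scratch sketch with invented 2-D feature vectors, not my actual pipeline; the point is just the distance-and-majority-vote mechanic, and why chaotic, overlapping features would doom it.

```python
import numpy as np

def knn_predict(train_X, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    dists = np.linalg.norm(train_X - query, axis=1)   # Euclidean distances
    nearest = np.argsort(dists)[:k]                   # indices of k closest
    labels, counts = np.unique(train_y[nearest], return_counts=True)
    return labels[np.argmax(counts)]                  # majority label

# Invented 2-D feature vectors (think energy, valence) with toy labels.
X = np.array([[0.90, 0.80], [0.85, 0.90], [0.80, 0.85],
              [0.20, 0.10], [0.10, 0.20], [0.15, 0.15]])
y = np.array(["love", "love", "love", "like", "like", "like"])

print(knn_predict(X, y, np.array([0.88, 0.82])))  # nearest neighbors are all "love"
```

This works in the toy example only because the two classes form clean clusters; with the tangled parallel-coordinates lines above, the nearest neighbors of a "love" track would mostly be "like" tracks, which is exactly why I'm skeptical of KNN here.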
Getting the Data
In order to get the data represented above, I started by choosing two playlists: one containing all of my favorite songs, and one containing songs that I just "like." I then used spotipy, a Python wrapper for the Spotify Web API, to log in to my account and download playlist metadata (song names, artists, albums, etc.) to disk. I wrote a script that parsed and cleaned this data and put it into a Pandas dataframe.
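The parse-and-clean step looks roughly like this. The `sample_page` dict below mimics a stripped-down page of the Spotify playlist-tracks response (the real pages come back from spotipy after authenticating), so treat the exact payload shape as an assumption for illustration.

```python
import pandas as pd

# Mimics one (heavily stripped-down) page of Spotify's playlist-tracks
# response; in the real script these pages come from spotipy after login.
sample_page = {
    "items": [
        {"track": {"name": "Song A",
                   "artists": [{"name": "Artist 1"}],
                   "album": {"name": "Album X"}}},
        {"track": {"name": "Song B",
                   "artists": [{"name": "Artist 2"}],
                   "album": {"name": "Album Y"}}},
    ]
}

def page_to_rows(page):
    """Flatten one page of playlist items into plain dicts for a DataFrame."""
    rows = []
    for item in page["items"]:
        track = item["track"]
        rows.append({
            "song":   track["name"],
            "artist": track["artists"][0]["name"],  # first credited artist
            "album":  track["album"]["name"],
        })
    return rows

df = pd.DataFrame(page_to_rows(sample_page))
```

The real script loops over every page of each playlist and concatenates the results into one dataframe per playlist.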
The second task was to get the song and artist data from the Echo Nest API. To do this, I first looped through every track and made a multi-bucket query to Echo Nest requesting track data like "energy" and "valence" (positivity), and merged this data into my dataframe. Then I made a second bucket query for every track requesting artist data like "hotttness" (whatever that is), and merged that into my dataframe as well. The visualizations above don't show it, but I also captured a list of genres associated with each track's artist and a sometimes huge summary of the artist provided by Echo Nest. I will use these later in a Naive Bayes classifier.
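The merge itself is a one-liner in pandas. The audio values below are invented stand-ins for the Echo Nest bucket results, keyed on song name here purely for illustration.

```python
import pandas as pd

# Playlist metadata from the previous step (abbreviated).
playlist = pd.DataFrame({
    "song":   ["Song A", "Song B"],
    "artist": ["Artist 1", "Artist 2"],
})

# Invented stand-ins for the per-track Echo Nest bucket results.
audio = pd.DataFrame({
    "song":    ["Song A", "Song B"],
    "energy":  [0.72, 0.41],
    "valence": [0.55, 0.80],
})

# Left merge keeps every playlist track, attaching audio data where available.
merged = playlist.merge(audio, on="song", how="left")
```

The artist-level data ("hotttness", genres, summaries) gets folded in with a second merge of the same shape.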
This second querying task took a long time (about 10 minutes), but I don't know of a more efficient way to do it. The slowness was exacerbated by the fact that I had to pause the program after each query so as not to exceed the rate limit of 20 queries per minute.
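A minimal version of that pausing logic, written as a generator that spaces out work to stay under a given per-minute limit (the 20/min figure matches the rate limit above; the generator approach is just one way to structure it):

```python
import time

def throttled(iterable, max_per_minute=20):
    """Yield items no faster than `max_per_minute`, sleeping between them."""
    interval = 60.0 / max_per_minute
    last = None
    for item in iterable:
        if last is not None:
            # Sleep off whatever remains of the minimum gap between requests.
            wait = interval - (time.monotonic() - last)
            if wait > 0:
                time.sleep(wait)
        last = time.monotonic()
        yield item

# Usage sketch (query_echo_nest is a placeholder for the real API call):
#   for track in throttled(tracks, max_per_minute=20):
#       query_echo_nest(track)
```

At 20 queries per minute this inserts roughly a 3-second gap between calls, which is where the ~10 minutes for a couple of hundred tracks comes from.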
All of the code for this step is posted on my GitHub. Each step was prototyped in an IPython notebook and then optimized in a script. Everything is posted.