Digging Through A Treasure Trove of FiveThirtyEight Data in Python!

August 17, 2016

In 2008, Nate Silver came within one percentage point of perfectly predicting the popular vote results of the Presidential Election. In 2014, Silver’s team at FiveThirtyEight recorded exactly how many clouds Bob Ross painted over the course of his show’s 31 seasons.

FiveThirtyEight is not just a crack team of data-driven journalists. They’re also caretakers of a treasure trove of data readily available for our consumption on GitHub. Using Python, we can easily parse through the data to find nuggets of wisdom on everything from Bob Ross, to political polls, to how to break FIFA.

What You’ll Need

You’ll need Python 2 or 3 for this tutorial. I’m using Python 2.7.

You’ll also need pandas (not the black-and-white fluffy bears, but the Python-friendly kind) and the requests library.

Employing The Help of Pandas

Python, along with Pandas, is going to do a lot of our heavy lifting. The pandas data analysis library gives us a very easy, intuitive way to sort through our csv data, straight from the command line. Once we get set up, we can name what we’re looking for, and get a ton of information with simple commands.

Examining Mission Critical Data — How Many Bushes Did Bob Ross Paint?

Fire up your terminal and create a file called FiveThirtyEight.py

Next, we need to create an environment fit for pandas. Let’s run the following command in the terminal itself (the shell, not the Python prompt):

pip install pandas requests

Now let’s get a Python session going. Type the command python and you’ll land at the >>> prompt.

Great. We’re ready for Pandas. Let’s invite them in along with numpy, a package for easily computing large sets of numbers. We’ll also use the requests library to get the data from FiveThirtyEight’s GitHub where their csv chock full of Bob Ross data lives.

Tragically, we know this csv won’t be changing any time soon. Bob Ross has passed on and won’t be adding to our data set. R.I.P Bob Ross.

But if you’re working with csvs that are constantly being updated, using requests ensures you get current data. Instead of downloading a copy of a csv and hoping it’s still up to date, you can request the data straight from the GitHub repo each time you run your code.

Making Requests To FiveThirtyEight’s GitHub Repo

StringIO wraps the text of our csv in a file-like object, much like an open file, so pandas can read it in as a data set.

>>> import pandas as pd
>>> import numpy as np
>>> import requests
>>> from StringIO import StringIO

 
If you’re using Python 3, the StringIO module no longer exists, so that last import will fail. You’ll want to use BytesIO instead and run the command

>>> from io import BytesIO

 
Great. Now we’re set up to request data from FiveThirtyEight’s GitHub repo. We’ll do that using requests and the .get command. The response variable holds the result of our call to the FiveThirtyEight url, and the content variable holds the body of that response. StringIO wraps the content in a file-like object that pandas can read. Without it, pandas would treat a plain string as a file path or url to open rather than as csv data. Once we have that content, we’ll load it into a dataframe using the .read_csv function.

>>> url = 'https://raw.githubusercontent.com/fivethirtyeight/data/master/bob-ross/elements-by-episode.csv'
>>> response = requests.get(url)
>>> content = StringIO(response.content)
>>> df = pd.read_csv(content)

 
For the Python 3 folks, define content like so:

>>> content = BytesIO(response.content)
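As a minimal sketch of the same pipeline for Python 3, the steps can be wrapped in two small functions. The names frame_from_csv_text and fetch_frame are made up for illustration, and the sample rows are invented stand-ins (the EPISODE and TITLE column names are an assumption about the file's layout), so the parsing step can be tried without a network connection.

```python
import io

import pandas as pd


def frame_from_csv_text(csv_text):
    """Wrap csv text in a file-like object so pandas parses it as data."""
    return pd.read_csv(io.StringIO(csv_text))


def fetch_frame(url):
    """Download a csv over HTTP and parse it (hypothetical helper)."""
    import requests  # imported here so the offline demo below runs without it

    response = requests.get(url)
    response.raise_for_status()  # fail loudly on a 404 or similar
    return frame_from_csv_text(response.text)


# Invented rows, NOT real episode data; column names are assumed.
sample_text = (
    "EPISODE,TITLE,CLOUDS,TREES\n"
    "X01,Made-Up Meadow,1,1\n"
    "X02,Made-Up Mountain,0,1\n"
    "X03,Made-Up Sky,1,0\n"
)
sample_df = frame_from_csv_text(sample_text)
print(sample_df.shape)  # (3, 4)
```

Calling fetch_frame(url) with the url defined earlier should give the same kind of dataframe as the step-by-step commands, with the added benefit that a failed download raises an error instead of handing pandas an error page.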

 
Rad! We now have all of our Bob Ross data loaded in our Python session. Let’s see how we can view specific subsets of data. Using the df.head() command, pandas will show the first five rows of our csv file. In this case, that’s the first five episodes. If we wanted to see the first 10 episodes, we’d just pass in an argument like:

>>> df.head(10)
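To see what head() does without the full data set, here is a small sketch on an invented twelve-row frame. The EPISODE values are placeholders, not real episode codes.

```python
import pandas as pd

# Twelve placeholder rows standing in for episodes
toy = pd.DataFrame({
    "EPISODE": ["X%02d" % n for n in range(1, 13)],
    "CLOUDS": [1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0],
})

first_five = toy.head()    # no argument: first five rows
first_ten = toy.head(10)   # explicit count: first ten rows

print(len(first_five), len(first_ten))
print(toy.shape)  # (rows, columns) for the whole frame
```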

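Now, let’s answer the burning question: how many clouds did Bob Ross paint? Each row of the csv is an episode, and the “CLOUDS” column holds a 1 when that episode’s painting includes clouds and a 0 when it doesn’t. By calling upon the column “CLOUDS” and asking for the sum of that column, pandas will deliver us the answer.

>>> df["CLOUDS"].sum()

Bob Ross painted happy little clouds in 179 episodes.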
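The sum works as a count because of how the column is encoded. Here is a sketch on invented data, assuming the column holds a 1 for an episode with clouds and a 0 otherwise:

```python
import pandas as pd

# Invented 0/1 flags for six placeholder episodes
toy = pd.DataFrame({"CLOUDS": [1, 0, 1, 1, 0, 0]})

total = toy["CLOUDS"].sum()                        # adds up the 1s
episodes_with_clouds = (toy["CLOUDS"] == 1).sum()  # counts matching rows
tally = toy["CLOUDS"].value_counts()               # rows per distinct value

print(total, episodes_with_clouds)
print(tally)
```

Both approaches give the same number here. If a column ever held values other than 0 and 1, the sum would stop being a count, so value_counts() is a quick way to check what a column actually contains.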


If we wanted to calculate the probability he’ll paint a cloud in any given episode, we could run

>>> df["CLOUDS"].mean()

 
Blam. There’s a 44% likelihood that Bob Ross will paint a cloud. But what if we wanted to see how many times Bob Ross has graced his happy little clouds with a tree friend?

By grouping the episodes by the “TREES” column and summing the “CLOUDS” column within each group, we see how many times he painted clouds with and without trees.

>>> df.groupby("TREES")["CLOUDS"].sum()

When you run this, you’ll see that in the row labeled 0, there are 41 episodes where Bob Ross painted clouds with no trees, and in the row labeled 1, there are 138 episodes where he painted the two together. When you add 41 to 138, you get 179, the total number of episodes with clouds.
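The same grouping can be tried on a tiny invented frame to see where each number comes from. The flags below are placeholders, not real episode data:

```python
import pandas as pd

# Invented 0/1 flags for eight placeholder episodes
toy = pd.DataFrame({
    "TREES":  [1, 1, 0, 1, 0, 1, 1, 0],
    "CLOUDS": [1, 0, 1, 1, 0, 1, 0, 1],
})

# One row per TREES value; each entry is the CLOUDS total within that group
clouds_by_trees = toy.groupby("TREES")["CLOUDS"].sum()

clouds_without_trees = clouds_by_trees.loc[0]
clouds_with_trees = clouds_by_trees.loc[1]

# The group totals add back up to the overall column total
assert clouds_without_trees + clouds_with_trees == toy["CLOUDS"].sum()
print(clouds_by_trees)
```

The result is a Series indexed by the TREES value, so .loc[0] and .loc[1] pull out the "no trees" and "with trees" totals by label.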

More Fun With CSVs and Python Ahoy!

Now that you know how to load CSV data with pandas and summarize it, you can answer all sorts of questions straight from the command line.

Resources

If you’re looking for more opportunities to flex your Python skills here are a few resources.

 
If you’ve got any hot data, Bob Ross jokes, or other quips — shoot me a note @kylekellyyahner or kyleky@twilio.com