MongoDB and Data Analysis of CFL Play-By-Play Data in Python

Aug 3, 2015

In Part 2, we wrote a command line interface to a Python program that grabs all of the CFL play-by-play data for any season with just a few clicks and keystrokes.
To finish this tutorial, we're going to migrate our data from a folder & file structure to a more sophisticated method of data storage: a NoSQL database called MongoDB. There are many reasons to use a database, whether relational (SQL) or non-relational (NoSQL), but one pain point we're addressing here is that our current data storage has two copies of each game for a given season (one under each team's folder). Remember, our end goal is to have a list of 3rd down plays that don't result in punts for a season. To build it, we're going to need to look through every game, and if we naively looked through every game in, say, the 2013 season, we would end up with duplicates of every such row. It would be annoying and inefficient to code up logic to handle the duplicates when we have the ability to design our data source in a better way.

Although we're using MongoDB in this case, we really could have chosen any number of other databases. I prefer it because a) it works beautifully with Python and b) its lazy, flexible nature (it plugs in a lot of gaps on the fly for you) complements data analysis work, which is usually ad-hoc and quick. sqlite3 is a good choice for a SQL alternative.

Part 3: Implementation of MongoDB Storage and Analysis Script

Before we get into the implementation, there's some knowledge regarding MongoDB worth going over.

MongoDB Introduction

If you want a comprehensive tutorial and introduction, I encourage and recommend the documentation on their website. For a much briefer primer, all you need to know is:

- A document is MongoDB's basic unit of data: a set of key-value pairs, much like a Python dictionary.
- A collection is a group of documents, and supports searching, adding, removing, and updating them.
- A database holds one or more collections.
- Unlike a SQL database, you don't declare a schema up front; databases, collections, and fields get created on the fly as you use them.
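To make those terms concrete, here's a minimal pymongo sketch you can run against a local server (the database and collection names here are throwaway examples of my own, not part of our project):

from pymongo import MongoClient

client = MongoClient()                 # connect to a local MongoDB server
db = client['scratch_db']              # databases are accessed by key look-up...
collection = db['scratch_collection']  # ...and so are collections

# A document is just a dictionary; inserting one creates the database
# and collection on the fly if they don't already exist.
collection.insert({'sport': 'football', 'league': 'CFL'})
print collection.find_one({'league': 'CFL'})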

Implementation of the Data Migration from Csvs to MongoDB

Here's what we want accomplished:

- For a given season, walk through every team's folder of .csvs.
- Turn each game's .csv into a document holding the play-by-play rows plus some identifying information.
- Store each game exactly once in MongoDB, skipping the duplicate copy that lives in the other team's folder.

Let's set up our program's docopt interface - remember what we did at the end of Part 2 after writing our functions? This time, we'll write the interface first instead, and work our way down to implement the missing pieces required to finish our program.

"""
Usage:
    store_csv_info_to_db.py (--season=<season>)
"""
import os
from hashlib import md5
from pymongo import MongoClient
from datetime import datetime
from docopt import docopt

from Services.constants import PATH_TO_DATA
from Services.services import convert_csv_to_list, get_teams_for_given_season

if __name__ == '__main__':
    args = docopt(__doc__)
    season = args['--season']
    cities = get_teams_for_given_season(season)

    run_storage_of_games_on_season(season, cities)
Nothing new here except for some of the libraries we're importing, but I'll expand on those when the time is right. When running this script, we supply the season year of the data we want to migrate. In Part 2, we defined a handy get_teams_for_given_season function; so handy, in fact, that we've moved it to a general services.py file so that multiple files can use its functionality through imports. Notice that from the same file, we're importing another useful function, convert_csv_to_list. As the name suggests, it takes the path to a .csv and returns it as a list of lists (which we will be dumping into the "list_of_game_rows" field of a game document). It's defined as follows:
import csv

def convert_csv_to_list(path_to_csv):
    with open(path_to_csv, 'rU') as f:
        reader = csv.reader(f)
        list_of_game_rows = [row for row in reader]

    return list_of_game_rows
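For example, pointing it at one of the game files from Part 2 gives back one inner list per row of the .csv (the path here is hypothetical; yours depends on where your data lives):

rows = convert_csv_to_list('Data/2013/BC/Csvs/some_game.csv')  # hypothetical path
print len(rows)  # number of rows in the .csv
print rows[0]    # the first row, as a list of strings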
Using the season given and its associated list of teams, we want to run the migration using run_storage_of_games_on_season. Let's write that function now.
def run_storage_of_games_on_season(season, list_of_cities):
    client = MongoClient()  # (1)
    season_db = client['CFL_' + season]  # (2)
    games_collection = season_db['games']

    for city in list_of_cities:
        print 'Storing games for {0}'.format(city)
        path_to_csvs = os.path.join(
            PATH_TO_DATA, season, city, 'Csvs'
        )
        all_csvs = [f for f in os.listdir(path_to_csvs) if f.endswith('.csv')]
        for csv in all_csvs:
            game_data = get_info_from_game(season, city, csv)  # (3)
            saved_game = games_collection.find_one(
                {'game_id': game_data['game_id']}  # (4)
            )
            if not saved_game:
                games_collection.insert(game_data)  # (5)
  1. We initialize a client for a MongoDB instance; I'll show you how to run a local MongoDB server for this client when I go over running these files at the end of the article.
  2. From our client, we can access databases using key look-ups. The convention for our database names in this project is 'CFL_' prepended to the season, e.g. CFL_2013. Collections can similarly be accessed by name using a key look-up on a database object (here, our games collection is named 'games').
  3. Up to this point, we just iterate through all of the teams, find all of the .csvs, and transform each .csv into a document that we want to store in MongoDB. We'll define get_info_from_game immediately after these notes.
  4. Now we can make good on my earlier promise of explaining why "game_id" is a necessary key in our game document. MongoDB automatically assigns a unique "_id" to an *inserted* document, but our .csvs have no such distinction of uniqueness aside from the play-by-play contents themselves (with the assumption that no two games will have the exact same play-by-play, which is pretty safe given how detailed the play descriptions are). The workflow for each .csv is basically: transform it into a document -> check if this game is already in MongoDB -> if it isn't, add it; otherwise, do nothing. We do the check using find_one - pymongo's interface to findOne - to look for a game document with the same "game_id" hash, which would imply the same game. Note that find_one takes a dictionary as the query argument; adding more key-value pairs would make your query even more specific.
  5. Remember that a collection supports searching/adding/removing/updating of documents. Here, we're adding. (An alternative, sketched after these notes, is to let MongoDB enforce uniqueness itself with a unique index.)

That covers the storage loop. We'll be finished with this task once we've gone through the details of turning a .csv into a document by implementing get_info_from_game.
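First, though, the aside promised in note 5: instead of checking with find_one before every insert, you can have MongoDB enforce uniqueness itself with a unique index on "game_id". A minimal sketch of that alternative (same pymongo calls as above, plus create_index):

from pymongo import MongoClient
from pymongo.errors import DuplicateKeyError

client = MongoClient()
games_collection = client['CFL_2013']['games']

# A unique index makes MongoDB reject any second document
# with the same "game_id" value.
games_collection.create_index('game_id', unique=True)

try:
    games_collection.insert(game_data)  # game_data as built by get_info_from_game below
except DuplicateKeyError:
    pass  # this game is already stored; nothing to do

With that aside done, here's get_info_from_game: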
def get_info_from_game(season, city, csv):
    game_data = {}  # (1)
    game_data['away_team_info'] = {}
    game_data['home_team_info'] = {}

    csv_path = os.path.join(PATH_TO_DATA, season, city, 'Csvs', csv)

    list_of_game_rows = convert_csv_to_list(csv_path)  # (2)

    away_city = list_of_game_rows[0][5]
    home_city = list_of_game_rows[0][6]

    game_data['created_on'] = datetime.today()  # (3)
    game_data['away_team_info']['city'] = away_city
    game_data['home_team_info']['city'] = home_city
    game_data['list_of_game_rows'] = list_of_game_rows

    game_rows_as_text = ''.join([''.join(row) for row in list_of_game_rows])  # (4)
    game_data['game_id'] = md5(game_rows_as_text).hexdigest()

    return game_data
  1. Remember that a document is just a dictionary.
  2. Here we're using the service function defined above.
  3. Here we're making use of the datetime library to mark the exact time of creation of this object.
  4. We use md5 from the standard library's hashlib to create the hash id for our "game_id" from our play-by-play game rows. We use .join() to flatten our list of lists into a single string to pass into the md5 function, as it doesn't know what to do with a list.
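A quick illustration of that flattening-and-hashing step, with toy rows standing in for real play-by-play data:

from hashlib import md5

list_of_game_rows = [['1', '1', 'Rush'], ['1', '2', 'Pass']]  # toy rows
game_rows_as_text = ''.join([''.join(row) for row in list_of_game_rows])
print md5(game_rows_as_text).hexdigest()  # identical rows always produce the identical hash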
The benefits may not be immediately obvious with this example, as our document has only a minor amount of additional information compared to our .csv. However, the benefits grow as your project grows: you can add fields that contain meta information about the game (for instance, the starters of the game). Suppose you've written a function that finds the starters of a game - with a .csv, you would have to call it every time you wanted that information (a bad practice that gets worse as the function's run time grows). With our database, you can instead call it once, store the result in a field of the document, and then access it by the field key whenever you want the information.
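Here's a sketch of that pattern, where get_starters is a hypothetical (and potentially slow) function you've written:

def store_starters_for_game(games_collection, game):
    # Compute the expensive result once...
    starters = get_starters(game['list_of_game_rows'])  # hypothetical helper
    # ...then cache it on the document for cheap look-ups later.
    games_collection.update(
        {'game_id': game['game_id']},
        {'$set': {'starters': starters}}
    )

# Later, anywhere in your code:
# game = games_collection.find_one({'game_id': some_id})
# starters = game['starters']  # no recomputation needed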

Running the Data Migration from Csvs to MongoDB

Before running this, you need a MongoDB server running. You can start one locally by simply opening up a new tab in your Terminal and typing:

mongod

Running the following command in another Terminal tab will get your 2013 .csvs migrated over to MongoDB:

PYTHONPATH=. python Tasks/store_csv_info_to_db.py --season=2013
You should see the following output:

Image of migration output

To see your changes, reproduce the following commands in your own Terminal - the numbered notes below explain each one in order (the commands should be sufficient for you to learn enough by doing):
mongo
show dbs
use CFL_2013
show collections
db.games.find().count()
db.games.findOne()
    
  1. This command is your interface to MongoDB from the command line. You should see output indicating that you are now in the Mongo shell, which can interpret the following five commands.
  2. This command lists all databases for this client.
  3. This command switches to the 'CFL_2013' database that our previously run script created.
  4. This command shows all of the collections in the 'CFL_2013' database; right now, we only have the 'games' collection.
  5. This command gives you the total number of game documents in the collection; you should have 85, which agrees with the number of games on the CFL 2013 schedule website.
  6. This command shows you what one of the game documents looks like.
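If you'd rather poke around from Python instead of the Mongo shell, the equivalent pymongo calls (as of the pymongo versions current when this was written) look like this:

from pymongo import MongoClient

client = MongoClient()
print client.database_names()            # like 'show dbs'
season_db = client['CFL_2013']           # like 'use CFL_2013'
print season_db.collection_names()       # like 'show collections'
print season_db['games'].find().count()  # like db.games.find().count()
print season_db['games'].find_one()      # like db.games.findOne()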

Getting All 3rd Down Plays

At this point, no new tools are needed to get the job done. We'll use what we learned about pymongo to connect to our database, and for each game, we do a simple conditional check: if the down is a 3rd down and the (conveniently stored) play type is not a punt, we add that play as a row to our resulting output .csv.

    """
    Usage:
        get_3rd_down_plays_excl_punts.py (--season=)
    """
    
    import csv
    from pymongo import MongoClient
    from docopt import docopt
    
    DOWN_INDEX = 1
    TYPE_INDEX = 2
    
    
    def write_3rd_down_plays_excl_punts_for_season(season):
        client = MongoClient()
        season_db = client['CFL_'+season]
        games_collection = season_db['games']
    
        filename = '{0}_3rd_down_plays_excl_punts.csv'.format(season)
        with open(filename, 'w') as f:
            writer = csv.writer(f)
            for game in games_collection.find():
                for row in game['list_of_game_rows']:
                    down = row[DOWN_INDEX].strip()
                    play_type = row[TYPE_INDEX].strip()
                    if down == '3' and play_type != 'Punt':
                        writer.writerow(row)
        print 'Done creating {0}'.format(filename)
    
    
    if __name__ == '__main__':
        args = docopt(__doc__)
        season = args['--season']
    
        write_3rd_down_plays_excl_punts_for_season(season=season)
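Running it mirrors the migration script (the Analysis/ path matches the repo layout linked below):

PYTHONPATH=. python Analysis/get_3rd_down_plays_excl_punts.py --season=2013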
                

The Result

Image of final csv of 3rd down plays without punts

To view the final version of these files, visit https://github.com/stevenwu4/CFL/blob/master/Tasks/store_csv_info_to_db.py and https://github.com/stevenwu4/CFL/blob/master/Analysis/get_3rd_down_plays_excl_punts.py

There we have it - the 3rd down plays we were looking for, for any season. From here, we could take this in a number of different directions.

This work could have been condensed into one script, but if anything breaks or changes in any part of the workflow (to name a few examples: the scraping could break when the site changes, or you might want to edit the desired output to exclude punt singles as well as punts), you would need to restart the whole process - wastefully re-requesting HTML data that you've already seen. Splitting the process up lends itself to results that are reproducible and scalable, attributes you should always strive for when programming. My hope is that this tutorial helps you understand not only why building a system of programs in this manner is more desirable, but *how* to actually go about doing it. Thanks for reading!

>> Click here to go back to Part 1 in this series
>> Click here to go back to Part 2 in this series