Data Scraping CFL Play-By-Play Using BeautifulSoup & Selenium in Python

July 27, 2015

My first foray into sports statistics was a research project in 3rd year for STAT5703, Carleton's Data Mining course. This was our first chance at picking our own dataset to work with, aside from the canonical cars and housing ones that come bundled with R. When the choices were out for the picking, my partner and I immediately jumped at Brian Burke's NFL play-by-play. We poured hours into creating new variables, sorting the data into a different format easier for analysis, and finally creating two models: one for the probability of winning a given game and the other for the number of wins in a given season. It was really satisfying to try my hand at what had always been a black box, and to somewhat succeed.

Amidst this satisfaction, a few things nagged at me. It was really annoying to build the formulas in Excel and do endless copying and pasting. It would have been nice if the data were in the format my partner and I preferred to work with (each season's data was in one Excel file; we ended up splitting each up into an Excel file per team per season). I also follow the NBA far more than the NFL, but unfortunately I couldn't find an NBA dataset online as nicely complete as Brian Burke's NFL one. This feeling of restriction, of being handcuffed and limited by the data I was working with, was a theme throughout the rest of the class. This is *not* a knock on the class, as there was hardly enough time to cover what to do once you have obtained the data - let alone how you go about obtaining it (it is probably the case that many people interested in the field of data mining do not care one bit about how the data is obtained). However, it's my contention that web scraping and the associated data processing skills are a necessary part of the toolbox. They allow you to pursue your own ideas - create your own hypothesis to prove or disprove, gather any information you find relevant to your investigation, and format it in whatever way you believe will maximize your chances of producing meaningful findings.

The aim of this tutorial is to be the tutorial I wish I had stumbled upon when choosing my topic for the aforementioned research project. There are plenty of web scraping 101 tutorials out there; this will be one with an application to sports, with particular attention given to the data processing and analysis tasks that are necessary afterwards. The environment going forward with this tutorial will be:

where you can find the definitive guide on proper installation of Python here. (If you have a different version of Python, or are on a different Operating System, the necessary tweaks to follow this tutorial are minor and can be found through some Google searches.)
If this is new territory for you, ignore the section on virtual environments. Once you have pip installed, you can run the following commands to install the required libraries to complete this walkthrough.
pip install BeautifulSoup4
pip install selenium
pip install docopt
Before we jump into it, there are two guidelines to be conscious of:
  1. The site's terms and conditions; some sites explicitly disallow scraping of their content.
  2. Every site also typically has a robots.txt file, which contains rules about which crawlers are allowed/disallowed, as well as a crawl delay. Always respect the crawl delay! (A quick way to check a robots.txt from Python is sketched below.)
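As a quick illustration (not part of the scraper itself), here is a minimal sketch of checking a site's robots.txt from Python before scraping. It assumes Python 3, where the parser lives in urllib.robotparser (in Python 2 the same class is in the robotparser module); the URLs and the '*' user agent below are just examples.

from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.set_url('http://www.cfl.ca/robots.txt')
parser.read()

game_url = 'http://www.cfl.ca/statistics/statsGame/id/12833'
if parser.can_fetch('*', game_url):
    print('Allowed to fetch {0}'.format(game_url))
else:
    print('robots.txt disallows fetching {0}'.format(game_url))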

The Goal

We want to build a system that can

  1. gather play-by-play information from the CFL's website
  2. store it in an intelligible manner
  3. extract new insights
Specifically, I had a colleague interested in 3rd down plays that don't result in punts (CFL possessions have only 3 downs, compared to the NFL's standard 4).
The end result of the full source code can be found here: https://github.com/stevenwu4/CFL

Part 1: Implementation of the Scraper

Investigating the Data

We want a function that, given a URL from the CFL website, will scrape the play-by-play. Let's look at the first game of the 2015 season for the Ottawa Redblacks: http://www.cfl.ca/statistics/statsGame/id/12833.

Image of play-by-play

All a web scraper does is open up a URL's content - much like a human would with a browser like Google Chrome or Mozilla Firefox - and proceed to extract information from the source content. On the web page, right-click and select 'View Source'. Here, we run into our first roadblock: the play-by-play doesn't appear to be in the source! Try CMD + F for the text of the first few plays in the game: none of the searches will return any results.
No results for play-by-play found in source!

That's weird; if we click the Play by Play button, we see the play-by-play data right there. How can it be there on the webpage and yet at the same time not be there? Pay attention to the URL; if it changed upon clicking the button, we could just follow the redirect to the new URL and scrape that URL's content - but it doesn't change.

Writing the Scraper - Dealing with the Play By Play Button

To go ahead with one solution to this problem, we're going to be using a package called Selenium. Selenium allows us to use Python to programmatically open up a browser and interact with the different elements of the page. Specifically, it will allow us to click the elusive button that reveals the play-by-play HTML we desire.
BeautifulSoup is a library that will take the resulting HTML after the Play by Play button click and allow us easier, more organized access to the information we actually want.

import time
import csv
from bs4 import BeautifulSoup
from selenium import webdriver
PLAYBYPLAY_BTN_XPATH = '//div[@id="playbyplay-button"]/a'
The time library is necessary to respect the rules of http://www.cfl.ca/robots.txt. At the time of this writing, all crawlers are allowed but there is a crawl delay of 2 seconds. The time library has a function, sleep, which lets us delay our crawls.
The constant PLAYBYPLAY_BTN_XPATH is an XPath expression; XPath is a language for precisely identifying elements in an HTML document. We use it to locate the Play by Play button. This raises the question of how to derive an XPath in the first place. There are two options here:
  1. Use references (the link above is a good start, but any of the top results for Googling XPath work fine too) to learn the language. Like any language, it will take getting used to but once you are comfortable with it you'll never look back.
  2. Use auxiliary tools, like Google Chrome's Dev Tools (opened by right-clicking the page and selecting Inspect Element), which gives you a right-click 'Copy XPath' option in the Elements tab (see image below), or popular browser extensions such as XPath Helper that can directly give you a working XPath for any element on the page. A quick way to sanity-check whichever XPath you end up with is sketched after the image below.
Image of screen when using Chrome's Dev Tools to copy an XPath
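One way to sanity-check an XPath before wiring it into the scraper is to point Selenium at the page and count how many elements the expression matches. This is a throwaway sketch, not part of the final script; a good button XPath should match exactly one element.

import time
from selenium import webdriver

driver = webdriver.Firefox()
driver.get('http://www.cfl.ca/statistics/statsGame/id/12833')
time.sleep(10)  # give the page time to load

# find_elements_by_xpath returns a list of every match
matches = driver.find_elements_by_xpath('//div[@id="playbyplay-button"]/a')
print('XPath matched {0} element(s)'.format(len(matches)))

driver.close()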

Now we can start creating the function.
def get_game_rows_from_url(url):
    # Open a Selenium-controlled Firefox window and load the page
    driver = webdriver.Firefox()
    driver.get(url)
    # Give the page time to finish loading before interacting with it
    time.sleep(10)
    # Locate the Play by Play button and click it to render the table
    playbyplay_btn = driver.find_element_by_xpath(PLAYBYPLAY_BTN_XPATH)
    playbyplay_btn.click()
    # Hand the rendered page source off to BeautifulSoup for parsing
    soup = BeautifulSoup(driver.page_source)
Our function get_game_rows_from_url receives a url as input. We use the Selenium package's webdriver to open an instance of Firefox on your computer (note: you need Firefox installed for this to work!), then sleep for a few seconds to give the page time to load. find_element_by_xpath is a method of the webdriver Firefox instance that returns an element of the HTML based on the XPath you give it (in our case, the XPath to the Play by Play button that we defined as PLAYBYPLAY_BTN_XPATH). The returned element has a click method that clicks the button for us. Finally, we load the resulting HTML into BeautifulSoup to focus on the scraping.
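One note on the hard-coded time.sleep(10): Selenium also supports explicit waits, which block only until a condition is met (up to a timeout). Below is a sketch of how the opening of the function could look with an explicit wait instead of a fixed sleep, reusing the PLAYBYPLAY_BTN_XPATH constant from above; this is an alternative approach, not what the final script does.

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

def get_game_rows_from_url(url):
    driver = webdriver.Firefox()
    driver.get(url)
    # Wait up to 30 seconds for the Play by Play button to appear,
    # instead of always sleeping a fixed 10 seconds
    playbyplay_btn = WebDriverWait(driver, 30).until(
        EC.presence_of_element_located((By.XPATH, PLAYBYPLAY_BTN_XPATH))
    )
    playbyplay_btn.click()
    # ...the rest of the function continues as before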

Writing the Scraper: Extracting the Play By Play

Continuing along in get_game_rows_from_url, after making our BeautifulSoup object:

    # Get away/home teams
    away_div = soup.find('div', id='awayteam')
    away_team = away_div.find('h3', class_='cityname').text
    home_div = soup.find('div', id='hometeam')
    home_team = home_div.find('h3', class_='cityname').text

    # Get game rows
    pbp_div = soup.find('div', id='stat-game-pbp')
    pbp_inner_div = pbp_div.find('div', id='pbp-stats')
    pbp_table = pbp_inner_div.find('table', id='pbp-table')
    rows = pbp_table.find_all('tr')

    all_times = []
    all_downs = []
    all_types = []
    all_yards = []
    all_details = []
    all_aways = []
    all_homes = []

    for row in rows:
        # The rows we care about don't have the th tag
        if row.th:
            continue
        cells = row.find_all('td')
        all_times.append(cells[2].text.strip().encode())
        all_downs.append(cells[3].text.strip().encode())
        all_types.append(cells[4].text.strip().encode())
        all_yards.append(cells[5].text.strip().encode())
        all_details.append(cells[6].text.strip().encode())
        all_aways.append(cells[7].text.strip().encode())
        all_homes.append(cells[8].text.strip().encode())

    header_row = [
        'Time', 'Down', 'Type', 'Yards',
        'Details', away_team, home_team
    ]
    list_of_game_rows = [header_row]

    for t, down, types, yards, details, away, home in zip(
        all_times, all_downs, all_types, all_yards,
        all_details, all_aways, all_homes
    ):
        new_row = []
        new_row.append(t)
        new_row.append(down)
        new_row.append(types)
        new_row.append(yards)
        new_row.append(details)
        new_row.append(away)
        new_row.append(home)
        list_of_game_rows.append(new_row)

    driver.close()

    return list_of_game_rows
BeautifulSoup4 has excellent documentation, but it can be daunting if you are new to the concepts or new to Python itself. Two rules of thumb made it easiest for me to get the hang of things:
  1. The find and find_all functions are your best friends.
  2. Starting from the soup you obtain by feeding in the original HTML, any object returned by find can itself call find again.
For an example of what I'm talking about, look at the lines of code that lead to us getting the table containing the play-by-play data, pbp_table. Below is a screenshot that shows how we arrive at the rows; a toy example of rule 2 follows it.
Image of HTML of play-by-play
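To make rule 2 concrete, here is a tiny self-contained example with made-up HTML (not the CFL page), showing how each object returned by find can itself be searched with find:

from bs4 import BeautifulSoup

html = '''
<div id="outer">
  <div id="inner">
    <table id="scores">
      <tr><td>Play 1</td></tr>
      <tr><td>Play 2</td></tr>
    </table>
  </div>
</div>
'''

soup = BeautifulSoup(html, 'html.parser')
outer = soup.find('div', id='outer')      # searches the whole document
inner = outer.find('div', id='inner')     # searches only within outer
table = inner.find('table', id='scores')  # searches only within inner
rows = table.find_all('tr')
print(len(rows))  # prints 2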

Note: you can consolidate the lines of code that use the above strategy of recursively using find on elements by using XPaths. I strictly use XPaths now after becoming more comfortable with this type of work. Example XPaths that would return a list equivalent to rows are '//div[@id="stat-game-pbp"]/div[@id="stat-game-cat"]/table[@id="pbp-table"]//tr' or the simpler '//table[@id="pbp-table"]//tr'.
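For instance, since we already have a Selenium driver open, the same rows could be pulled in a single call straight from the driver rather than through chained BeautifulSoup find calls. A sketch of that alternative:

# Sketch: grab every play-by-play row with one XPath, directly from Selenium
rows = driver.find_elements_by_xpath('//table[@id="pbp-table"]//tr')
print('Found {0} rows'.format(len(rows)))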
We grab the away and home teams that are playing the specific game. We identify the table that will contain the rows of our play-by-play. From there, we grab all of the data we care about and store the play-by-play as a list of lists, where each inner list is a row of the play-by-play. Finally, we close the Firefox instance that we opened.
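The csv import at the top of the script hints at what happens with the returned rows. Here is a minimal sketch of writing the result to disk; the game URL is the one from earlier, and the output file name is just an example.

import csv

list_of_game_rows = get_game_rows_from_url(
    'http://www.cfl.ca/statistics/statsGame/id/12833'
)

# Write the header row plus one line per play to a CSV file
with open('ottawa_game_1_2015.csv', 'wb') as f:  # on Python 3, use 'w' and newline=''
    writer = csv.writer(f)
    writer.writerows(list_of_game_rows)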

The Result

Image of final play-by-play
To view the final version of the file, visit https://github.com/stevenwu4/CFL/blob/master/Services/pbp_scraper_for_game.py.

A couple of notes on differences you'll see:

Now that we're done implementing our scraper, we need to use it to grab the data. We *could* just manually take every URL for a season and call this script from the command line to get the play-by-play we want. However, it's good practice to automate as much of our workflow as possible. In the next part of this series, I will go over how to flesh out a basic system that uses this scraper to collect the data and store it locally on your machine.
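Since docopt was on the install list, the final script presumably takes the game URL as a command-line argument. Here is a hedged sketch of what such a wrapper could look like; the usage string and argument name are illustrative, not copied from the repository.

"""Scrape CFL play-by-play for a single game.

Usage:
    pbp_scraper_for_game.py <game_url>
"""
from docopt import docopt

if __name__ == '__main__':
    args = docopt(__doc__)  # parses sys.argv against the usage string above
    rows = get_game_rows_from_url(args['<game_url>'])
    print('Scraped {0} rows (including the header)'.format(len(rows)))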

>> Click here to go forward to Part 2 in this series