My first foray into sports statistics was a research project in 3rd year for STAT5703, Carleton's Data Mining course. This was our first chance at picking our own dataset to work with, aside from the canonical cars and housing ones that come bundled with R. When the choices were out for the picking, my partner and I immediately jumped at Brian Burke's NFL play-by-play. We poured hours into creating new variables, sorting the data into a format easier to analyze, and finally creating two models: one for the probability of winning a given game, the other for the number of wins in a given season. It was really satisfying to try my hand at what had always been a black box, and to somewhat succeed.
Amidst this satisfaction, a few things nagged at me. It was really annoying to build the formulas in Excel and do countless rounds of copying and pasting. It would have been nice if the data had been in the format my partner and I preferred to work with (each season's data was in one Excel file; we ended up splitting each into one Excel file per team per season). I also follow the NBA far more than the NFL, but unfortunately I couldn't find an NBA dataset online as nicely complete as Brian Burke's NFL one. This feeling of restriction, of being handcuffed and limited by the data I was working with, was a theme throughout the rest of the class. This is *not* a knock on the class, as there was hardly enough time to cover what to do once you have obtained the data, let alone how to go about obtaining it (it is probably the case that many people interested in the field of data mining do not care one bit about how the data is obtained). However, it's my contention that web scraping and the associated data processing skills are a necessary part of the toolbox. They allow you to pursue your own ideas: create your own hypothesis to prove or disprove, gather any information you find relevant to your investigation, and format it in whatever way you believe will maximize your chances of finding something meaningful.
The aim of this tutorial is to be the tutorial I wish I had stumbled upon when choosing the topic for my aforementioned research project. There are plenty of web scraping 101 tutorials out there; this one comes with an application to sports, with particular attention given to the data processing and analysis tasks that necessarily follow.
The environment going forward with this tutorial will be:
- Python 2.7.5
- Mac OS X
- Google Chrome
where you can find the definitive guide on proper installation of Python here. (If you have a different version of Python, or are on a different operating system, the tweaks necessary to follow this tutorial are minor and can be found through some Google searches.)
If this is new territory for you, ignore the section on virtual environments. Once you have pip
installed, you can run the following commands to install the required libraries to complete this walkthrough.
pip install BeautifulSoup4
pip install selenium
pip install docopt
Before we jump into it, there are two guidelines to be conscious of:
- The site's terms and conditions;
some sites explicitly disallow scraping of their content.
- Every site also has a robots.txt file, which contains rules about which crawlers are allowed/disallowed as well as a crawl delay. Always respect the crawl delay!
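Python's standard library can check these rules for you. Below is a small sketch using urllib.robotparser (on the tutorial's Python 2 the module is the top-level robotparser, which lacks the crawl_delay method added in Python 3.6). It is fed a snapshot of the rules rather than fetching the live file:

```python
from urllib.robotparser import RobotFileParser  # 'robotparser' on Python 2

# A snapshot of the rules from http://www.cfl.ca/robots.txt at the
# time of writing: all crawlers allowed, crawl delay of 2
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 2
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

print(rp.can_fetch('*', 'http://www.cfl.ca/statistics/statsGame/id/12833'))  # True
print(rp.crawl_delay('*'))  # 2
```

In a real crawler you would call rp.set_url(...) and rp.read() to fetch the live robots.txt before checking each URL.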
We want to build a system that can
- gather play-by-play information from the CFL's website
- store it in an intelligible manner
- extract new insights
Specifically, I had a colleague interested in 3rd down plays that don't result in punts (CFL possessions have only 3 downs, compared to the standard 4).
The end result of the full source code can be found here: https://github.com/stevenwu4/CFL
Part 1: Implementation of the Scraper
Investigating the Data
We want a function that, given a URL from the CFL website, will scrape the play-by-play. Let's look at the first game of the 2015 season for the Ottawa Redblacks: http://www.cfl.ca/statistics/statsGame/id/12833.
All a web scraper does is open up a URL's content - much like a human would with a browser like Google Chrome or Mozilla Firefox - and proceed to extract information from the source content. On the web page, right-click and select 'View Page Source'. Here, we run into our first roadblock: the play-by-play doesn't appear to be in the source! Try CMD+F to search for the first few plays of the game:
- "Medlock kicks off"
- "Burris pass to"
- "Burris incomplete pass"
None of these will have any results found.
That's weird; if we click the Play by Play
button, we see the play-by-play data right there. How can it be there on the webpage and yet at the same time not be there? Pay attention to the URL; if it changed upon clicking the button, we could just follow the redirect to the new URL and scrape that URL's content - but it doesn't change.
Writing the Scraper: Dealing with the Play By Play Button
To go ahead with one solution to this problem, we're going to use a package called Selenium. Selenium allows us to use Python to programmatically open up a browser and interact with the different elements of the page. Specifically, it will allow us to click the elusive button that reveals the play-by-play HTML we desire.
BeautifulSoup is a library that will take the resulting HTML after the Play by Play button click and allow us easier, more organized access to the information we actually want.
import time

from bs4 import BeautifulSoup
from selenium import webdriver

PLAYBYPLAY_BTN_XPATH = "//div[@id='playbyplay-button']/a"
The time library is necessary to respect the rules of http://www.cfl.ca/robots.txt. At the time of this writing, all crawlers are allowed, but there is a crawl delay of 2 seconds. The time library has a function, sleep, which lets us delay our crawls accordingly.
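The pattern for respecting the delay when crawling several pages can be sketched like this (fetch here is a hypothetical stand-in for whatever per-page scraping function you call):

```python
import time

CRAWL_DELAY = 2  # seconds, per http://www.cfl.ca/robots.txt

def fetch_all(urls, fetch):
    """Apply fetch(url) to each URL, pausing between requests.

    A minimal sketch of the polite-crawling pattern; 'fetch' is a
    placeholder for your actual scraping function.
    """
    results = []
    for i, url in enumerate(urls):
        if i > 0:
            time.sleep(CRAWL_DELAY)  # respect the crawl delay
        results.append(fetch(url))
    return results
```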
The constant PLAYBYPLAY_BTN_XPATH
is an XPath expression; XPath is a language for precisely identifying elements in an HTML document
. We use it to locate the Play by Play
button. This raises the question of how to derive the XPath. There are two options here:
- Use references (the link above is a good start, but any of the top results for Googling XPath work fine too) to learn the language. Like any language, it will take getting used to but once you are comfortable with it you'll never look back.
- Use auxiliary tools, like Google Chrome's Dev Tools (which can be activated by right-clicking the page and selecting Inspect Element) which allows you a right-click option of 'Copy XPath' in the Elements tab (see image below), or popular browser extensions such as XPath Helper that can directly give you a working XPath to use for any element on the page.
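If you want to experiment with XPaths outside of a browser, the standard library's ElementTree supports a useful subset of the language. Here is a quick, self-contained check of our button XPath against a stripped-down stand-in for the page (note that ElementTree requires a leading '.' on relative paths, unlike the Selenium version of the expression):

```python
import xml.etree.ElementTree as ET

# A stripped-down stand-in for the relevant part of the game page
snippet = """\
<html><body>
  <div id="playbyplay-button"><a href="#">Play by Play</a></div>
</body></html>
"""

root = ET.fromstring(snippet)
# ElementTree's XPath subset is enough to sanity-check our expression
matches = root.findall(".//div[@id='playbyplay-button']/a")
print(matches[0].text)  # Play by Play
```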
Now we can start creating the function.
driver = webdriver.Firefox()
driver.get(url)
playbyplay_btn = driver.find_element_by_xpath(PLAYBYPLAY_BTN_XPATH)
playbyplay_btn.click()
soup = BeautifulSoup(driver.page_source)
Our function get_game_rows_from_url
will receive a url
as input. We use the Selenium webdriver
to open up an instance of Firefox on your computer (note: you need Firefox downloaded
for this to work!). find_element_by_xpath
is a function of the webdriver Firefox instance that returns an element of the HTML based on the XPath you give it (in our case, the XPath to the Play by Play
button that we defined as PLAYBYPLAY_BTN_XPATH
). This returned element has a function click
that clicks our button for us. We load the resulting HTML into BeautifulSoup
to focus on the scraping.
Writing the Scraper: Extracting the Play By Play
Continuing along in get_game_rows_from_url, after making our BeautifulSoup object:
# Get away/home teams
away_div = soup.find('div', id='awayteam')
away_team = away_div.find('h3', class_='cityname').text
home_div = soup.find('div', id='hometeam')
home_team = home_div.find('h3', class_='cityname').text

# Get game rows
pbp_div = soup.find('div', id='stat-game-pbp')
pbp_inner_div = pbp_div.find('div', id='pbp-stats')
pbp_table = pbp_inner_div.find('table', id='pbp-table')
rows = pbp_table.find_all('tr')

all_times = []
all_downs = []
all_types = []
all_yards = []
all_details = []
all_aways = []
all_homes = []
for row in rows:
    # The rows we care about don't have the th tag
    cells = row.find_all('td')
    if len(cells) < 7:
        continue
    # The cells follow the header's order: Time, Down, Type,
    # Yards, Details, then the away and home columns
    all_times.append(cells[0].text)
    all_downs.append(cells[1].text)
    all_types.append(cells[2].text)
    all_yards.append(cells[3].text)
    all_details.append(cells[4].text)
    all_aways.append(cells[5].text)
    all_homes.append(cells[6].text)

header_row = [
    'Time', 'Down', 'Type', 'Yards',
    'Details', away_team, home_team
]
list_of_game_rows = [header_row]
for t, down, types, yards, details, away, home in zip(
    all_times, all_downs, all_types, all_yards,
    all_details, all_aways, all_homes
):
    new_row = [t, down, types, yards, details, away, home]
    list_of_game_rows.append(new_row)
BeautifulSoup4 has excellent documentation, but it can be daunting if you are new to the concepts or new to Python itself. Two rules of thumb made it easiest for me when getting the hang of things: 1) the find
functions are your best friend, and 2) starting from the soup you obtain by feeding in the original HTML, any object returned by find
can itself call find
. For an example of what I'm talking about, look at the 3 lines of code that lead to us getting the table containing the play-by-play data, pbp_table
. Below is a screenshot that shows how we arrive at the rows.
Note: you can consolidate the lines of code that use the above strategy of recursively calling find
on elements by using XPaths. I strictly use XPaths now, after becoming more comfortable with this type of work. Example XPaths that would return a list equivalent to rows
are '//div[@id="stat-game-pbp"]/div[@id="pbp-stats"]/table[@id="pbp-table"]//tr' or the simpler '//table[@id="pbp-table"]//tr'.
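To see the consolidation in action without a browser, here is the same idea run through ElementTree's XPath subset over a simplified stand-in for the page (the real page nests the table inside more markup, and ElementTree needs the leading '.'):

```python
import xml.etree.ElementTree as ET

# A simplified stand-in for the markup around the play-by-play table
page = """\
<div id="stat-game-pbp">
  <div id="pbp-stats">
    <table id="pbp-table">
      <tbody>
        <tr><th>Time</th></tr>
        <tr><td>15:00</td></tr>
        <tr><td>14:42</td></tr>
      </tbody>
    </table>
  </div>
</div>
"""

root = ET.fromstring(page)
# One XPath stands in for the three chained find() calls
rows = root.findall(".//table[@id='pbp-table']//tr")
print(len(rows))  # 3
```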
We grab the away and home teams that are playing the specific game. We identify the table that will contain the rows of our play-by-play. From there, we grab all of the data we care about and store the play-by-play as a list of lists, where each inner list is a row of the play-by-play. Finally, we close the Firefox instance that we opened.
To view the final version of the file, visit https://github.com/stevenwu4/CFL/blob/master/Services/pbp_scraper_for_game.py
A couple of notes on differences you'll see:
- There's an added input parameter save_to_dest to the function get_game_rows_from_url, which optionally saves the HTML content locally to whatever path value the parameter specifies.
- There's a function imported, write_to_csv, which will take the list of lists and convert it to a .csv for storage.
- Using the extremely handy library docopt, the script can be used from the command line. For example, in the /CFL directory, you would type:
PYTHONPATH=. python pbp_scraper_for_game.py
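The imported write_to_csv helper might look roughly like the sketch below; the exact implementation lives in the repo linked above. (On the tutorial's Python 2, the file would be opened with mode 'wb' and no newline argument.)

```python
import csv

def write_to_csv(list_of_game_rows, dest_path):
    """Write the play-by-play (a list of lists) out as a .csv file.

    A minimal sketch of the helper; each inner list becomes one
    row of the output file.
    """
    with open(dest_path, 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerows(list_of_game_rows)
```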
Now that we're done implementing our scraper, we need to use it to grab the data. We *could* just manually take every URL for a season and call this script from the command line for each one. However, it's good practice to automate as much of the workflow as possible. In the next part of this series, I will go over how to flesh out a basic system that uses this scraper to collect the data and store it locally on your machine.
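As a taste of that automation, the per-game URLs all follow the pattern we scraped above, so generating them from a list of game ids is a one-liner (the ids below are just examples; gathering the real list for a season is part of what comes next):

```python
# The stats-page URL pattern seen earlier in this tutorial
BASE_URL = 'http://www.cfl.ca/statistics/statsGame/id/{0}'

def urls_for_game_ids(game_ids):
    """Build the stats-page URL for each game id."""
    return [BASE_URL.format(game_id) for game_id in game_ids]

print(urls_for_game_ids([12833, 12834]))
```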
>> Click here to go forward to Part 2 in this series