This guide covers where to find sports data, especially free sports data. Most of it requires coding, so I’m working on a data scraping guide too. To skip all this work, sign up at Sports Data Direct and let them get the data for you.
MLB Data
MLB has the most freely available data of any sport. That’s because baseball captured the fascination of statisticians like Bill James, Pete Palmer, and Dick Cramer, who went on to create the field known as sabermetrics. [1]
Stats
Per Sports Reference, stats are facts, and facts cannot be copyrighted, so stats are generally fine to scrape. [2]
- Lahman’s Baseball Database
- Contains season-level aggregates on just about everything you could want
- Updated daily on GitHub at chadwickbureau/baseballdatabank
- View the readme
- Baseball Reference
- Best source, including game-by-game information, but requires scraping
- Data coverage
- Scraping limit: wait ~3 seconds between requests
- Retrosheet
- Contains game-by-game data, park statistics, and even play-by-play!
- Data formatting is a bit messy, but coverage is very comprehensive
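The ~3 second spacing mentioned for Baseball Reference can be enforced in code. Below is a minimal sketch using only the Python standard library. The class name `PoliteFetcher`, the `User-Agent` string, and the default timeout are assumptions made for illustration; check each site's current terms before scraping.

```python
import time
import urllib.request


class PoliteFetcher:
    """Fetch pages while keeping a minimum delay between requests."""

    def __init__(self, min_delay=3.0, clock=time.monotonic, sleep=time.sleep):
        self.min_delay = min_delay      # seconds to leave between requests
        self._clock = clock             # injectable so the logic can be tested
        self._sleep = sleep
        self._last_request = None       # time of the previous request, if any

    def wait(self):
        """Sleep just long enough to honor min_delay; return seconds slept."""
        slept = 0.0
        if self._last_request is not None:
            remaining = self.min_delay - (self._clock() - self._last_request)
            if remaining > 0:
                self._sleep(remaining)
                slept = remaining
        self._last_request = self._clock()
        return slept

    def fetch(self, url, timeout=30):
        """Download one page as text, waiting first if needed."""
        self.wait()
        request = urllib.request.Request(
            url, headers={"User-Agent": "sports-data-guide-example"}  # hypothetical
        )
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.read().decode("utf-8", errors="replace")
```

Create one `PoliteFetcher()` per site and call `fetch(url)` in a loop; the first call goes out immediately and each later call waits for whatever is left of the three seconds.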
Salary Information
- Doug Pappas, SABR Business of Baseball Committee
- Contains a variety of extracts, but generally only through 2004
- Baseball Prospectus
- More current salary info
- Baseball Reference’s discussion of the salary sources linked above
MLB Models
Murky waters: the legality of scraping published models is questionable.
- FiveThirtyEight’s Elo Rankings
- Updated daily
- Recreate the model by scraping The Complete History of MLB and reading their articles
- Sabermetrics course
- Gives an overview of baseball metrics such as Value Over Replacement Player (VORP), among others
NFL Data
Stats
- Pro Football Reference
- Most comprehensive source.
- Contains just about everything, including combines, postseason, salaries, weather, Vegas lines, and play-by-play
- Data coverage
- Football DB
- Comprehensive stats in an easy-to-scrape format
- A bit more limited compared to Pro Football Reference
NFL Models
- Fantasy Football Analytics
- By far the best approach to point projections I’ve seen
- Weekly updated statistics containing an aggregated model of different site projections
- R scraping scripts
- FiveThirtyEight’s NFL Elo Rankings
- Updated daily, or you can recreate the algorithm yourself (how-to guide with scraping coming soon)
- Scrape their Complete History of the NFL to test your implementation
- Historical performance against the spread — <~52% article coming soon>
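For anyone recreating an Elo-style ranking, the textbook update is short enough to sketch here. This is the standard Elo formula only: the K-factor of 20 and the 1500 starting rating in the usage note are assumptions, and any adjustments FiveThirtyEight describes in their articles are not reproduced.

```python
def expected_score(rating_a, rating_b):
    """Probability that team A beats team B under the standard Elo curve."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a, rating_b, score_a, k=20.0):
    """Return both teams' new ratings after one game.

    score_a is 1.0 if A won, 0.0 if A lost, and 0.5 for a tie.
    k (assumed value) controls how quickly ratings move.
    """
    delta = k * (score_a - expected_score(rating_a, rating_b))
    # Elo is zero-sum: whatever A gains, B loses.
    return rating_a + delta, rating_b - delta
```

Start every team at the same rating (1500 is a common choice), walk through the scraped game history in date order, and call `update_elo` once per game. Comparing the resulting ratings with the published ones is a quick way to test an implementation.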
NBA Data
Stats
- Basketball Reference
- The most comprehensive source of basketball stats
NBA Models
- FiveThirtyEight’s Player Projections — CARMELO
- The most involved published model I’ve seen
- A guide on how to scrape this is coming soon
- FiveThirtyEight’s NBA Elo Rankings
- They published the data and a rough description of the model
- View my recreation of their algorithm with past performance results so you aren’t dependent on their site
NHL Data
Stats
- Hockey Reference
- The most comprehensive source of hockey stats
NHL Models
- Glicko Rating applied to the NHL
- An alternative team strength model inspired by Elo
- Contains a codebase in R
General
Legality
Per Sports-Reference.com, “copyright law is clear that facts cannot be copyrighted, so you are free to reuse facts on this site in accordance with copyright laws.”
News Outlets
The main use of news outlet sites should be for injuries or projections. Besides that they aren’t great.
- CBS Sports
- Great in-season coverage
- Spotty data quality on trades and college
- Fox Sports
- ESPN
- Yahoo Sports
Fantasy Football
Finding historical data on fantasy football is tough.
- Yahoo offers a lot of their data for free through the Yahoo Developer Network (YDN); the code I used to analyze it is here
- This is a good source of historical fantasy football projections; change the year in the URL to see different years
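When a site encodes the season in its URL, as described above, the list of pages to fetch can be generated instead of edited by hand. A small sketch follows; the `example.com` template is a hypothetical placeholder, not the real address of any source listed here.

```python
def season_urls(template, first_year, last_year):
    """Map each season to its page URL by filling {year} into the template."""
    return {
        year: template.format(year=year)
        for year in range(first_year, last_year + 1)  # inclusive of last_year
    }


# Hypothetical template; substitute the real URL pattern of the site you use.
EXAMPLE_TEMPLATE = "https://example.com/fantasy/projections/{year}"
```

Calling `season_urls(EXAMPLE_TEMPLATE, 2012, 2016)` gives one URL per season, ready to hand to whatever fetching code you already have.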
Daily Fantasy Data
Probably legal to scrape this information.
- Rotogrinders
- Daily updated salary, odds, and starting lineup information
- Much easier than manually downloading the CSV from DraftKings or FanDuel every day
- Lineup generator
- Brainy DFS
- Historical ownership on FanDuel; the rest of the links look broken at this time
- Responds to data requests if you want more than 2 weeks of history
- Rotoworld
- Comprehensive source of starting lineups, latest news, and historical news
- Think of this like a Sports Reference site tailored to Daily Fantasy Sports
- Rotoguru
- Historical salary information
- Inconsistent years and may now be defunct
Daily Fantasy Sites
- DraftKings
- Data is behind a login wall (see How to scrape DraftKings)
- Contains every lineup entered in every contest
- FanDuel
- Data is behind a paywall
- More information coming soon
Odds
Probably legal to scrape (similar class to daily fantasy)
- Covers
- Historical spreads, moneylines, over/unders dating back 5 years
- Contains stats, but it’s kind of a pain to scrape anything besides hits, errors, and final scores
- I would just use this for odds data and final scores
- Oddshark
- Great source of daily odds across bookies
- Historical odds are tricky to get
- Sports Reference
- Most of the Sports Reference sites contain Vegas lines
- Does not break the lines out by sportsbook; you get an aggregated number, like the default Covers odds
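Once moneylines are scraped, they usually need converting to probabilities before they are useful in a model. The sketch below uses the standard American-odds conversion; the function names are my own, and `remove_vig` uses simple proportional normalization, which is one common choice among several.

```python
def implied_probability(moneyline):
    """Convert an American moneyline (e.g. -110 or +150) to an implied probability."""
    if moneyline == 0:
        raise ValueError("American odds cannot be 0")
    if moneyline < 0:
        # Favorite: risk |moneyline| to win 100.
        return -moneyline / (-moneyline + 100.0)
    # Underdog: risk 100 to win moneyline.
    return 100.0 / (moneyline + 100.0)


def remove_vig(prob_a, prob_b):
    """Rescale two implied probabilities so they sum to 1 (proportional method)."""
    total = prob_a + prob_b
    return prob_a / total, prob_b / total
```

For example, `implied_probability(-110)` is about 0.524, which is the break-even win rate at that price. The two sides of a market sum to more than 1 because of the bookmaker's margin, which is what `remove_vig` strips out.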
Weather
- Sports Reference
- Most of the Sports Reference sites contain game weather data
- National Oceanic and Atmospheric Administration
- Great historical data
- Hard to automate
- Weather Underground
- Free up to 500 calls per day and 10 per minute
- MLB Weather example
- Code in Python and R showing weather scraping with analysis
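A free tier capped at 500 calls per day and 10 per minute is easy to blow through in a loop, so it helps to track the budget on the client side. Here is a minimal sliding-window sketch; the class name `CallBudget` is an assumption, and the counts are kept in memory only, so they reset when the script restarts.

```python
import time
from collections import deque

DAY_SECONDS = 86400.0
MINUTE_SECONDS = 60.0


class CallBudget:
    """Track API calls against per-minute and per-day limits."""

    def __init__(self, per_minute=10, per_day=500, clock=time.monotonic):
        self.limits = [(MINUTE_SECONDS, per_minute), (DAY_SECONDS, per_day)]
        self._clock = clock          # injectable so the logic can be tested
        self._calls = deque()        # timestamps of calls in the last day

    def try_acquire(self):
        """Return True and record a call if both limits allow it, else False."""
        now = self._clock()
        # Forget calls that are more than a day old.
        while self._calls and now - self._calls[0] >= DAY_SECONDS:
            self._calls.popleft()
        for window, limit in self.limits:
            recent = sum(1 for t in self._calls if now - t < window)
            if recent >= limit:
                return False
        self._calls.append(now)
        return True
```

Call `try_acquire()` before each request and sleep or stop when it returns `False`. This only guards your own script; the service still counts calls on its side.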
Paid Services
Daily Fantasy Sports
- Fantasy Labs
- Run by Jonathan Bales (famous DFS player)
- Contains lineup builder, trend analyzer, player ownership, Vegas implied ownership, and other stats
- RotoQL
- Run by Saahil Sud (famous DFS player)
- More focus on expected value than ownership.
General Sports Data
- Sports Data Direct [3]
- Focused more on conventional bets than DFS for now
- Initial release features NFL data
- Cheap subscription
- Not ready yet
- STATS
- By far the biggest name
- Real time data and proven quality
- Costly packages, reportedly four figures or more per season, and you have to talk to a sales rep
- Sports Reference
- Comprehensive stats
- Offers data pulls of their sites with a minimum price of $1,000
If I missed something, feel free to email me or reply in the comments so I can add it.
References
[1] http://sabr.org/sabermetrics
[2] https://www.sports-reference.com/data_use.html
[3] Affiliated with this blog