Anyone who has visited Ken Pomeroy’s site kenpom.com or read Dean Oliver’s “Basketball on Paper” is familiar with the concept of efficiencies.  Put simply, an efficiency is the number of points scored per possession.  College basketball is a sport that can effectively be modeled using a team’s expected offensive efficiency (points scored per possession) compared against an opponent’s expected defensive efficiency (points allowed per possession).  Combine that with the number of possessions per team, and you can predict the final score.

home score = (home team’s offensive efficiency vs. visitor’s defensive efficiency) * # of possessions
visitor score = (visitor’s offensive efficiency vs. home team’s defensive efficiency) * # of possessions

A couple of clarifications to the above formulas are already needed.  We say “vs” between an offensive and a defensive efficiency, and there are different ways to calculate this.  If a team allows 0.8 points per possession and its opponent scores 1.05 points per possession, what do we expect to happen when they play?  The first thought might be to average the two numbers.  This would be incorrect, though.  A defensive efficiency of 0.8 points allowed per possession would likely be the best in the league, and it was presumably earned against a variety of opponents whose combined offensive efficiency is around the league average (~1.02 points per possession, varies year to year).  Against an offense that scores only slightly better than average, we would not expect anywhere near the 1.05 efficiency that offense usually produces; we would probably expect somewhere between 0.8 and 0.85 points per possession.  We can bring the league average into the equation (simplified for now by assuming a team plays both good and bad opponents that average out over time; we will get more accurate later on).  Add the offense’s expected efficiency to the opponent’s defensive efficiency, then subtract the league average.  In this case the expected efficiency against this defense would be 0.8 + 1.05 (the opponent’s average offensive efficiency) – 1.02 (the league average), which equals 0.83.  We would expect this opponent, who scores slightly better than most teams in the league, to similarly score slightly more than a typical opponent does against team A.  Similar calculations can be made for offensive efficiency.
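The additive matchup calculation above can be sketched in a few lines of Python.  This is only an illustration of the arithmetic in this post, not the program used for the experiment; the function names and the 68-possession game are assumptions made up for the example.

```python
def matchup_efficiency(offense_ppp, defense_ppp, league_avg_ppp):
    """Expected points per possession when this offense meets this defense.

    Additive model from the text: offense + defense - league average.
    """
    return offense_ppp + defense_ppp - league_avg_ppp


def predicted_score(offense_ppp, defense_ppp, league_avg_ppp, possessions):
    """Expected points = matchup efficiency * number of possessions."""
    return matchup_efficiency(offense_ppp, defense_ppp, league_avg_ppp) * possessions


# Numbers from the text: a 1.05 offense against a 0.80 defense, league average 1.02.
print(round(matchup_efficiency(1.05, 0.80, 1.02), 2))  # 0.83

# Hypothetical 68-possession game (the possession count is an assumption).
print(round(predicted_score(1.05, 0.80, 1.02, possessions=68), 1))  # 56.4
```

Running the same function with the roles swapped gives the other team’s expected score.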

The other clarification is finding the exact number of possessions.  This is not a stat provided in a typical box score, but it can be estimated by counting the plays that end a possession: made shots (including trips to the free throw line), defensive rebounds, and turnovers.

Another flaw with what we have proposed is that teams play opponents of different strengths over the course of the season.  Some teams will have a much tougher strength of schedule than others, which would tend to lead to worse raw efficiencies.  Fortunately we have ways of accommodating this.  We can look at the quality of the opponents a team has faced and adjust our predicted efficiencies accordingly.  I won’t go into too much detail on how this is done here.  The short of it is that if we want to calculate team A’s adjusted offensive efficiency, we need to look at the defense of every team that team A has already played and compare it with the national average.  If they have played a tougher than average schedule, we would give them a bump in adjusted offensive efficiency; otherwise we would reduce expectations.  See http://kenpom.com/blog/ratings-methodology-update/ for a full description.

So now that we have an idea of what adjusted efficiencies are, I wanted to explore whether they could be used to beat the spread and/or totals bet in college basketball.  We learned in an earlier post that most college basketball lines are set extremely close to what these adjusted efficiencies predict.  However, maybe there is a large enough discrepancy to exploit.  So I designed an experiment to find out.

For this experiment I am using college basketball data collected from the 2003/2004 through the 2014/2015 seasons.  In order to calculate adjusted efficiencies I need a decent sampling of game data each season.  For that reason I am only considering games from January through the end of each season.  I don’t include any preseason rankings or other prediction-based approaches; I want this to be fueled by real data that resets each season.  So I exclude the first two months from my simulated bets and use them only as data for calculating adjusted efficiencies.

I would have liked to compare with the adjusted efficiencies directly from kenpom.com.  However, the data presented on that site is constantly changing as the season progresses.  There is no way to go back and view the adjusted efficiencies at a specific point in time.  I want a purely predictive model, so I needed a new approach.  To solve this problem I decided to calculate my own adjusted efficiencies in a way as similar to Ken Pomeroy’s as I can.  For this I calculated the raw offensive and defensive efficiencies, along with the estimated possessions for each game, calculated by:

(Field goal attempts – offensive rebounds) + turnovers + (0.475 * free throw attempts)
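As a minimal sketch, the possession estimate above and the raw efficiency that follows from it could be written like this in Python.  The function names and the box score numbers are hypothetical, chosen only to show the arithmetic.

```python
def estimate_possessions(fga, oreb, turnovers, fta, ft_weight=0.475):
    """(FGA - offensive rebounds) + turnovers + 0.475 * FTA, as in the text."""
    return (fga - oreb) + turnovers + ft_weight * fta


def points_per_possession(points, possessions):
    """Raw efficiency: points divided by estimated possessions."""
    return points / possessions


# Hypothetical box score line for one team in one game.
possessions = estimate_possessions(fga=58, oreb=10, turnovers=12, fta=20)
print(possessions)  # 69.5
print(round(points_per_possession(73, possessions), 2))  # 1.05
```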

I then adjusted for competition as explained above.  For each game, I looked at team A and every team B it had played earlier that season, calculated team A’s average offensive efficiency, and adjusted it by comparing each team B’s defensive efficiency up to that point in the season against the national average.  So if team A averaged 1.05 points per possession (more than the league average), but their opponents’ adjusted defensive efficiency also allowed 1.05 points per possession (also more than the league average), I would adjust team A’s expected offensive efficiency to be the league average (1.02).  Similar calculations were made for the adjusted defensive efficiency.  Note that only games against Division I opponents were included in these calculations.
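To make the adjustment concrete, here is a simplified single-pass sketch in Python.  It assumes a plain additive shift by the opponents’ average defensive efficiency, which reproduces the 1.05 to 1.02 example above; the real calculation works game by game and also looks at each opponent’s own schedule, so treat the function name and inputs as illustrative assumptions.

```python
def adjusted_offense(raw_offense_ppp, opponents_defense_ppp, league_avg_ppp):
    """Shift a raw offensive efficiency by the quality of the defenses faced.

    Assumption: a simple additive adjustment. If the defenses faced allow
    more than the league average, the raw number is inflated and gets
    reduced; if they allow less, it gets a bump.
    """
    avg_opp_defense = sum(opponents_defense_ppp) / len(opponents_defense_ppp)
    return raw_offense_ppp - (avg_opp_defense - league_avg_ppp)


# Team A scores 1.05 per possession against defenses that allow 1.05 on average.
opponent_defenses = [1.03, 1.07, 1.05]  # hypothetical opponents
print(round(adjusted_offense(1.05, opponent_defenses, 1.02), 2))  # 1.02
```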

Before analyzing any results, I cross-checked my numbers against some of the late-season games each season, as these games should be where my model is closest to kenpom.com’s predictions.  They were not exact matches, as his model likely weights other factors, such as favoring recent games and possibly considering the site of each game played.  I have yet to find the exact formula for his calculations, but the values I cross-checked were reasonably close.  On average, each adjusted efficiency was within 2% of his model’s value: not exact, but close enough for now.

Nerd Speak: For this experiment I wrote a C# program to create my model.  I load the raw data from CSV files and store it in a SQL database.  For each game, I query the database for the home and visiting teams and load every Division I game played up until that point in the season.  I then calculate the adjusted efficiencies, looking not only at every game the home and visiting teams played, but also at each game all of their opponents played, in order to determine proper weights for my adjusted efficiency model.  For each game I output a predicted score for both teams into an Excel spreadsheet.  I use some simple functions in Excel to evaluate how the model did, and visually cross-check that my results seem realistic.

I used 21,584 games for my analysis.  While there were more applicable games in the January-April time frame for these college basketball seasons, I could only evaluate games I could find betting lines for.  I had purchased a historical data set, which was mostly complete but had some holes.  My first approach included every game in this data set, evaluating against the closing spread and closing total line for each game.  Here were the results:

Spread bets:
Wins: 10617  Losses: 10523  Win %: 0.492

Total bets:
Wins: 10523  Losses: 11061  Win %: 0.488

Unfortunately, these results did not show any advantage.  My next step was to try to figure out why.  Perhaps it is because I am betting on every game, regardless of how large the difference is between my prediction and the spread.  To test this hypothesis, I decided to only consider games where my predicted score differed from the Vegas spread by 5 points or more, and by 8 or more for the totals bet.  Let’s look at the results.
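The filtering and grading step described above might look something like the following Python sketch.  The sign conventions, the function name, and the sample games are all assumptions for illustration; they are not taken from the C# program or the purchased data set.

```python
from collections import Counter


def grade_spread_bet(predicted_margin, actual_margin, home_line, min_edge=5.0):
    """Return 'win', 'loss', 'push', or None when the edge is too small to bet.

    Margins are home score minus visitor score. home_line is the spread from
    the home team's side (-8 means the home team is favored by 8).
    """
    # How far the model's margin sits from the spread; positive favors the home side.
    edge = predicted_margin + home_line
    if abs(edge) < min_edge:
        return None  # prediction too close to the spread, no bet
    cover = actual_margin + home_line  # > 0 means the home team covered
    if cover == 0:
        return "push"
    bet_home = edge > 0
    return "win" if (cover > 0) == bet_home else "loss"


# Hypothetical games: (predicted margin, actual margin, home line).
games = [(10, 7, -3), (10, 1, -3), (4, 7, -3), (-4, 3, -3)]
results = [grade_spread_bet(p, a, line) for p, a, line in games]
print(Counter(r for r in results if r is not None))
```

A totals bet can be graded the same way by comparing the predicted and actual combined scores against the total line, with the threshold raised to 8.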

Spread bets:
Wins: 1149  Losses: 1187  Win %: 49.18

Total bets:
Wins: 1337  Losses: 1450 Win %: 47.97

Again, not the results I was secretly hoping for.  There seems to be no advantage in using adjusted efficiencies the way I have to predict college basketball spreads or totals.  However, it did give me some evidence that my model was fairly accurate at predicting Vegas spreads: in 89.2% of the games, my predicted score differential was within 5 points of the spread.  Considering my model does not account for injuries, other day-to-day lineup adjustments, or any perceived “hot streaks” that may influence the line one way or another, I would say it is a fairly good prediction model, but one that is better at predicting Vegas spreads than beating them.

In this experiment we showed there is no easy button for beating college basketball spreads.  We can’t simply plug in kenpom efficiencies and hope to go break the bookies in Vegas.  However, this won’t be the last we see of efficiencies.  We can break them down into the four factors and look at how a team’s effective field goal percentage, offensive rebounding, turnovers, and free-throw rate match up against its opponent’s.  We will also tap into some machine learning approaches to try to dig deeper into understanding how to beat the college basketball spread.  More to come.

# Setting the lines

I have spent the last 7 Marches in Vegas watching basketball with sides of booze and betting.  There are new teams every year, with a lot of the same powerhouses making their annual cameos.  One constant I can’t avoid hearing year after year is how well the lines are set.  I hear about how great a job the bookmakers do, and what inside knowledge they must have, whenever a game lands within a point or two of the spread.

With 48 games the opening weekend alone (not counting play-in games), some of them are bound to finish close.  One of my first realizations was that these lines are not magic numbers pulled out of a hat at some soon-to-be-demolished casino north of the Strip, but a mixture of simple math and ripping off one man’s work.

In the regular season there are around 350 D-1 teams, and it would take a small army to watch all or even most of the games played.  I can assure you that nobody is doing this.  Lines aren’t created from expert analysis by someone who has watched hundreds of games, but rather from some simple math using the two teams’ expected efficiencies, adjusted for home or away and any possible injuries.  Fortunately for the bookmakers, they don’t even have to do the simple math, as one man does the dirty work for them.  Let’s take a look at a couple of examples.  We will use games played tomorrow to minimize biases.

The predictions are provided by kenpom.com.  A subscription is required for full access to the site, but let’s take a look at the predicted scores for these three games.

Butler at home is predicted to win 81-73, which translates to a -8 predicted spread.
Xavier at home is predicted to win 79-77, which translates to a -2 predicted spread.
St. John’s at home is predicted to win 80-71, which translates to a -9 predicted spread.
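Turning a predicted score into a predicted spread and total is simple arithmetic, sketched here in Python with the three predictions above.  The function names are made up for the example, and the spread is written from the home team’s side.

```python
def predicted_home_spread(home_points, visitor_points):
    """Spread from the home team's side: negative when the home team is favored."""
    return -(home_points - visitor_points)


def predicted_total(home_points, visitor_points):
    """Combined predicted points, the number a totals line is compared against."""
    return home_points + visitor_points


# Predicted scores quoted above (home team first).
predictions = {"Butler": (81, 73), "Xavier": (79, 77), "St. John's": (80, 71)}
for team, (home, visitor) in predictions.items():
    spread = predicted_home_spread(home, visitor)
    total = predicted_total(home, visitor)
    print(f"{team}: spread {spread:+d}, total {total}")
```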

Do you see where I am going with this?  In the first three examples I could find, we can predict what the spread will be within a 1 point margin.  In the past I have done analysis to determine the difference between the spread and kenpom’s predictions, and it averages slightly less than 2 points.  Game point total predictions can be made in a similar manner with equally convincing data.  These numbers aren’t being conceived from thin air.  They are simply two calculations: the expected pace (based on season-long averages for each team and their opponents) times the expected offensive efficiency of team A vs the expected defensive efficiency of team B, and the expected pace times the expected offensive efficiency of team B vs the expected defensive efficiency of team A.  An expected efficiency is just the number of points scored or allowed on a per-possession basis.  For a full explanation of how these numbers are generated, please read kenpom’s site or Dean Oliver’s book, as this work is based on those concepts.

The point of this article is not to get bogged down in the exact math behind these predictions; we will elaborate on that in the future.  The point is to understand that these spreads are predictable with great accuracy, and we will reference these predictions as a baseline for developing algorithms to attempt to do better at predicting basketball.  More to come.