Anatomy of the March Madness Upset Part 3 – 2017 Picks

For those who have been following along, we have been analyzing the past 15 years of March Madness upsets to figure out which teams are the best candidates to bust brackets. Catch up on part 1 here and part 2 here. As a quick recap, we saw some common patterns across college basketball’s historical underdogs. In particular, most of these teams excelled in at least one of the following areas: 2 point percentage, 3 point percentage, defensive turnover percentage, offensive rebounding percentage, or adjusted defensive efficiency (points per possession allowed, adjusted for competition). Why did we go through all of that work? To better select candidates for 2017’s tournament, which is quickly approaching.

Let’s get right to it with my March Madness analysis.

| Lower Seed | Higher Seed | Seeds | AdjEM Diff | 2Pt% | 3Pt% | Def TO% | Off Reb% | AdjDefEff | Synopsis | Money Line |
|---|---|---|---|---|---|---|---|---|---|---|
| Troy | Duke | 15-2 | 22.42 | 50.8 | 35.5 | 17.7 | 30.6 | 107.3 | | 3000 |
| North Dakota | Arizona | 15-2 | 23.06 | 51.6 | 28.4 | 20 | 26.2 | 103.2 | | 1427 |
| Jacksonville St. | Louisville | 15-2 | 26.14 | 50.9 | 36.7 | 16.8 | 31.5 | 104.8 | | 3600 |
| Northern Kentucky | Kentucky | 15-2 | 25.93 | 51.5 | 34.2 | 20.1 | 36.3 | 109.5 | | 3310 |
| New Mexico St. | Baylor | 14-3 | 17.79 | 54.5 | 33.4 | 19.2 | 36 | 102.7 | candidate | 616 |
| Florida Gulf Coast | Florida St. | 14-3 | 17.73 | 55.5 | 34 | 17.3 | 34.5 | 104.5 | candidate | 559 |
| Iona | Oregon | 14-3 | 19.83 | 49.3 | 39.8 | 18.1 | 27 | 106.2 | candidate | 1050 |
| Kent St. | UCLA | 14-3 | 21.22 | 48.6 | 31.4 | 19.3 | 38.1 | 102.7 | | 1750 |
| East Tennessee State | Florida | 13-4 | 14.9 | 55.4 | 38.2 | 22 | 30 | 96.5 | candidate | 490 |
| Bucknell | West Virginia | 13-4 | 18.08 | 54.6 | 37.7 | 19.8 | 26.8 | 100.3 | candidate | 1000 |
| Vermont | Purdue | 13-4 | 12.34 | 55.6 | 36.7 | 19.7 | 30.1 | 98.6 | candidate | 358 |
| Winthrop | Butler | 13-4 | 16.28 | 51.2 | 38 | 18.7 | 27 | 101.8 | candidate | 500 |
| UNC Wilmington | Virginia | 12-5 | 14.28 | 56.1 | 36 | 20.4 | 32.5 | 105.4 | candidate | 353 |
| Princeton | Notre Dame | 12-5 | 7.51 | 50.4 | 38.1 | 20.6 | 24.2 | 96.9 | candidate | 253 |
| Nevada | Iowa St. | 12-5 | 9.77 | 49.3 | 38.5 | 15.9 | 30.6 | 101.2 | candidate | 236 |
| Middle Tennessee | Minnesota | 12-5 | 1.94 | 53.8 | 37 | 20.2 | 30.2 | 97 | | -115 |
| Providence/USC | SMU | 11-6 | | | | | | | | |
| Xavier | Maryland | 11-6 | -0.5 | 52 | 34 | 17.8 | 35.2 | 99.5 | | 113 |
| Rhode Island | Creighton | 11-6 | 3.97 | 50.5 | 34 | 19.5 | 33.1 | 95 | | -102 |
| Kansas St./Wake Forest | Cincinnati | 11-6 | | | | | | | | |

This year’s crop of 11-15 seeds is particularly interesting to me. Let’s break it down. First off, the 15 seeds: I am not in love with any of these potential upsets this year. All four have below-average defenses and don’t stand out in any offensive category. I will be skipping bets on these match-ups this year.

Let’s talk 11-6 match-ups. We won’t know two of these until the play-in games. I like both USC and Wake Forest to take care of business and secure spots in the tourney, but I skipped my analysis for these games because they really aren’t upsets: these are bigger-name schools that see a similar quality of opponent all year long.

The classic 12-5 is the first place people typically look for “upsets”. My usual quarrel with 11-6 games not being true upsets has crept into the 12-5 match-ups this year, particularly Middle Tennessee State, which Vegas considers a pick’em against Minnesota. The other three match-ups are your more traditional 12-5s, where we typically see an upset 25% of the time. That being said, I like all of the 12s this year as potential upset candidates. If Princeton can string together a run of threes they have a shot, and the Ivy League is well represented in our upset list. UNC Wilmington shoots the ball extremely well, which they will need to do to upset Virginia, arguably the best defensive team around. Like Princeton, Nevada becomes a real upset threat if it gets hot.
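
To put the 25% historical upset rate next to the Money Line column, here is a minimal Python sketch that converts a money line into the break-even win probability it implies. It assumes the Money Line values in the table are American odds on the lower seed, which the post does not state explicitly, and it ignores the bookmaker's margin. The function name `implied_probability` is made up for this illustration.

```python
def implied_probability(moneyline: int) -> float:
    """Break-even win probability implied by an American money line.

    Assumption: positive lines pay `moneyline` on a 100 stake,
    negative lines require a stake of `-moneyline` to win 100.
    """
    if moneyline == 0:
        raise ValueError("a money line of 0 is not meaningful")
    if moneyline > 0:
        return 100 / (moneyline + 100)
    return -moneyline / (-moneyline + 100)


# Money lines for the 12 seeds, copied from the table above.
twelve_seeds = {
    "UNC Wilmington": 353,
    "Princeton": 253,
    "Nevada": 236,
    "Middle Tennessee": -115,
}

implied = {team: implied_probability(ml) for team, ml in twelve_seeds.items()}
for team, probability in implied.items():
    print(f"{team}: {probability:.1%}")
```

Under these assumptions, a team priced well below the historical 12-5 upset rate (UNC Wilmington at roughly 22%) is the kind of line a value bettor would look at first, though the historical rate is an average and says nothing about a specific match-up.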

The real value bets this year lie in the 13-4 and 14-3 games. These teams get to avoid the Arizonas, Dukes, and Louisvilles, who could legitimately vie for a one seed. As a group they are also really good shooters, and shooting is probably the most important factor in determining the outcome of a college basketball game, especially an upset. New Mexico State, Florida Gulf Coast, East Tennessee State, Bucknell and Vermont all shoot better than 54% on 2 point shots. Iona and Winthrop are equally deadly from 3. The gems of this class are Bucknell and my two favorites, Vermont and East Tennessee State. All three shoot above 54% from 2 and 36% from 3, so if they get hot, look out. They also play a little better defense than some of their peers with similar seeds. Rounding out the 14 seeds, New Mexico State, FGCU, and Iona are all efficient shooters as well. Kent State rebounds well but doesn’t shoot at an elite level, and is likely going to be outmatched by UCLA, the nation’s top offense.
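
The shooting screen described above can be sketched in a few lines of Python. The percentages are copied from the table; the `Team` type, the cutoff constants, and the function name are assumptions introduced only for this illustration, and the 54% / 36% cutoffs are simply the ones quoted in the paragraph, not a tuned model.

```python
from typing import NamedTuple


class Team(NamedTuple):
    name: str
    seed: int
    two_pt_pct: float
    three_pt_pct: float


# 13 and 14 seeds, with shooting percentages copied from the table above.
teams = [
    Team("New Mexico St.", 14, 54.5, 33.4),
    Team("Florida Gulf Coast", 14, 55.5, 34.0),
    Team("Iona", 14, 49.3, 39.8),
    Team("Kent St.", 14, 48.6, 31.4),
    Team("East Tennessee State", 13, 55.4, 38.2),
    Team("Bucknell", 13, 54.6, 37.7),
    Team("Vermont", 13, 55.6, 36.7),
    Team("Winthrop", 13, 51.2, 38.0),
]

TWO_PT_CUTOFF = 54.0    # percent, from the text
THREE_PT_CUTOFF = 36.0  # percent, from the text


def shoots_well_inside_and_out(team: Team) -> bool:
    """True when a team clears both shooting cutoffs."""
    return team.two_pt_pct > TWO_PT_CUTOFF and team.three_pt_pct > THREE_PT_CUTOFF


gems = sorted(t.name for t in teams if shoots_well_inside_and_out(t))
print(gems)
```

Only Bucknell, East Tennessee State, and Vermont clear both cutoffs, which matches the three teams singled out above.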

What we didn’t see in this year’s class of potential upsets are any teams that excel at creating turnovers or play elite defense. However, we were graced with excellent shooters and some decent rebounding teams, and I would take field goal efficiency every day. There you have it, my 2017 March Madness upset candidates. Note I say candidates: many of these teams will go on to be blown out, but most will compete, and a couple are going to prevail. Hopefully we have narrowed our upset contenders correctly. Focus on the 13 and 14 seeds this year, particularly Vermont, East Tennessee State, and Bucknell if they can handle the pressure (see what I did there?).

Anatomy of the March Madness Upset Part 2

Following up on my last post, I wanted to evaluate what common trends we could find in the March Madness upsets of the last 15 years. Last time we looked at the difference between efficiencies, but this time I want to break it down further and look at some individual statistics behind these upsets. Some interesting patterns arise, but it remains clear that different play styles are involved and that there are multiple ways in which these upsets happen. Let’s get right to the data behind these college basketball upsets.

Year Lower Seed Higher Seed Score Seed AdjEMDiff 2Pt% 3Pt% Def TO% Off Reb% AdjDefEff
2016 Middle Tennessee Michigan State 90–81 15-2 -25.68 47.80 39.2 19.4 28.70 98.80
2016 Stephen F. Austin West Virginia 70–56 14-3 -11.38 54.50 36.9 25.9 33.50 95.50
2016 Hawaii California 77–66 13-4 -7.90 54.10 32.2 19.8 30.40 93.10
2016 Arkansas-Little Rock Purdue 85–83 (2 OT) 12-5 -12.18 48.90 38.2 21 26.70 95.00
2016 Yale Baylor 79–75 12-5 -6.07 51.40 36 17.8 39.30 94.90
2016 Northern Iowa Texas 75–72 11-6 -6.97 51.20 37.2 18.5 17.70 97.30
2016 Gonzaga Seton Hall 68–52 11-6 1.61 54.30 37.8 15.1 32.10 94.40
2016 Wichita State Arizona 65–55 11-6 -0.68 49.20 32.3 23.2 31.60 87.60
2015 Georgia State Baylor 57–56 14-3 -13.56 53.40 32.2 23.20 30.00 99.10
2015 UAB Iowa State 60–59 14-3 -20.17 46.60 33.2 19.70 34.20 99.80
2015 Dayton Providence 66–53 11-6 -3.43 52.60 35.60 21.10 23.30 93.00
2015 UCLA SMU 60–59 11-6 -3.37 47.40 36.8 17.90 28.60 96.10
2014 Mercer Duke 78–71 14-3 -18.34 51.60 38.80 19.00 31.90 101.80
2014 Stephen F. Austin VCU 77–75 (OT) 12-5 -9.91 52.30 34.90 23.60 38.12 100.90
2014 North Dakota State Oklahoma 80–75 (OT) 12-5 -8.89 55.40 34.70 17.10 30.60 103.10
2014 Harvard Cincinnati 61–57 12-5 -1.77 49.70 38.70 21.20 32.30 94.70
2014 Tennessee Massachusetts 86–67 11-6 7.57 50.60 31.90 16.90 39.70 93.60
2014 Dayton Ohio State 60–59 11-6 -8.76 50.40 37.70 18.80 34.00 99.00
2013 Florida Gulf Coast Georgetown 78–68 15-2 -18.78 52.30 33.40 22.10 32.50 96.80
2013 Harvard New Mexico 68–62 14-3 -13.76 51.50 39.80 20.90 25.60 99.30
2013 LaSalle Kansas State 63–61 13-4 -4.97 49.30 37.70 21.30 29.00 96.20
2013 Ole Miss Wisconsin 57–46 12-5 -6.42 49.40 32.40 21.50 34.10 93.60
2013 California UNLV 64–61 12-5 -4.53 48.80 30.20 16.80 32.50 92.30
2013 Oregon Oklahoma State 68–55 12-5 -4.56 49.10 33.30 22.00 35.70 88.10
2013 Minnesota UCLA 83–63 11-6 3.98 48.90 33.70 20.20 43.80 93.10
2012 Lehigh Duke 75–70 15-2 -11.88 49.40 34.70 21.30 31.00 96.60
2012 Norfolk State Missouri 86–84 15-2 -29.10 50.50 31.50 19.60 33.80 99.90
2012 Ohio Michigan 65–60 13-4 -7.30 47.70 34.00 26.40 33.90 92.00
2012 South Florida Temple 58–44 12-5 -4.05 49.00 31.20 18.70 34.00 88.10
2012 VCU Wichita State 62–59 12-5 -10.57 45.90 33.40 27.30 33.40 90.80
2012 Colorado UNLV 68–64 11-6 -7.79 48.30 34.60 18.40 29.20 93.30
2012 North Carolina State San Diego State 79–65 11-6 0.65 49.40 35.50 18.60 35.80 95.20
2011 Morehead State Louisville 62–61 13-4 -15.80 50.00 34.20 22.70 41.20 94.70
2011 Richmond Vanderbilt 69–66 12-5 -2.16 49.90 39.00 19.60 28.80 93.40
2011 Marquette Xavier 66–55 11-6 1.43 50.50 34.90 20.60 35.80 93.80
2011 VCU Georgetown 74–56 11-6 -9.96 48.00 37.00 22.10 30.70 95.40
2011 Gonzaga St. John’s 86–71 11-6 -0.29 51.90 36.10 20.80 36.30 91.60
2010 Ohio Georgetown 97–83 14-3 -16.50 46.40 36.50 21.60 31.20 95.60
2010 Murray State Vanderbilt 66–65 13-4 -3.08 54.20 38.10 24.00 39.60 90.40
2010 Cornell Temple 78–65 12-5 -7.23 51.10 43.30 20.90 31.50 97.70
2010 Washington Marquette 80–78 11-6 -2.65 49.50 33.60 22.20 36.60 90.10
2010 Old Dominion Notre Dame 51–50 11-6 -0.71 49.20 31.70 22.60 42.10 87.20
2009 Cleveland State Wake Forest 84–69 13-4 -8.54 47.40 30.40 24.10 33.50 90.20
2009 Wisconsin Florida State 61–59 (OT) 12-5 2.04 47.70 36.00 19.30 31.60 92.70
2009 Arizona Utah 84–71 12-5 -3.01 50.90 38.90 18.00 35.60 98.40
2009 Western Kentucky Illinois 76–72 12-5 -11.79 49.90 37.70 19.70 37.50 99.70
2009 Dayton West Virginia 68–62 11-6 -15.17 46.00 32.80 21.90 37.70 89.60
2008 Siena Vanderbilt 83–62 13-4 -6.49 48.10 38.20 24.00 31.30 96.90
2008 San Diego Connecticut 70–69 (OT) 13-4 -14.22 48.70 33.70 22.90 32.80 93.00
2008 Villanova Clemson 75–69 12-5 -9.31 47.80 34.40 23.40 36.00 91.80
2008 Western Kentucky Drake 101–99 (OT) 12-5 -7.81 51.20 38.90 24.50 36.80 94.00
2008 Kansas State Southern California 80–67 11-6 -0.97 50.20 32.00 22.40 44.30 91.40
2007 Winthrop Notre Dame 76–64 11-6 -6.04 55.10 35.50 20.60 35.40 93.10
2007 VCU Duke 79–77 11-6 -9.20 48.20 40.10 23.80 36.00 97.60
2006 Northwestern State Iowa 64–63 14-3 -11.87 50.50 36.20 24.10 38.20 95.70
2006 Bradley Kansas 77–73 13-4 -7.56 48.00 33.60 23.10 35.50 88.20
2006 Montana Nevada 87–79 12-5 -7.13 54.90 37.00 20.40 33.40 99.50
2006 Texas A&M Syracuse 66–58 12-5 2.44 49.10 36.10 27.30 43.20 87.10
2006 Milwaukee Oklahoma 82–74 11-6 -2.68 48.30 33.70 21.70 38.40 93.00
2006 George Mason Michigan State 75–65 11-6 -0.44 53.80 35.60 20.40 32.20 88.70
2005 Bucknell Kansas 64–63 14-3 -17.76 48.90 36.90 23.70 31.40 92.00
2005 Vermont Syracuse 60–57 (OT) 13-4 -7.43 48.70 35.80 19.40 35.50 94.30
2005 Milwaukee Alabama 83–73 12-5 -8.29 49.90 35.30 24.30 36.70 91.50
2005 UAB Louisiana State 82–68 11-6 -2.15 49.80 34.70 27.40 32.10 93.40
2004 Manhattan Florida 75–60 12-5 -8.59 47.10 36.80 24.00 35.40 91.00
2004 Pacific Providence 66–58 12-5 -9.71 52.50 35.50 19.40 31.10 95.10
2003 Tulsa Dayton 84–71 13-4 -4.42 51.00 36.80 20.30 33.60 91.80
2003 Butler Mississippi State 47–46 12-5 -7.01 53.40 39.10 20.00 28.70 96.40
2003 Central Michigan Creighton 79–73 11-6 -7.84 56.10 38.40 22.40 35.60 96.60
2002 UNC-Wilmington Southern California 93–89 (OT) 13-4 -10.22 46.50 37.30 24.00 33.20 94.00
2002 Creighton Florida 83–82 (OT) 12-5 -15.22 51.20 37.20 22.90 35.70 97.50
2002 Tulsa Marquette 71–69 12-5 -6.67 51.10 40.20 21.10 32.50 98.40
2002 Missouri Miami (Florida) 93–80 12-5 -1.44 49.70 39.10 20.00 39.70 96.70
2002 Wyoming Gonzaga 73–68 11-6 -8.41 50.60 30.90 18.70 35.80 94.60
2002 Southern Illinois Texas Tech 76–68 11-6 -4.81 49.90 36.60 21.60 36.20 93.60

Well, that is a lot of numbers. First off, as I mentioned before, 11-6s aren’t great upsets, so I will focus on the 12 and higher seeds for my analysis. We see a lot of upsets coming from teams with specialized skill sets. For instance, just last year Middle Tennessee was an elite 3 point shooting team, finishing in the top 5 percent in the NCAA in that category. Sure enough, when they knocked off Michigan St. they shot 11/19 from behind the arc. This performance may be somewhat of an outlier, but scoring 3 points on a possession that often is the kind of production needed to cause an upset of this caliber. They were not alone in this: when Arkansas-Little Rock upset Purdue, also last year, they too were a top 3 point shooting team. Similarly, Mercer (over Duke in 2014), Harvard (over Cincinnati in 2014), and Harvard again (over New Mexico in 2013) were all excellent long range shooters. Not all of these games featured great 3 point performances, but relying on the 3 leads to a more variable outcome, meaning a team that gets hot may beat teams it otherwise shouldn’t, which is exactly where an upset in March stems from.

3 pointers are just one of the skill sets that can lead to an upset. A couple of other patterns arose too. Let’s talk about offensive rebounding. As the game has evolved, a lot of teams have moved away from even attempting these rebounds, preferring to get back on defense and prevent the fast break opportunity. However, teams that own the boards get high-efficiency put-back shots and extend possessions, and they can reap the rewards. Morehead State in 2011 rebounded over 40% of its own missed shots that year, which is one of the reasons they were able to upset Louisville. 2010’s Old Dominion upset Notre Dame after owning the boards all season, as did Kansas State in 2008.

Forcing turnovers is another skill that can cause a March Madness upset when a team performs at an elite level. Some of this may be more match-up dependent: teams that regularly face a full court press (as played by West Virginia, for example) can adjust accordingly, but if it’s an opponent’s first time encountering this style of play in a while it can be a game-breaker. We saw this last year when Stephen F. Austin knocked off West Virginia early, forcing 22 turnovers and giving them a taste of their own medicine. Georgia State forced 21 turnovers upsetting Baylor in 2015. Ohio was also a turnover specialist when it knocked off Michigan in 2012.

Some other stats to look for include elite 2 point shooting and overall adjusted defensive efficiency (typically allowing fewer than 0.9 points per possession adjusted for competition, i.e. under 90 in the AdjDefEff column, which is per 100 possessions). The common pattern, though, is that a team goes beyond being well-balanced and really excels in at least one of these areas. I don’t mean the top 25% of teams; the data suggests it takes being in the top 5-10% of Division I teams. Which category it is seems to matter a little less, but excellent 2 or 3 point shooting teams seem to produce the most upsets.

I looked at a couple of other stats where I didn’t see the same patterns. In particular, I looked at whether or not a team won its conference tournament (which is a bit of a misnomer, since most of the 12-15 seeds punched their ticket by winning their respective tournaments). However, even with the bigger schools, this didn’t seem to be of any relevance. Another area I looked at was how a team performed in its last 10 games leading up to the tournament, but again this didn’t seem to show much. A lot of the smaller conference schools may have inflated win percentages from playing lesser competition, so this really needs to be considered on a team-by-team basis and does not generalize well across all teams.

Since HTML isn’t the best format to work with, I have uploaded my Excel data here, with the elite categories highlighted for easier viewing.  NCAA Upsets Data

Enjoy, and check back later this week after Selection Sunday for part 3, where I will explore the likely upset candidates for 2017.

Anatomy of the March Madness Upset

I wanted to take a quick look today at all the NCAA tournament “upsets” dating back to 2002. Note the quotes around upset, because what we will see from the data is that a lot of the 11-6 seed match-ups are not upsets at all. It has become apparent that the tournament selection committee has not been keeping up with modern stats to aid its decisions, relying instead on its old faithful RPI. Although I know they are looking to change that this year, it will be a while before that process takes hold.

For my classification of an upset I am loosely using any 11 seed or higher winning in the opening round. The data below is limited to the first round of the tournament (ignore whatever the NCAA is calling its play-in round these days; that does not count as the first round) and goes back to 2002. Without further ado, I have compiled all the March Madness upsets below, along with each team’s adjusted efficiency margin (AdjEM: the difference between points scored and points allowed per 100 possessions, adjusted for the competition played throughout the season), taken from Ken Pomeroy’s pre-tourney statistics.

Year Lower Seed Higher Seed Score Seed Lower Seed AdjEM Higher Seed AdjEM AdjEM Diff
2016 Middle Tennessee Michigan State 90–81 15-2 3.90 29.58 -25.68
2016 Stephen F. Austin West Virginia 70–56 14-3 14.43 25.81 -11.38
2016 Hawaii California 77–66 13-4 11.73 19.63 -7.90
2016 Arkansas-Little Rock Purdue 85–83 (2 OT) 12-5 12.48 24.66 -12.18
2016 Yale Baylor 79–75 12-5 13.83 19.90 -6.07
2016 Northern Iowa Texas 75–72 11-6 10.11 17.08 -6.97
2016 Gonzaga Seton Hall 68–52 11-6 19.28 17.67 1.61
2016 Wichita State Arizona 65–55 11-6 21.17 21.85 -0.68
2015 Georgia State Baylor 57–56 14-3 9.89 23.44 -13.56
2015 UAB Iowa State 60–59 14-3 2.83 23.00 -20.17
2015 Dayton Providence 66–53 11-6 14.14 17.56 -3.43
2015 UCLA SMU 60–59 11-6 14.17 17.54 -3.37
2014 Mercer Duke 78–71 14-3 7.50 25.84 -18.34
2014 Stephen F. Austin VCU 77–75 (OT) 12-5 10.80 20.71 -9.91
2014 North Dakota State Oklahoma 80–75 (OT) 12-5 12.57 21.47 -8.89
2014 Harvard Cincinnati 61–57 12-5 17.30 19.07 -1.77
2014 Tennessee Massachusetts 86–67 11-6 21.71 14.14 7.57
2014 Dayton Ohio State 60–59 11-6 12.94 21.70 -8.76
2013 Florida Gulf Coast Georgetown 78–68 15-2 3.34 22.12 -18.78
2013 Harvard New Mexico 68–62 14-3 6.93 20.68 -13.76
2013 LaSalle Kansas State 63–61 13-4 13.24 18.22 -4.97
2013 Ole Miss Wisconsin 57–46 12-5 16.53 22.95 -6.42
2013 California UNLV 64–61 12-5 12.68 17.20 -4.53
2013 Oregon Oklahoma State 68–55 12-5 14.82 19.39 -4.56
2013 Minnesota UCLA 83–63 11-6 19.11 15.13 3.98
2012 Lehigh Duke 75–70 15-2 8.95 20.83 -11.88
2012 Norfolk State Missouri 86–84 15-2 -2.43 26.67 -29.10
2012 Ohio Michigan 65–60 13-4 10.77 18.07 -7.30
2012 South Florida Temple 58–44 12-5 11.47 15.52 -4.05
2012 VCU Wichita State 62–59 12-5 12.67 23.24 -10.57
2012 Colorado UNLV 68–64 11-6 8.26 16.05 -7.79
2012 North Carolina State San Diego State 79–65 11-6 13.11 12.45 0.65
2011 Morehead State Louisville 62–61 13-4 6.64 22.44 -15.80
2011 Richmond Vanderbilt 69–66 12-5 14.89 17.05 -2.16
2011 Marquette Xavier 66–55 11-6 17.66 16.23 1.43
2011 VCU Georgetown 74–56 11-6 8.63 18.59 -9.96
2011 Gonzaga St. John’s 86–71 11-6 16.15 16.44 -0.29
2010 Ohio Georgetown 97–83 14-3 7.17 23.67 -16.50
2010 Murray State Vanderbilt 66–65 13-4 14.11 17.19 -3.08
2010 Cornell Temple 78–65 12-5 13.27 20.50 -7.23
2010 Washington Marquette 80–78 11-6 17.46 20.11 -2.65
2010 Old Dominion Notre Dame 51–50 11-6 17.24 17.95 -0.71
2009 Cleveland State Wake Forest 84–69 13-4 11.40 19.94 -8.54
2009 Wisconsin Florida State 61–59 (OT) 12-5 17.58 15.54 2.04
2009 Arizona Utah 84–71 12-5 15.63 18.64 -3.01
2009 Western Kentucky Illinois 76–72 12-5 6.91 18.70 -11.79
2009 Dayton West Virginia 68–62 11-6 9.28 24.45 -15.17
2008 Siena Vanderbilt 83–62 13-4 6.67 13.17 -6.49
2008 San Diego Connecticut 70–69 (OT) 13-4 4.41 18.62 -14.22
2008 Villanova Clemson 75–69 12-5 12.62 21.93 -9.31
2008 Western Kentucky Drake 101–99 (OT) 12-5 13.88 21.69 -7.81
2008 Kansas State Southern California 80–67 11-6 18.51 19.48 -0.97
2007 Winthrop Notre Dame 76–64 11-6 14.26 20.30 -6.04
2007 VCU Duke 79–77 11-6 13.91 23.11 -9.20
2006 Northwestern State Iowa 64–63 14-3 7.04 18.91 -11.87
2006 Bradley Kansas 77–73 13-4 16.01 23.57 -7.56
2006 Montana Nevada 87–79 12-5 8.70 15.84 -7.13
2006 Texas A&M Syracuse 66–58 12-5 15.44 12.99 2.44
2006 Milwaukee Oklahoma 82–74 11-6 11.86 14.54 -2.68
2006 George Mason Michigan State 75–65 11-6 16.24 16.68 -0.44
2005 Bucknell Kansas 64–63 14-3 5.73 23.48 -17.76
2005 Vermont Syracuse 60–57 (OT) 13-4 13.39 20.83 -7.43
2005 Milwaukee Alabama 83–73 12-5 12.70 20.99 -8.29
2005 UAB Louisiana State 82–68 11-6 12.10 14.25 -2.15
2004 Manhattan Florida 75–60 12-5 11.42 20.01 -8.59
2004 Pacific Providence 66–58 12-5 8.61 18.32 -9.71
2003 Tulsa Dayton 84–71 13-4 11.17 15.59 -4.42
2003 Butler Mississippi State 47–46 12-5 15.04 22.06 -7.01
2003 Central Michigan Creighton 79–73 11-6 10.18 18.01 -7.84
2002 UNC-Wilmington Southern California 93–89 (OT) 13-4 10.76 20.98 -10.22
2002 Creighton Florida 83–82 (OT) 12-5 11.31 26.53 -15.22
2002 Tulsa Marquette 71–69 12-5 15.78 22.45 -6.67
2002 Missouri Miami (Florida) 93–80 12-5 14.51 15.95 -1.44
2002 Wyoming Gonzaga 73–68 11-6 11.43 19.83 -8.41
2002 Southern Illinois Texas Tech 76–68 11-6 12.69 17.50 -4.81

Some observations on the March Madness Upset:

As I mentioned before there are a number of 11-6 upsets with an Adj EM Diff either really small or in the favor of the lower seed.

Even games with large differences in talent can produce upsets. We all remember Middle Tennessee destroying a lot of brackets last year by defeating Michigan St. in the opening round. While many argue they should have been seeded higher than a 15, they still overcame a huge efficiency discrepancy.

What was the biggest upset in March Madness history? Some contenders from recent memory include the aforementioned Middle Tennessee over Michigan St., Lehigh upsetting Duke in 2012, Mercer upsetting Duke two years later, or “Dunk City” Florida Gulf Coast upsetting Georgetown in 2013. However, based strictly on adjusted efficiency differential, the biggest upset was Norfolk State knocking off Missouri in 2012 (an AdjEM Diff of -29.10).

The average adjusted efficiency differential across these upsets is -7.51. Note this means nothing statistically; I was curious, so I calculated it.
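
For anyone who wants to reproduce that kind of summary, here is a small Python sketch of the calculation. To keep it short it uses only the eight 2016 rows from the table above, so its average will differ from the full-table figure. The variable and function names are assumptions made for this example.

```python
from statistics import mean

# (lower seed, seeds, lower seed AdjEM, higher seed AdjEM)
# for the 2016 rows of the table above.
upsets_2016 = [
    ("Middle Tennessee", "15-2", 3.90, 29.58),
    ("Stephen F. Austin", "14-3", 14.43, 25.81),
    ("Hawaii", "13-4", 11.73, 19.63),
    ("Arkansas-Little Rock", "12-5", 12.48, 24.66),
    ("Yale", "12-5", 13.83, 19.90),
    ("Northern Iowa", "11-6", 10.11, 17.08),
    ("Gonzaga", "11-6", 19.28, 17.67),
    ("Wichita State", "11-6", 21.17, 21.85),
]


def adj_em_diff(lower_adj_em: float, higher_adj_em: float) -> float:
    """AdjEM Diff as used in the table: lower seed minus higher seed."""
    return lower_adj_em - higher_adj_em


diffs = {name: adj_em_diff(lo, hi) for name, _, lo, hi in upsets_2016}

avg_all = mean(diffs.values())
# Same average, leaving out the 11-6 games that are arguably not upsets.
avg_without_11_6 = mean(
    adj_em_diff(lo, hi) for _, seeds, lo, hi in upsets_2016 if seeds != "11-6"
)
# The most negative differential is the biggest upset by this measure.
biggest = min(diffs, key=diffs.get)

print(f"all 2016 upsets: {avg_all:.2f}")
print(f"excluding 11-6:  {avg_without_11_6:.2f}")
print(f"biggest 2016 upset by AdjEM Diff: {biggest}")
```

Dropping the 11-6 games makes the average gap noticeably larger, which is consistent with the earlier point that many 11-6 results are not really upsets.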

Anyway, chew on that for a bit. We will try to dive into some other factors that caused these upsets in upcoming posts.

Calculating Home Court Advantage in College Basketball

Home court advantage is a term often thrown around. It is a common theme across all sports, with varying degrees of influence. From a logical standpoint, some aspects make sense: not having to travel long distances (or cross time zones, for that matter), having engaged fans rooting you on, and the comfort of playing in a familiar place. Books like Scorecasting suggest the possibility of other influences, particularly a bias in the referees to make calls that favor the home team. Whatever the cause, I want to explore the reach of home court advantage in college basketball. While this has been done before, many times, I want to take this opportunity to use a powerful tool to analyze true home court advantage more accurately.

One of the hurdles in calculating home court advantage in college basketball is the way teams schedule opponents. Typically schools in the power conferences schedule a large percentage of exhibition-type games against lesser opponents. These games are almost never a home and away setup. If we were to calculate home court advantage using these games, we would get a lopsided result because the home team almost always wins, and by a large margin. For my calculation I want to restrict my data set to teams who play a home and away with each other in the same season. Fortunately, conference play provides just this data. The challenge lies in separating the games where two opponents play a home and away from all the others. We will need some tools to assist us.

One recurring question when evaluating data is which tool is best for the job. Excel or its OpenOffice equivalents are often a good choice for tabular data, but so is MySQL, or insert your favorite programming language. I often find myself wanting to write queries against data in a delimited text file (csv) without having to lay out a database schema, connect to a database, and perform inserts in order to do so. That is time prohibitive and tedious. One tool I have found to be particularly helpful is a package called “Q – Text as Data”. It is a simple command line utility that runs on Windows and Linux and lets you query csv files as if they were database tables, using the column headers as the field names.

Calculating home court advantage in college basketball using Q

Back to our experiment of identifying games with a home and away within a single season. Let’s see how Q can help us. I am starting with the following data set, which includes all games from the 2005 through 2015 seasons. You can download it here:  2005-2015-scores. Let’s use some SQL via Q to filter this file to the games we care about. I will show the commands and then offer some explanation below.

q -H -d, "SELECT AVG(a.teamscore - a.oppscore)
FROM 2005-2015.csv a
INNER JOIN (SELECT teamname, opponent, datestr, seasonyear, site, teamscore, oppscore FROM 2005-2015.csv WHERE (site = 'H' OR site = 'A') GROUP BY teamname, opponent, seasonyear HAVING COUNT(*) = 2) b
ON a.teamname = b.teamname AND a.opponent = b.opponent AND a.seasonyear = b.seasonyear AND a.site = 'H'"

From the output of the above, we see that teams win by 3.53 points per game in the home leg of the home and away. Conversely, the visiting disadvantage can be calculated as follows:

q -H -d, "SELECT AVG(a.teamscore - a.oppscore)
FROM 2005-2015.csv a
INNER JOIN (SELECT teamname, opponent, datestr, seasonyear, site, teamscore, oppscore FROM 2005-2015.csv WHERE (site = 'H' OR site = 'A') GROUP BY teamname, opponent, seasonyear HAVING COUNT(*) = 2) b
ON a.teamname = b.teamname AND a.opponent = b.opponent AND a.seasonyear = b.seasonyear AND a.site = 'A'"

The output this time is -3.518, which is the average margin for the visitors. Home teams win by 3.53 ppg while visiting teams lose by 3.518 ppg, so the value of playing at home versus away is a swing of about 7 points (3.53 + 3.518 = 7.048, rounded to the nearest whole number). Put another way, each team gains about 3.5 points at home and gives up about 3.5 on the road. That is our calculated home court advantage in college basketball.

OK, so what did we just do? “q” is the name of the program we are running, and we pass a few parameters to it:
-H tells it to use the first row of the csv as the header, which provides the column names.
-d, tells it that the text file we are passing in is comma delimited (the default delimiter is a space).
The third argument is the SQL to run, explained below.

Let’s take a look at what this SQL is doing, from the inside out. In the INNER JOIN subquery we group by teamname, opponent, and seasonyear, which isolates the results for each pairing of teams within a season. We want only games where the site is ‘H’ or ‘A’; neutral, semi-home, and semi-away games are identified differently, so this rules out sites where there is not a true home court advantage. We keep only groups having exactly two games, the intent being one home game and one away game. We do this so that additional meetings between the same opponents, most likely in a tournament, are not included. Next we join these groups back to the original rows, which filters the original data set down to the games we care about. From there we simply take the average score differential for the home games and for the away games to come up with our calculated home court advantage.
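
If you want to see the filter at work without installing Q, the same query can be tried in Python with the standard library’s sqlite3 module. This is only a sketch: the table name `games` and the six rows are made up for illustration, the unused columns are dropped from the subquery, and the site filter is moved into a WHERE clause, which is equivalent for an inner join.

```python
import sqlite3

# Toy rows, made up for illustration only:
# (teamname, opponent, seasonyear, site, teamscore, oppscore)
rows = [
    ("A", "B", 2015, "H", 70, 60),  # A hosts B
    ("A", "B", 2015, "A", 62, 66),  # A visits B
    ("A", "C", 2015, "H", 90, 50),  # one-off home game, filtered out
    ("A", "D", 2015, "H", 75, 70),  # A hosts D
    ("A", "D", 2015, "A", 64, 65),  # A visits D
    ("A", "D", 2015, "N", 80, 60),  # neutral site, ignored by the site filter
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE games (teamname TEXT, opponent TEXT, seasonyear INTEGER, "
    "site TEXT, teamscore INTEGER, oppscore INTEGER)"
)
conn.executemany("INSERT INTO games VALUES (?, ?, ?, ?, ?, ?)", rows)

QUERY = """
SELECT AVG(a.teamscore - a.oppscore)
FROM games a
INNER JOIN (
    SELECT teamname, opponent, seasonyear
    FROM games
    WHERE site IN ('H', 'A')
    GROUP BY teamname, opponent, seasonyear
    HAVING COUNT(*) = 2
) b
ON a.teamname = b.teamname
   AND a.opponent = b.opponent
   AND a.seasonyear = b.seasonyear
WHERE a.site = ?
"""

home_margin = conn.execute(QUERY, ("H",)).fetchone()[0]
away_margin = conn.execute(QUERY, ("A",)).fetchone()[0]
print(home_margin, away_margin)
```

The one-off game against C drops out because its group has a single row, while the A-D pairing survives because the neutral-site game is removed before the rows are counted.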

So there it is: we calculated a home court advantage of ~3.5 points per game and a visitor’s disadvantage of ~3.5 points per game, giving roughly a 7 point swing for non-neutral sites. That is your calculated home court advantage in college basketball, courtesy of Q, which can give you an advantage in your analytics arsenal.

Betting College Basketball with Adjusted Efficiencies

Anyone who has visited Ken Pomeroy’s site kenpom.com or read Dean Oliver’s “Basketball on Paper” is familiar with the concept of efficiencies: simply put, the number of points scored per possession. College basketball can be modeled effectively by comparing a team’s expected offensive efficiency (points scored per possession) against its opponent’s expected defensive efficiency (points allowed per possession). Combine that with the number of possessions per team, and you can predict the final score.

home score = (home team’s offensive efficiency vs visitor’s defensive efficiency) * # of possessions
visitor score = (visitor’s offensive efficiency vs home team’s defensive efficiency) * # of possessions

A couple of clarifications are already needed for the above formulas. We say “vs” between an offensive and a defensive efficiency, and there are different ways to calculate this. If a team allows 0.8 points per possession and its opponent scores 1.05 points per possession, what do we expect to happen when they play? The first thought might be to average the two numbers. That would be incorrect, though. A defense allowing 0.8 points per possession would likely be the best in the league, and it presumably got there against a variety of opponents whose combined offensive efficiency is around the league average (~1.02 points per possession, varying year to year). Against an offense that is only slightly better than average we would not expect anywhere near the 1.05 it usually produces; we would expect somewhere between 0.8 and 0.85 points per possession.

We can bring the league average into the equation (simplified for now by assuming a team plays both good and bad opponents that average out over time; we will get more accurate later on). Add the offense’s efficiency to the opposing defense’s efficiency, then subtract the league average. In this case the expected efficiency would be 0.8 + 1.05 (the opponent’s average offensive efficiency) - 1.02 (the league average), which equals 0.83. In other words, an opponent that scores slightly better than most teams in the league should likewise score slightly more than usual against this defense. Similar calculations can be made for the other side of the ball.
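
The arithmetic above fits in a couple of small Python functions. This is a sketch of the simplified additive method only, and the function names and the 68-possession game are assumptions chosen for the example.

```python
LEAGUE_AVG_EFFICIENCY = 1.02  # points per possession; varies year to year


def expected_efficiency(
    offense_ppp: float,
    opponent_defense_ppp: float,
    league_avg: float = LEAGUE_AVG_EFFICIENCY,
) -> float:
    """Expected points per possession for an offense against a given defense."""
    return offense_ppp + opponent_defense_ppp - league_avg


def predicted_score(
    offense_ppp: float,
    opponent_defense_ppp: float,
    possessions: float,
    league_avg: float = LEAGUE_AVG_EFFICIENCY,
) -> float:
    """Expected efficiency multiplied by the number of possessions."""
    return expected_efficiency(offense_ppp, opponent_defense_ppp, league_avg) * possessions


# The example from the text: a 1.05 offense against a 0.80 defense.
example = expected_efficiency(1.05, 0.80)

# Hypothetical 68-possession game, purely to show the score formula.
score = predicted_score(1.05, 0.80, possessions=68)

print(f"expected efficiency: {example:.2f}")
print(f"predicted score: {score:.1f}")
```

Calling `predicted_score` once per team, with the offense and defense roles swapped, gives both sides of the predicted final score.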

The other clarification concerns the exact number of possessions, which is not a stat provided in a typical box score but can be estimated by counting the number of made shots (including trips to the free throw line), defensive rebounds, and turnovers.

Another flaw with what we have proposed is that teams play opponents of different strengths over the course of the season. Some teams will have a much higher strength of schedule than others, which tends to lead to lower raw efficiencies. Fortunately we have ways of accommodating this: we can look at the quality of the opponents a team faced and adjust our predicted efficiencies accordingly. I won’t go into too much detail on how this is done here. The short of it is that to calculate team A’s adjusted offensive efficiency, we look at the defense of every team that team A has already played and compare it with the national average. If team A has faced tougher than average defenses, we give it a bump in adjusted offensive efficiency; if it has faced weaker than average defenses, we reduce expectations. See http://kenpom.com/blog/ratings-methodology-update/ for a full description.

Betting on college basketball using adjusted efficiencies.

So now that we have an idea of what adjusted efficiencies are, I wanted to explore whether they could be used to beat the spread and/or totals bets in college basketball. We learned in an earlier post that most college basketball lines are set extremely close to these adjusted efficiencies. However, maybe there is a large enough discrepancy to exploit. So I designed an experiment to find out.

For this experiment I am using college basketball data collected from the 2003/2004 through the 2014/2015 seasons.  In order to calculate adjusted efficiencies I need a decent sampling of game data each season.  For that reason I am only betting on games from January through the end of each season.  I don’t include any preseason rankings or other prediction-based approaches.  I want this to be fueled by real data that resets each season, so I exclude the first two months from my simulated bets and use them only as data for calculating adjusted efficiencies.

I would have liked to compare with the adjusted efficiencies directly from kenpom.com.  However, the data presented on that site is constantly changing as the season progresses.  There is no way to go back and view the adjusted efficiencies at a specific point in time.  I want a purely predictive model, so I needed a new approach.  To solve this problem I decided to calculate my own adjusted efficiencies in a way as similar to Ken Pomeroy’s as I can.  For this I calculated the raw offensive and defensive efficiencies, along with the estimated possessions for each game, calculated by:

(Field goal attempts – offensive rebounds) + turnovers + (0.475 * free throw attempts)
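As a minimal sketch, the possession formula above and the raw efficiency built on it look like this in Python.  The box score numbers are made up for illustration and the function names are my own.

```python
def estimate_possessions(fga, off_reb, turnovers, fta, ft_weight=0.475):
    """Estimate possessions from box score totals using the formula above."""
    return (fga - off_reb) + turnovers + ft_weight * fta


def raw_efficiency(points, possessions):
    """Points per possession."""
    return points / possessions


# Hypothetical box score line, not taken from a real game.
possessions = estimate_possessions(fga=58, off_reb=10, turnovers=12, fta=20)
print(possessions)                                 # 69.5
print(round(raw_efficiency(73, possessions), 3))   # 1.05
```

The same function applied to the opponent’s box score gives the defensive side: points allowed divided by the opponent’s estimated possessions.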

Then I adjusted for competition as explained above.  For each game, I looked at team A and every team B it had played earlier that season, calculated team A’s average offensive efficiency, and adjusted it by comparing each team B’s defensive efficiency up to that point in the season against the national average.  So if team A averaged 1.05 points per possession (more than the league average), but its opponents’ adjusted defensive efficiency also allowed 1.05 points per possession (also more than the league average), I would adjust team A’s expected offensive efficiency to be the league average (1.02).  Similar calculations were made for the adjusted defensive efficiency.  Note that only games against Division I opponents were included in these calculations.
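Here is a simplified Python sketch of that adjustment.  It is a single-pass, additive version that I am assuming for illustration: it is not my full C# model and not Ken Pomeroy’s exact method, and the per-game numbers are invented so that the averages match the example above.

```python
from statistics import mean


def adjusted_offensive_efficiency(game_off_ppp, opp_def_ppp, league_avg_ppp):
    """Single-pass additive schedule adjustment (illustrative assumption).

    game_off_ppp   -- team A's points per possession in each game so far
    opp_def_ppp    -- each opponent's points per possession allowed so far
    league_avg_ppp -- national average points per possession
    """
    # Positive when the opposing defenses were softer than average.
    schedule_softness = mean(opp_def_ppp) - league_avg_ppp
    return mean(game_off_ppp) - schedule_softness


# Team A averages 1.05 against defenses that also allow 1.05 on average.
adjusted = adjusted_offensive_efficiency(
    game_off_ppp=[1.10, 1.00, 1.05],
    opp_def_ppp=[1.08, 1.02, 1.05],
    league_avg_ppp=1.02,
)
print(round(adjusted, 2))  # 1.02
```

A fuller model would use the opponents’ adjusted defensive efficiencies rather than their raw ones, which means repeating the calculation until the ratings stop changing.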

Before analyzing any results, I cross-checked my numbers against some of the late-season games each season, as these games should be the closest comparison between my model and kenpom.com’s predictions.  They were not exact matches, as his model likely weights other factors, such as favoring recent games and possibly considering the site of each game played.  I have yet to find the exact formula for his calculations, but the values I cross-checked were reasonably close.  Each adjusted efficiency averaged within a 2% difference of his model.  Not exact, but close enough for now.

Nerd Speak: For this experiment I wrote a C# program to create my model.  I load the raw data from CSV files and store it in a SQL database.  For each game, I query the database for the home and visiting teams and load every Division I game played up until that point in the season.  I then calculate the adjusted efficiencies, looking not only at every game the home and visiting teams played, but also at each game all of their opponents played, in order to determine proper weights for my adjusted efficiency model.  For each game I output a predicted score for both teams into an Excel spreadsheet.  I use some simple functions in Excel to evaluate how the model did, and visually cross-check that my results seem realistic.

Adjusted Efficiency Betting Results

21584 games were used for my analysis.  While there were more applicable games in the January-April time frame for these college basketball seasons, I could only evaluate against games I could find betting lines for.  I had purchased a historical data set, which was mostly complete but had some holes.  My first approach included every game in this data set, evaluating against the closing spread and closing total line for each game.   Here were the results:

Spread bets:
Wins: 10617  Losses: 10523  Win %: 49.2

Total bets:
Wins: 10523  Losses: 11061  Win %: 48.8

Unfortunately, these results did not show any advantage.  My next step was to try to figure out why.  Perhaps it is because I am betting on every game, no matter how small the difference between my prediction and the line.  To test this hypothesis, I decided to only consider games where my predicted margin differed from the Vegas spread by 5 points or more, and where my predicted total differed from the totals line by 8 points or more.  Let’s look at the results.
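A rough Python sketch of this kind of filter for spread bets is below.  The Game fields, the sign convention (home score minus away score), and the sample games are all assumptions for illustration, not data from my set.

```python
from dataclasses import dataclass


@dataclass
class Game:
    predicted_margin: float  # model's home-minus-away margin
    market_margin: float     # home-minus-away margin implied by the closing spread
    actual_margin: float     # final home-minus-away margin


def grade_spread_bets(games, min_edge=5.0):
    """Bet only when the model disagrees with the line by at least min_edge."""
    wins = losses = pushes = 0
    for g in games:
        edge = g.predicted_margin - g.market_margin
        if abs(edge) < min_edge:
            continue  # model and line are too close, so no bet
        result = g.actual_margin - g.market_margin
        if result == 0:
            pushes += 1
        elif (result > 0) == (edge > 0):
            wins += 1  # the game landed on the side the model picked
        else:
            losses += 1
    return wins, losses, pushes


# Made-up games: one win, one loss, one skipped, one push.
sample = [
    Game(8.0, 2.0, 5.0),
    Game(-3.0, 3.5, 7.0),
    Game(4.0, 2.0, 10.0),
    Game(-1.0, 6.0, 6.0),
]
print(grade_spread_bets(sample))  # (1, 1, 1)
```

The totals version is the same idea with predicted, market, and actual totals in place of margins and a threshold of 8.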

Spread bets:
Wins: 1149  Losses: 1187  Win %: 49.18

Total bets:
Wins: 1337  Losses: 1450 Win %: 47.97

Again, not the results I was secretly hoping for.  There seems to be no advantage in using adjusted efficiencies the way I have to predict college basketball spreads or totals.  However, it did give me some evidence that my model was fairly accurate at predicting Vegas spreads: in 89.2% of games, my predicted score differential was within 5 points of the spread.  Considering my model does not account for injuries, other day-to-day lineup adjustments, or any perceived “hot streaks” that may influence the line one way or another, I would say it is a fairly good prediction model, but one that is better at predicting Vegas spreads than beating them.
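To put those win rates in context, here is a short Python sketch comparing them with the break-even rate.  The -110 price is an assumption on my part (it is a common price for spread and totals bets, but my data set may differ), and the function names are my own.

```python
def break_even_win_rate(american_odds):
    """Win rate needed to break even at a given American odds price."""
    if american_odds < 0:
        risk, profit = -american_odds, 100
    else:
        risk, profit = 100, american_odds
    return risk / (risk + profit)


def win_rate(wins, losses):
    """Share of decided bets that won (pushes excluded)."""
    return wins / (wins + losses)


needed = break_even_win_rate(-110)   # assumed standard pricing
spread = win_rate(1149, 1187)        # filtered spread bets from above
totals = win_rate(1337, 1450)        # filtered totals bets from above

print(f"break-even: {needed:.4f}")   # break-even: 0.5238
print(f"spread:     {spread:.4f}")   # spread:     0.4919
print(f"totals:     {totals:.4f}")   # totals:     0.4797
```

Under that assumed price, a coin flip is not good enough: a bettor has to win a bit more than 52% of the time just to cover the bookmaker’s cut.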

In this experiment we showed there is no easy button for beating college basketball spreads.  We can’t simply plug in kenpom efficiencies and hope to break the bookies in Vegas.  However, this won’t be the last we see of efficiencies.  We can break them down into the four factors and look at how teams’ effective field goal percentage, offensive rebounding, turnovers, and free-throw rate match up against their opponents.  We will also tap into some machine learning approaches to dig deeper into understanding how to beat the college basketball spread.  More to come.

 

College Basketball Databases and APIs

In order to find competitive advantages to beat spreads and totals, we are going to need to either watch hundreds of games or find another way to understand how teams play, and how they play against each other.  The way in this case is to find some data.  That is easier said than done, as anyone who has tried can attest.  While the big 3 professional sports have a wide variety of good options both paid and free, men’s college basketball is a lot tougher to home in on.  Let’s take a look at what is out there.

Stats.com

Maybe the holy grail of sports statistics.  However, it comes at a cost.  How much, you ask?  Well, if you have to ask you probably can’t afford it.  From what I can ascertain from Redditors who have called to inquire, the price is well into the 5 figures.  If you have that kind of money lying around you may already be an uber-successful sports bettor and likely already have the database or API you need.  The fact that prices aren’t listed on the website is probably a large enough red flag that this is going to be prohibitive to any hobbyist or starting-out sports bettor.

Sportsradar.com

They provide an API-based approach to query the data you need.  They subdivide their packages by sport so you can buy only the data you need.  They offer a vast array of data, including box scores, player profiles, and game summaries.  Props to them for being upfront about their pricing, but the price will be the biggest barrier to entry.  The cost for the most basic API is $950 a month for college basketball data.  Alternatively, the cost for historic data feeds is around $3,000.  They do offer a demo program to get your feet wet with modified (not-real) data so you can write code against their API and not pay until you are ready.  Unfortunately the cost is still a little prohibitive for most casual bettors.

Hoop-Math.com

Finally, a resource with a very reasonable $15 a year fee.  They provide some nice breakdowns of both team and individual statistics.  What is intriguing to me is that they have a way to measure the percentage of shots at the rim vs 2 pt. jump shots, something not readily available in typical box scores.  It’s unclear how this is determined or if these are estimates based off some other data, but it is interesting nonetheless.  This is a source that seems worth exploring further in a future blog post.

FantasyData.com

This site seems to be the cheapest of the APIs I have found.  They provide a free trial to develop your code until you are ready to upgrade to the $499 monthly fee.  Fortunately they only charge this during active months of the season, with other months billed at $79 each (although you can probably cancel and renew again the next year).  Definitely worth exploring if you have the money, but I imagine most do not or will not throw $3,000 a year at this, so let’s look at what’s left.

Kenpom.com

A great site for up-to-date rankings of teams and their efficiencies.  It is not a feed or API-based service; it’s just HTML pages that you can sort by various statistics, some adjusted for the competition each team faces.  At $20 a year it’s very affordable and provides in-depth detail derived from box scores since the 2002/2003 season.  The one caveat is that the data is not necessarily static.  Efficiencies get adjusted in real time, and occasionally prior seasons’ data can change when the algorithms are updated to better predict efficiency.  Not a major deal, just something to be aware of.

Web Scraping

The “free” way of obtaining data.  Is it legal?  Is it not?  I am not a lawyer, and I offer no advice or recommendation other than recognizing it’s an option that is readily available.  For programmers, cURL, R, or a variety of other programming/scripting languages can be used.  For those less technically inclined, Microsoft Excel provides a “Web Query” operation (available in the Data tab) that can automatically draw refreshable data from various HTML tables found on the web.  We may go into more depth on these various options later on, particularly R, as it seems to be the way of the future as far as statistics go and has a lot of packages that can parse data out of HTML.

In the meantime, if you choose to scrape the web, read the terms of service and be respectful.  If sites get hit too often you will likely get your IP blocked, if not worse.  More than likely, if you make some queries with a reasonable wait between requests in a semi-random pattern, nobody will blink an eye.  If you take someone’s server down, that’s another story.
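As a rough illustration of “a reasonable wait between requests, in a semi-random pattern”, here is a minimal Python sketch using only the standard library.  The wait times, the User-Agent string, and the URLs are placeholder assumptions; what counts as reasonable depends on the site and its terms of service.

```python
import random
import time
import urllib.request


def fetch(url, timeout=30):
    """Download one page as text, identifying the script in the User-Agent."""
    request = urllib.request.Request(
        url, headers={"User-Agent": "hobby-stats-script"}  # placeholder value
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def fetch_politely(urls, fetch_page=fetch, min_wait=5.0, max_wait=15.0,
                   sleep=time.sleep):
    """Fetch pages one at a time with a random pause between requests."""
    pages = {}
    for i, url in enumerate(urls):
        if i > 0:
            sleep(random.uniform(min_wait, max_wait))  # semi-random wait
        pages[url] = fetch_page(url)
    return pages


# Offline demo: a stub fetcher and zero wait, so nothing is downloaded.
demo = fetch_politely(
    ["https://example.com/a", "https://example.com/b"],
    fetch_page=lambda url: "<html>stub</html>",
    min_wait=0.0,
    max_wait=0.0,
)
print(len(demo))  # 2
```

For real use, drop the fetch_page, min_wait, and max_wait overrides so the default fetcher and waits apply.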

The most notable site you will likely find is ncaa.org, which keeps very detailed statistics for all divisions of college basketball.  However, the way the data is organized means you will have to make a lot of requests to get a season’s worth of data.  ESPN is another alternative with a similar data structure.

Some of these options are more viable than others, but at the end of the day it comes down to weighing price vs technical ability vs who has the data you want and what you are willing to put in.  At the very least I recommend checking out the rankings at kenpom.com and hoop-math.com.  Combined they are a $35 a year investment and can provide some great insight until you are ready to invest more.