
Projected # of NCAA Tourney Teams

Posted: Tue Nov 25, 2014 12:49 pm
by Hall2012
So since I'm currently studying analytics, I decided to have some fun and create a multiple regression model to project the number of at-large bids a conference will get for the NCAA Tournament.

Based on the model, if non-conference play ended today, the Big East would be projected to get 9 at-large NCAA Bids! (9.319939757 to be exact)

Unfortunately, non-conference play does not end today and that number pretty much hinges on us keeping up a .9394 winning % as a conference. But it still shows just how good we've been as a whole to this point.

For any stat-heads out there, I've got some more info on my results below. Feel free to critique and offer any suggestions to try to improve on the accuracy.
____________________________________________________________________________________________________________________________________

My result was this simple (or not so simple) equation:
Teams in Tourney = 49.83465(win%^2) - 62.174(win%) + 50.88886(NonConRPI) + 0.062851(#Teams^2) - 1.23931(#Teams)

Win%= conference winning percentage in non-conference play
NonConRPI= Conference RPI excluding conference play (from CBS RPI)
#Teams= # of Teams in the conference
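
For anyone who wants to plug numbers in, here is a minimal Python sketch of the equation above. The function name is made up for illustration. The .9394 win% comes from this post, but the NonConRPI of 0.5867 and the 10-team conference size are assumptions, picked only because they land close to the 9.32 figure quoted earlier.

```python
def projected_at_large_bids(win_pct, non_con_rpi, num_teams):
    """Evaluate the fitted equation from the post (it has no intercept term)."""
    return (
        49.83465 * win_pct ** 2
        - 62.174 * win_pct
        + 50.88886 * non_con_rpi
        + 0.062851 * num_teams ** 2
        - 1.23931 * num_teams
    )


# win_pct is from the post; non_con_rpi and num_teams are assumed values.
estimate = projected_at_large_bids(win_pct=0.9394, non_con_rpi=0.5867, num_teams=10)
print(round(estimate, 2))
```

Because win% and #Teams both enter as squared terms, small changes in either input can move the projection quite a bit, which is worth keeping in mind when reading the result.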

I used data from every conference with at least 1 at-large bid over the past 5 years, and in only 2 instances did the model's projection for a conference miss by more than 1 bid (rounding to integers).

R-Square: .95151
F value: 12.2907
F significance: 159E-30
MAE: .73466
MSE: .53973
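
As a rough sketch of how error figures like these can be computed, the Python below takes a list of actual bid counts and a list of model projections. The data in it is made up purely for illustration (it is not my dataset), and the R-square here uses the usual centered definition, which is an assumption about how the number above was produced.

```python
def fit_metrics(actual, predicted):
    """Return MAE, MSE, centered R-square, and the count of misses by more than 1 bid."""
    residuals = [a - p for a, p in zip(actual, predicted)]
    n = len(residuals)
    mae = sum(abs(r) for r in residuals) / n
    mse = sum(r * r for r in residuals) / n
    mean_actual = sum(actual) / n
    ss_res = sum(r * r for r in residuals)
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    # A "big miss" compares the actual bid count to the projection rounded to an integer.
    big_misses = sum(1 for a, p in zip(actual, predicted) if abs(a - round(p)) > 1)
    return {"mae": mae, "mse": mse, "r_square": 1 - ss_res / ss_tot, "big_misses": big_misses}


# Hypothetical example values, not taken from the real conference data.
actual_bids = [6, 4, 2, 1, 3, 5]
projected_bids = [5.6, 4.9, 1.2, 1.4, 3.1, 3.2]
metrics = fit_metrics(actual_bids, projected_bids)
print(metrics)
```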

Some sources of error include likely at-large bid candidates reaching the tournament by way of automatic bids, the change in the number of available at-large bids due to the increase in field size to 68, bubble teams playing their way out of the tournament during conference play, and consideration for the human element in the actual selection process.

Re: Projected # of NCAA Tourney Teams

Posted: Tue Nov 25, 2014 1:36 pm
by Doge McDermott
Congrats, your post convinced me to stop lurking and sign up.

I guarantee you overfit your model. That rsquare is way too high. What kind of regression did you use? And more importantly, what software did you use?

EDIT - Also, why did you exclude conferences that didn't get a bid?

Re: Projected # of NCAA Tourney Teams

Posted: Tue Nov 25, 2014 2:07 pm
by redmen9194
I was told there would be no math...

Re: Projected # of NCAA Tourney Teams

Posted: Tue Nov 25, 2014 2:38 pm
by HoosierPal
redmen9194 wrote:I was told there would be no math...


Best post of the day!! :lol:

Re: Projected # of NCAA Tourney Teams

Posted: Tue Nov 25, 2014 2:41 pm
by Chalmers0
Doge McDermott wrote:Congrats, your post convinced me to stop lurking and sign up.

I guarantee you overfit your model. That rsquare is way too high. What kind of regression did you use? And more importantly, what software did you use?

EDIT - Also, why did you exclude conferences that didn't get a bid?


Was going to say the same thing regarding the overfit and rsquare.

Also very curious to hear the answer to the regression type/software used.

Re: Projected # of NCAA Tourney Teams

Posted: Tue Nov 25, 2014 3:13 pm
by Hall2012
Doge McDermott wrote:Congrats, your post convinced me to stop lurking and sign up.

I guarantee you overfit your model. That rsquare is way too high. What kind of regression did you use? And more importantly, what software did you use?

EDIT - Also, why did you exclude conferences that didn't get a bid?



No fancy tools, I just did this quickly using multiple linear regression (admittedly not the best for predictive analytics) in Excel. The only other capable software I have is Minitab 17, which I'm not the biggest fan of; maybe I just need to get more familiar with it. It's far from perfect: my mean absolute residual is 0.735, so there's a lot of +/-1, but it splits pretty evenly to make the average residual -0.00035. Hence the extremely high rsquare, I guess?

I'm pretty new at this, but just wanted to play around and see what I got.

As far as excluding the 0 at large bid leagues, it was a mix of time/laziness and thinking that adding 20 zero bid leagues (which are generally the same leagues every year) each year regardless of win%, etc. wouldn't add much to my model (I could definitely be wrong about that). Mainly, one of my factors is conference size and with all of the realignment going on, I didn't feel like counting the number of teams in each conference each year (about an additional 20 conferences a year) when I didn't think it would add much.

Re: Projected # of NCAA Tourney Teams

Posted: Tue Nov 25, 2014 3:17 pm
by FormulaX
Image

Re: Projected # of NCAA Tourney Teams

Posted: Tue Nov 25, 2014 4:01 pm
by Doge McDermott
Hall2012 wrote:No fancy tools, I just did this quickly using multiple linear regression (admittedly not the best for predictive analytics) in Excel. The only other capable software I have is Minitab 17, which I'm not the biggest fan of; maybe I just need to get more familiar with it. It's far from perfect: my mean absolute residual is 0.735, so there's a lot of +/-1, but it splits pretty evenly to make the average residual -0.00035. Hence the extremely high rsquare, I guess?

I'm pretty new at this, but just wanted to play around and see what I got.

As far as excluding the 0 at large bid leagues, it was a mix of time/laziness and thinking that adding 20 zero bid leagues (which are generally the same leagues every year) each year regardless of win%, etc. wouldn't add much to my model (I could definitely be wrong about that). Mainly, one of my factors is conference size and with all of the realignment going on, I didn't feel like counting the number of teams in each conference each year (about an additional 20 conferences a year) when I didn't think it would add much.


Do yourself a favor and download R. I linked RStudio, which is probably the easiest interface for learning R. R is free, it's fairly easy to learn, and it's what I use in grad school.

I don't want to get too deep into the weeds, but try running a cross-validated model or a lasso regression to fix your overfit.

What would happen if you created flags for the auto-bid teams? I imagine there should be some correlation between number of wins and the auto-bid.

Finally, there might be some value to including the 0-bid conferences. It's better to have all information as opposed to just the stuff that confirms your hypothesis. I get it though. Cleaning data sucks.

Granted, all of this advice is made without looking at any of the data. Sounds like an interesting project.
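
To make the cross-validation suggestion above a bit more concrete, here is a minimal k-fold sketch. It is in plain Python rather than R so that it runs with nothing installed. All of the data is synthetic and the helper names are made up for illustration; none of it is Hall2012's dataset or a library API. The idea is to fit on part of the data, score on the held-out part, and compare that error to the in-sample MAE.

```python
import random


def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_ols(rows, targets):
    """Least-squares coefficients from the normal equations (X'X) b = X'y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, targets)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def predict(coefs, row):
    return sum(c * v for c, v in zip(coefs, row))


def cross_validated_mae(rows, targets, folds=5, seed=0):
    """Average absolute error on held-out folds; each row is held out exactly once."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    abs_errors = []
    for f in range(folds):
        test_idx = set(indices[f::folds])
        train = [i for i in indices if i not in test_idx]
        coefs = fit_ols([rows[i] for i in train], [targets[i] for i in train])
        abs_errors += [abs(targets[i] - predict(coefs, rows[i])) for i in test_idx]
    return sum(abs_errors) / len(abs_errors)


# Synthetic stand-in data: a constant column plus two made-up predictors.
rng = random.Random(1)
rows = [[1.0, rng.uniform(0.3, 0.9), rng.uniform(0.4, 0.6)] for _ in range(30)]
targets = [-2.0 + 5.0 * r[1] + 4.0 * r[2] + rng.gauss(0, 0.5) for r in rows]
print(cross_validated_mae(rows, targets))
```

If the held-out MAE comes out much worse than the in-sample MAE, that would point toward overfit. If the two are close, the high rsquare may be less of a worry than I assumed.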

Re: Projected # of NCAA Tourney Teams

Posted: Tue Nov 25, 2014 5:34 pm
by DudeAnon
I remember having to do a project my senior year where I had to write a program that could be used in a business situation. I waited till the last minute, then decided to analyze international aid based on a bunch of different parameters. I had taken 2 statistics classes (very hard stuff), so I knew the basics. I threw up a bunch of regression tests that were completely made up and none were the wiser.

Re: Projected # of NCAA Tourney Teams

Posted: Tue Nov 25, 2014 5:49 pm
by Hall2012
Doge McDermott wrote:
Hall2012 wrote:No fancy tools, I just did this quickly using multiple linear regression (admittedly not the best for predictive analytics) in Excel. The only other capable software I have is Minitab 17, which I'm not the biggest fan of; maybe I just need to get more familiar with it. It's far from perfect: my mean absolute residual is 0.735, so there's a lot of +/-1, but it splits pretty evenly to make the average residual -0.00035. Hence the extremely high rsquare, I guess?

I'm pretty new at this, but just wanted to play around and see what I got.

As far as excluding the 0 at large bid leagues, it was a mix of time/laziness and thinking that adding 20 zero bid leagues (which are generally the same leagues every year) each year regardless of win%, etc. wouldn't add much to my model (I could definitely be wrong about that). Mainly, one of my factors is conference size and with all of the realignment going on, I didn't feel like counting the number of teams in each conference each year (about an additional 20 conferences a year) when I didn't think it would add much.


Do yourself a favor and download R. I linked RStudio, which is probably the easiest interface for learning R. R is free, it's fairly easy to learn, and it's what I use in grad school.

I don't want to get too deep into the weeds, but try running a cross-validated model or a lasso regression to fix your overfit.

What would happen if you created flags for the auto-bid teams? I imagine there should be some correlation between number of wins and the auto-bid.

Finally, there might be some value to including the 0-bid conferences. It's better to have all information as opposed to just the stuff that confirms your hypothesis. I get it though. Cleaning data sucks.

Granted, all of this advice is made without looking at any of the data. Sounds like an interesting project.


Thanks for the advice. Is this something you do for your career? I'm in my first semester in a Business Analytics grad program, so I'm still pretty inexperienced with this, but I was really playing around with these numbers to see what I could learn about them. Any advice/constructive criticism is very much appreciated!