Saturday, August 28, 2021

2021 SEASON: BACKGROUND FOR UPCOMING REPORTS

Introduction

I will post reports over the course of the season related to team RPI ratings and ranks and NCAA Tournament prospects.  For the first part of the season, I will base the reports on simulated team ratings and ranks as applied to the full season schedule in order to produce (1) simulated game results for the entire season and (2) accompanying simulated NCAA Tournament at large selections and seeds.  Each week as teams play games, I will replace simulated game results with actual game results.  When team actual RPI ratings become reliable enough for the NCAA to start publishing RPI ranks, I will switch over to using team actual RPI ratings as the basis for simulating results of games not yet played.  Also, at or around that time, I will publish additional information showing which teams are possible contenders for NCAA Tournament at large positions and seeds even though my current simulations show them as not getting at large selections or seeds.

In this quite long post, I will provide detailed background information to explain how I develop the simulation, how to use the information in the upcoming reports, and the information's limitations.  If you have trouble understanding what I have posted, have questions about it, or have suggestions about it, you can email me at cpthomas@q.com.  Especially for coaches who monitor my simulations as we go through the season as a help in evaluating their teams’ prospects, I strongly recommend reviewing what I have written below.

The specific topics I will cover below are:

Simulated Ratings and Ranks

Simulated Game Results

Simulated NCAA Tournament At Large Selections and Seeds

Data, by Team

What If My Top 50 Results Are Different Than the Simulation Calls For?

Refined Likelihood of Getting an At Large Selection

Simulated Ratings and Ranks

My database covers all regular season and conference tournament games played since 2007.  Since I focus on the information available to the Women’s Soccer Committee when it makes its NCAA Tournament decisions on at large selections and seeds -- in other words, the information available to the Committee before the Tournament -- my database does not include NCAA Tournament games.

In addition, although I have the data for the 2020-21 season (which ended in Spring 2021), I do not include it in the database I use for simulated ratings and ranks and for simulated NCAA Tournament at large selections and seeds.  I exclude it because I consider it unreliable for those purposes: teams as a whole did not play enough games and did not play enough non-conference and out-of-region games.  (For more on this, see my earlier posts about the 2020-21 season.)  This means that if a team, relative to its past trends, did unusually well (e.g., Rice) or poorly (e.g., Stanford) during the 2020-21 season, that will not show up in my simulated ratings and ranks and simulated NCAA Tournament bracket.  This is unfortunate, but there is no practical way around it.  As I replace simulated game results with actual results and then shift over to using actual RPIs as the basis for simulating results of games not yet played, this problem gradually will disappear.  I also will explain some things that a coach with a team in this situation can do, during the season, to get a picture of where his or her team will stand if it has results more similar to the team’s 2020-21 performance.

To simulate team 2021 ratings and ranks, I look at team rating and rank trends over the period from 2007 through 2019.  I use Kenneth Massey ranking trends rather than RPI ranking trends, as his rankings do a better job of ranking teams from stronger and weaker conferences within a single ranking system and otherwise are at least as good as the RPI.  I use computer generated trend lines to see what a team's rank will be next year if its ranking trend continues.  For each team, I consider a straight-line trend and a polynomial order 2 trend.  A straight-line trend best represents a team that is going in a steady direction: getting better, getting poorer, or staying the same, all at about the same rate over time.  A polynomial order 2 trend best represents a team that has a baseline position where it usually resides, has had a period where it performed better or poorer than the baseline, and now appears to be returning to the baseline.  In addition, where a head coach has not been with a team since 2007 but has been there at least 4 years, I consider the same two trends but only for the period from 1 year before the coach arrived through 2019.  Once I have the trend lines, I have rules I follow to select a simulated rank for the team:

I must assign a rank based on either the straight-line trend or the polynomial trend, or I can assign the same rank the team had in the most recent database year, which for this year is 2019.  If the 2019 Massey rank is from 1 to 30, I determine the straight-line trended rank for the current year using a computer-generated straight-line trend formula and the polynomial trended rank using a computer-generated polynomial trend formula.  If either rank is within 5 positions of the 2019 Massey rank, then that will be my simulated Massey rank for the team.  If both are within 5, then the one closest to the 2019 Massey rank will be the simulated Massey rank.  If neither is within 5, then the 2019 rank will be the simulated Massey rank.  This process is the same for teams with 2019 Massey ranks from 31 to 60, except that the rank difference I use is 10 rather than 5.  For teams from 61 to 100 the rank difference is 15.  For teams over 100, the rank difference is 20.
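As a rough sketch, the rank-selection rules above can be expressed in code.  This is only an illustration, not my actual program; the function names are arbitrary, and it assumes the two trended ranks have already been computed from the trend formulas.

```python
def rank_tolerance(rank_2019):
    """Tolerance band, which depends on where the team ranked in 2019."""
    if rank_2019 <= 30:
        return 5
    if rank_2019 <= 60:
        return 10
    if rank_2019 <= 100:
        return 15
    return 20

def simulated_rank(rank_2019, straight_rank, poly_rank):
    """Pick the simulated Massey rank per the rules described above."""
    tol = rank_tolerance(rank_2019)
    # Keep only the trended ranks that land within the tolerance band.
    candidates = [r for r in (straight_rank, poly_rank)
                  if abs(r - rank_2019) <= tol]
    if not candidates:
        return rank_2019  # neither trend is close enough; keep the 2019 rank
    # If both qualify, use the one closest to the 2019 rank.
    return min(candidates, key=lambda r: abs(r - rank_2019))
```

For example, a team ranked #20 in 2019 with a straight-line trended rank of #24 and a polynomial trended rank of #35 would get a simulated rank of #24, since only the straight-line rank falls within the 5-position band.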

For teams in the coach-arrived-after-2007 group, I have additional straight line and polynomial trends to consider and use these trends if they produce simulated ranks closer to the 2019 ranks than the full-period trends, again subject to the 5, 10, 15, and 20 position rules of the preceding paragraph.

Once I have all teams’ simulated 2021 Massey ranks, I must come up with all teams’ simulated RPI ranks.  To do this, I put the teams in the order of their simulated Massey Ranks from best to poorest and assign RPI ranks from #1 to #342 accordingly.  I then assign each team, as its simulated 2021 RPI rating, the average rating since 2007 of teams that have had that team's simulated 2021 RPI rank.  (For teams new to Division I in 2020 and 2021, I have other rules I follow to assign them ranks and ratings.)

I follow the above process with no exceptions.

Simulated Game Results

To simulate the result of a game, I start with the two teams' simulated RPI ratings and calculate the difference between them.  I then adjust this difference if one of the teams has home field advantage.  Home field, on average across all Division I teams, is worth 0.0148 in relation to the difference between two opponents' RPI ratings.

Once I have computed a game's location-adjusted RPI difference, if it is 0.0133 or less, then I assign the game a simulated game result of a tie.  I do this because at that rating difference level, each team's likelihood of winning the game is only 50% or less.  If the location-adjusted rating difference is greater than 0.0133, then one of the teams has a greater than 50% chance of winning and I assign that team a simulated game result of a win and the other team a loss.
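As a rough sketch, the game-result rule above can be expressed as follows.  This is illustrative code only; the names are arbitrary, and treating a positive adjusted difference as favoring team A is simply a representational choice.

```python
HOME_EDGE = 0.0148  # average value of home field, in RPI rating terms
TIE_BAND = 0.0133   # differences at or below this simulate as a tie

def simulate_game(rating_a, rating_b, home=None):
    """Return 'A', 'B', or 'tie' for a game between teams A and B.

    home: 'A', 'B', or None for a neutral site.
    """
    diff = rating_a - rating_b
    if home == 'A':
        diff += HOME_EDGE
    elif home == 'B':
        diff -= HOME_EDGE
    if abs(diff) <= TIE_BAND:
        return 'tie'  # neither team's win likelihood exceeds 50%
    return 'A' if diff > 0 else 'B'
```

Note how home field can flip a result: a 0.0100 rating edge simulates as a tie at a neutral site, but becomes a 0.0248 location-adjusted difference, and thus a win, when the stronger team is at home.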

There are limitations to this way of assigning simulated game results.  One limitation is that in real life, there will be fewer ties than the simulation produces.  The other limitation is that in real life, one does not expect a team to win all games in which its likelihood of winning is greater than 50%.  For example, suppose a team plays 10 games with a location-adjusted rating difference in each game that gives it a 70% win likelihood.  In real life, one would not expect the team to win all 10 games, but rather would expect it to win only 7.  Nevertheless, the simulation has it winning all 10 games, since this is the only way to do the simulation in a manner that produces simulated end-of-season RPI ratings and ranks.

Simulated NCAA Tournament At Large Selections and Seeds

I have a complex program I use at the end of the season to generate likely Committee at large selections and seeds based on the Committee's decisions since 2007.  For purposes of simulating the selections and seeds over the course of the season, however, this year I am going to use a different method that I believe will be more useful for coaches.  It also is simpler.  The method focuses on four factors:

Team RPI Ratings

Team RPI Ranks

Team Top 50 Results Scores

Team Top 50 Results Ranks 

I developed the Top 50 Results Scores factor about 10 years ago based on my observation that in making at large selections, the Committee seemed to favor teams that had good results -- wins or ties -- against highly ranked opponents, with the Committee decisions heavily slanted towards very good results.  The scoring system I developed is in the following table:


A team's Top 50 Results Score is the accumulated points a team has gained under this table over the course of the season.  Once I have all of these scores, I rank the Top 60 RPI teams in the order of their Top 50 Results Scores in order to get their Top 50 Results Ranks.  I do the Top 50 Results Ranks for only the Top 60 RPI teams because historically, the Committee never has given an at large selection (or a seed) to a team outside the RPI Top 60.

Automatic Qualifiers.  There are 31 Division I conferences, with the conference champion of each entitled to an Automatic Qualifier position in the NCAA Tournament.  The conference regular season champion of three of these is its Automatic Qualifier (the Ivy, Pac 12, and West Coast).  The others have conference tournaments to determine their Automatic Qualifiers.  My full season simulation includes simulated conference tournaments for these conferences.  My program generates conference standings based on the season's simulated game results and then generates conference tournament brackets accordingly.  (I have done my best to figure out what each conference's tournament bracket format will be this year.  It is possible that some of my bracket formats are not exactly what the conference will do.  I will make corrections as information about the bracket formats becomes available.)  My program also generates simulated conference tournament game results, using the process described above.  By doing this, the program identifies a simulated Automatic Qualifier for each conference.

At Large Selections.  After identifying the simulated Automatic Qualifiers, I next identify simulated At Large Selections.  Since the bracket consists of 64 teams of which 31 are Automatic Qualifiers, that leaves 33 At Large Selection spots for the Committee to fill.

Following the 2019 season, I did a study based on the 2007 through 2019 seasons to determine the importance of each of a number of factors in the Committee's at large selection and seeding decision-making process.  Some of the factors were individual factors (such as RPI rating) and others were paired factors (such as RPI Rank and Top 50 Results Rank paired together, weighted by formula at 50% each).

From 2007 through 2019, the Committee made 435 at large selections.  My study showed that the paired RPI Rank and Top 50 Results Rank factor is the most powerful indicator of the Committee's decisions.  Indeed, if the Committee simply had used that paired factor as the basis for its decisions, 408 of its at large selections would have been the same as the at large selections it actually made -- in other words, all but about 2 per year on average.  This strongly reinforces what I advise coaches when it comes to scheduling:  One must schedule with two objectives in mind -- having a good RPI rank and having good results against highly ranked opponents.

Based on this, for the simulated at large selections I do over the course of the season, I simply select, from the Top 60 RPI teams that are not automatic qualifiers, the 33 teams that have the best scores for the paired RPI Rank and Top 50 Results Rank factor.  (The formula for this factor is: RPI Rank + (1.0261 x Top 50 Results Rank).)
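As an illustration, the at large selection step can be sketched as follows.  The function names and the tuple representation of teams are arbitrary choices for this sketch; a lower paired factor score is better.

```python
def paired_factor(rpi_rank, top50_rank):
    """Paired RPI Rank and Top 50 Results Rank factor (lower is better)."""
    return rpi_rank + 1.0261 * top50_rank

def at_large_selections(teams, auto_qualifiers, spots=33):
    """teams: (name, rpi_rank, top50_rank) tuples for the RPI Top 60.

    Returns the `spots` non-Automatic-Qualifier teams with the best
    (lowest) paired factor scores.
    """
    eligible = [t for t in teams if t[0] not in auto_qualifiers]
    eligible.sort(key=lambda t: paired_factor(t[1], t[2]))
    return [name for name, _, _ in eligible[:spots]]
```

Because the Top 50 Results Rank carries a weight of 1.0261, the two ranks count almost exactly equally, with results against highly ranked opponents counting very slightly more.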

#1 Seeds.  For #1 seeds, the most effective single factor is teams' RPI Ratings.  Over the period from 2007 through 2019, the Committee selected 52 #1 seeds, of which all but 6 were consistent with teams' RPI Ratings.  This factor thus misses about 1 #1 seed every other year.  It is what I will use for simulating #1 seeds.

#2 Seeds.  For #2 seeds, I will use teams' RPI Ranks.  Over the study period, for the 104 #1 and #2 seeds bunched together, this factor was consistent with all but 11 of them, thus missing about 1 per year.  Thus once I have identified the #1 seeds, my simulated #2 seeds will be the four highest RPI Rank teams that did not get #1 seeds.

There is a paired factor that does slightly better than RPI Ranks -- RPI Rating paired with Common Opponents Results Rank (using a scoring system I developed) -- but its calculation is complex and it will be easier to use RPI Ranks over the course of the season.

#3 Seeds.  For #3 seeds, I likewise will use teams' RPI Ranks.  Over the study period, for the 156 #1 through #3 seeds, this factor was consistent with all but 24 of them, thus missing about 2 per year.  Thus once I have identified the #1 and #2 seeds, my simulated #3 seeds will be the four highest RPI Rank teams that did not get #1 or #2 seeds.

#4 Seeds.  For #4 seeds, I again will use teams' RPI Ranks.  Over the study period, for the 208 #1 through #4 seeds, this factor was consistent with all but 28 of them, thus missing about 2 per year.  Thus once I have identified the #1 through #3 seeds, my simulated #4 seeds will be the four highest RPI Rank teams that did not get #1 through #3 seeds.

You probably have noticed that, to say it differently, my simulated seeds over the course of the season will follow the RPI rankings for the Top 16 teams.  This is consistent with RPI Ratings and Ranks being the best single predictor of seeds.
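As a sketch, the seeding step therefore reduces to slicing the RPI Top 16 into four seed lines (illustrative code only; the function name is arbitrary):

```python
def simulated_seeds(teams_by_rpi_rank):
    """teams_by_rpi_rank: team names ordered by RPI rank, best first.

    Returns a dict mapping each seed line (1 through 4) to its four teams.
    """
    return {line: teams_by_rpi_rank[(line - 1) * 4 : line * 4]
            for line in (1, 2, 3, 4)}
```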

Data, by Team

My weekly reports will include the following information.  In the reports here, I will provide the information for the Top 100 teams.  In addition, I will provide a Google Docs link to a spreadsheet providing the information for all teams.

Simulated Record:  Wins, Losses, Ties

Simulated RPI Element 1:  Winning Percentage -- (Wins + 0.5 x Ties)/(Wins + Losses + Ties)

Simulated RPI Element 2:  Average of Opponents' Winning Percentages

Simulated RPI Element 3:  Average of Opponents' Opponents' Winning Percentages

Simulated Adjusted RPI:  (Element 1 + 2 x Element 2 + Element 3)/4, adjusted with NCAA bonuses and penalties for certain good or poor non-conference results

Simulated Adjusted RPI Rank

Simulated Total Top 50 Results Score

Simulated Total Top 50 Results Rank

Simulated paired RPI Rank and Top 50 Results Rank factor score

Simulated paired RPI Rank and Top 50 Results Rank factor rank

Simulated paired RPI Rank and Top 50 Results Score factor score

The next to last of these is the paired factor I use to simulate At Large Selections.
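For readers who want to check the arithmetic, the RPI formulas above can be expressed as follows (illustrative code; the NCAA bonus and penalty adjustments are omitted, so this produces the unadjusted RPI):

```python
def rpi_element_1(wins, losses, ties):
    """Element 1: the team's own winning percentage."""
    return (wins + 0.5 * ties) / (wins + losses + ties)

def rpi(element_1, element_2, element_3):
    """Unadjusted RPI, weighting opponents' winning percentage double."""
    return (element_1 + 2 * element_2 + element_3) / 4
```

For example, a 10-5-5 team has an Element 1 of 0.625; if its opponents' and opponents' opponents' average winning percentages both are 0.500, its unadjusted RPI is (0.625 + 2 x 0.500 + 0.500)/4 = 0.53125.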

What If My Top 50 Results Are Different Than the Simulation Calls For?

If a coach wants to see how his or her team's NCAA Tournament prospects will change if the team has different Top 50 Results than the simulation calls for, there is a way for the coach to do it.  This will involve using the paired RPI Rank and Top 50 Results Score factor score, which is the last item on the preceding list.

Here are instructions for how to do this:

1.  Determine the Proposed Top 50 Results Changes.

a.  With the team schedule in hand, use the simulated RPI numbers to see what game result the simulation has called for in each game on the schedule.  This will produce the overall win-loss-tie record indicated by the weekly report.

b.  Decide what different results against Top 50 opponents you want to try out.

2.  Determine What Your Revised RPI and RPI Rank Will Be.

a.  Based on the different Top 50 results you decided on, determine what your resulting win-loss-tie record will be.  With this revised record, calculate your revised RPI Element 1.  The formula for this calculation is under Data, by Team, above.

b.  Determine what your revised RPI will be.  To do this, use your revised RPI Element 1 and your unchanged Elements 2 and 3 (which will not change significantly even though your game results have changed).  The formula for this calculation likewise is under Data, by Team, above.  Don't worry about the adjusted RPI bonuses and penalties; they are not significant enough to make a difference for purposes of this process.

c.  Look to see where your revised RPI will put you, in the overall RPI Rank list.

3.  Determine What Your Revised Top 50 Results Score and Rank Will Be.

a.  Use the above scoring system table to see what your team's revised Top 50 Results Score will be.

b.  Look to see where your revised Top 50 Results Score will put you, in the overall Top 50 Results Rank list.

4.  Calculate Your Revised Paired RPI Rank and Top 50 Results Rank Factor Score.

a.  The formula for this is RPI Rank + (1.0261 x Top 50 Results Rank).

5.  See Where Your New Revised Paired RPI Rank and Top 50 Results Rank Factor Score Will Put You, in the Overall List for That Paired Factor Score.

a.  If you are among the Top 33 that are not Automatic Qualifiers, this means you are likely to get an At Large Selection.
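As a sketch, steps 2 through 4 above can be combined into a single calculation.  This is illustrative code only; the rank_lookup argument stands in for step 2c, reading your revised rank off the overall RPI Rank list, which this sketch does not model.

```python
def what_if_paired_score(wins, losses, ties, element_2, element_3,
                         rank_lookup, revised_top50_rank):
    """Steps 2 through 4 of the what-if process, under the stated
    simplifications (unchanged Elements 2 and 3, no bonus/penalty
    adjustments)."""
    # Step 2a: revised Element 1 from the revised win-loss-tie record.
    e1 = (wins + 0.5 * ties) / (wins + losses + ties)
    # Step 2b: revised (unadjusted) RPI.
    revised_rpi = (e1 + 2 * element_2 + element_3) / 4
    # Step 2c: revised RPI rank, read off the overall RPI Rank list.
    revised_rpi_rank = rank_lookup(revised_rpi)
    # Step 4: revised paired RPI Rank and Top 50 Results Rank factor score.
    return revised_rpi_rank + 1.0261 * revised_top50_rank
```

The resulting score then goes into step 5: compare it against the overall list for the paired factor to see whether it places you among the Top 33 teams that are not Automatic Qualifiers.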

Refined Likelihood of Getting an At Large Selection

It also is possible to do a refined calculation of a team's likelihood of getting an at large selection:

1.  From a weekly report, find the team's paired RPI Rank and Top 50 Results Score factor score.

2.  With that paired factor score, use the following table to determine the team's approximate likelihood of getting an at large selection:

