Historical Schedule Strengths

Background

In my most recent post, I investigated what would be the best generally applicable “Strength of Schedule” metric. I settled on something I’m calling “Average Rank Difference,” which is just the sum of all opponent “ranks” minus the sum of all partner “ranks,” plus a few adjustments to normalize between events. What we decide to use for ranks could be anything, but it has to be something available before the event starts, so you can’t, for example, use each team’s seeding rank at the end of the event you are analyzing. Possible options include team number, winning record from previous events/years, previous district points, previous event ranks, Elo, OPR, etc… For the analysis in this article, I’m going to be using Elo, since I have it readily available and I think it’s generally better suited for questions like this than any of the alternatives.
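As a concrete illustration, here is a minimal Python sketch of how one team’s schedule strength could be computed from pre-event Elo ratings under this approach. The data layout, the rank-1-is-strongest ordering, and the final sign flip (so that higher values mean harder schedules, matching the workbook linked below) are illustrative assumptions rather than my exact adjustments.

```python
def schedule_strength(team_elos, schedule):
    """Hedged sketch of the "Average Rank Difference" schedule metric.

    team_elos: {team: pre-event Elo} for every team at the event
    schedule:  list of (partners, opponents) tuples for the team of interest
    Returns a value where (by assumption) higher = harder schedule, 0 = roughly neutral.
    """
    n = len(team_elos)
    # Convert pre-event Elos into "ranks": rank 1 = highest Elo.
    by_elo = sorted(team_elos, key=team_elos.get, reverse=True)
    rank = {t: i + 1 for i, t in enumerate(by_elo)}

    # Per match: sum of opponent ranks minus sum of partner ranks.
    diffs = [sum(rank[o] for o in opponents) - sum(rank[p] for p in partners)
             for partners, opponents in schedule]

    # Center (an average team nets roughly (n + 1) / 2 per match), scale by
    # event size, and flip the sign so strong opponents / weak partners
    # push the value up.
    return -(sum(diffs) / len(diffs) - (n + 1) / 2) / n
```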

Introduction

What I want to do now is take this metric and apply it to every schedule of the 3v3 era of FRC and see what we can learn. Normalizing across years/events is difficult, but I’ve tried my best to create a metric that works reasonably well in all cases. There are a few data points I’m filtering out, though. The first is teams/events that have an unreasonably low number of matches. The only event in this category is the 2010 Israel Regional, which had only 3-4 matches per team. I’m also throwing out two teams that had only one quals match at an event: 46 at 2009 Granite State and 3286 at 2014 Central Washington University. I suspect they didn’t actually compete at these events. The other set of schedules I’m throwing out is the 2015 schedules. The nature of Recycle Rush meant you hardly cared at all who your opponents were in quals (and maybe even preferred better opponents for the coop points). Strength of schedule was still very important that year, but it meant something completely different than in years with W/L games. One last caveat: surrogate data doesn’t exist prior to 2008, so I’m opting to treat surrogates just like normal teams in all years for a consistent comparison. With all of that out of the way, let’s get into the results!

Results

I’ve uploaded a book titled Historical_Schedule_Strengths here. It contains two sheets: one that shows individual schedule strengths (higher numbers mean harder schedules, with 0 being neutral), and one that shows event standard deviations in schedule strengths (lower standard deviation means the event’s schedules are more balanced). Play around with it, and let me know if anything looks horribly wrong.

Best/Worst Team Schedules

According to this methodology, here are the 10 hardest schedules of the 3v3 era:

[Image: 10 hardest schedules]

And here are the 10 easiest:

[Image: 10 easiest schedules]

All entries on these lists come from events with either a large number of teams or a low number of matches per team. This makes sense because it is easier to get great/awful schedules in these scenarios, but more on that later. I find it really interesting that both the best and the worst schedules according to this methodology come from MN events within the past few years. I can try to offer some perspective on them since I was at Lake Superior 2016 and know the teams at North Star 2017 pretty well. Here’s a breakdown of 5653’s partners and opponents at Lake Superior 2016:

[Image: 5653’s partners and opponents at Lake Superior 2016]

I don’t know if this is actually the worst schedule of all time, but I think it has to be the dumbest schedule I’ve ever personally seen. 4009, 2052, and 359 were a step above everyone at that competition, and 5653 had to face all of them in addition to a huge cohort of other really good teams who made the playoffs. Furthermore, their only partners that ended up making the playoffs were a low captain and a pair of second round picks. It should be noted that you play against 1.5 times as many teams as you partner with, but even so, 5653 had to play against 10X as many captains/first round picks as they got partnered up with.

On the other end of the spectrum, we have 3130 in 2017. 3130 was so good in 2017 that I didn’t bat an eye when I saw that they seeded first at this event, but let’s do the same analysis of their schedule:

[Image: 3130’s partners and opponents at North Star 2017]

This one looks amazing for 3130. They didn’t get paired with the 1 captain because they were the 1 captain, but they still got a whole bunch of really great teams to work with. Even among the teams that didn’t make playoffs, there was a noticeable difference in team quality in favor of their partners, as I recognized many of their other partners as being solid teams.

Most/Least Balanced Events

Likewise, here are the 10 least balanced events:

[Image: 10 least balanced events]

And here are the 10 most balanced events:

[Image: 10 most balanced events]

Unsurprisingly, the most balanced events all have a very high number of matches. The least balanced events all have either a low number of matches per team or a high number of teams. Here’s a graph which shows an event’s schedule “balance” versus the number of matches per team:

[Image: schedule balance versus matches per team]

What this indicates is that going from 7 to 12 matches gives your schedule about half as much variation on average. An exponential fit would probably make more sense for this graph if we had a bigger range, but a linear fit works just fine since the range is limited. Also, I wasn’t sure if I wanted to use stdev or variance here; I went with stdev even though I personally kind of like variance better.
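For what it’s worth, the linear fit above can be reproduced with a few lines of numpy; the arrays here are made-up stand-ins for the per-event values in the workbook, included only to show the mechanics.

```python
import numpy as np

# Hypothetical stand-ins: matches per team and schedule-strength stdev per event.
matches_per_team = np.array([7, 8, 9, 10, 11, 12])
schedule_stdev   = np.array([0.060, 0.056, 0.051, 0.047, 0.042, 0.038])

# Ordinary least-squares line: stdev ~ slope * matches + intercept.
slope, intercept = np.polyfit(matches_per_team, schedule_stdev, 1)
print(f"Fitted stdev at 7 matches:  {slope * 7 + intercept:.3f}")
print(f"Fitted stdev at 12 matches: {slope * 12 + intercept:.3f}")
```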

The other way I broke down the data was by year. Here is a graph showing the average event schedule stdev for each year:

[Image: average event schedule strength stdev by year]

So schedules generally have been getting better over time. I’m not sure what all the causes are, but my speculation is that the algorithm was worse pre-2008 (see the 2007 algorithm of death), and that since 2009 the trend has been toward smaller events with more matches per team, which generally increases overall schedule fairness as seen above. I’m also not sure if the number of attempted schedules created by the scorekeepers has remained constant since 2008; it’s possible that, with the better computing power available now, more possible schedules are generated at each event, which means better ones get selected.

Conclusions

This was a lot of fun to work on. “Strength of Schedule” is a term that gets tossed around quite a bit, but I think it means different things to different people. I enjoyed taking my best crack at quantifying it and applying it to every year. I also finally got around to fulfilling Travis’ wish 9 months later. Now the next step is to use this metric to create “balanced” schedules of some sort. So stay tuned for that.

 

May your dreams be filled with graphs,

Caleb

 

Update: Fixed the pick numbers on 5653’s schedule summary image

 

 

Finding the Best Strength of Schedule Metric

Background

A few months ago, I kicked off discussion about how to define Strength of Schedule for FRC, and introduced a new metric to take my best shot at quantifying it. Well after a long hiatus, I’m back at it looking at strength of schedule metrics. To summarize, what I am looking for is a metric which is forward-looking, year-independent, and mirrors as much as possible what we colloquially mean when we say a schedule is “good” or “bad”. I want these three properties for the following reasons:

  1. Forward-looking means that I want to be able to tell before the matches take place whether the schedule is good or bad. There are lots of easy backward-looking metrics we could use (that is, metrics which evaluate the schedule strength after the event based on observed performance at that event), but such metrics cannot be applied to judge a schedule right when it is released, which is the moment in time we most want to evaluate schedule strength. Furthermore, such metrics could not be used to generate “balanced” schedules, which is a long-term goal of mine.
  2. Year-independence means that the metric we use is broadly applicable in any FRC game, provided the general 3v3 structure remains the same. This is important because I don’t want to have to re-do all of this work every year; I want something that makes as few assumptions as possible.
  3. Matches our colloquial definition of schedule strength means that the metric has properties we would expect it to. For example, we expect that schedules with more matches will tend to be fairer than schedules with fewer matches. We also expect to be able to look at the “worst” schedules at a glance and recognize why they are so bad and vice versa for the “best” schedules. If we don’t have these properties, we are probably not measuring anything useful.

With this in mind, I have developed 7 candidate forward-looking metrics and will be sharing with you my analysis of which one(s) have the most general value moving forward, particularly for the future development of “balanced” schedules.

Candidate metrics

Here are the seven candidate metrics, as well as their descriptions. I just made up all of these names, so sorry if you don’t like them:

  • Caleb’s Strength of Schedule: A detailed description of this metric can be found in these two posts. However, I’ve made a slight change since then. Essentially, this metric is the probability that the given schedule will end up being better than a random schedule for your team according to my event simulator. The formula in my second link has been slightly modified to the following:

[Image: modified formula for “Caleb’s Strength of Schedule”]

This changes it so that the average schedule is now 50%, and teams who have the first seed locked now have strengths around 50% instead of 100%.

  • Expected Rank Change: This metric also uses my simulator, but instead of all the crazy math of the previous metric, it is simply the given team’s average rank using random schedules subtracted from their average rank using the given schedule, then divided by the number of teams at the event. In addition to its simplicity, a large advantage of this metric is that, since it uses my simulator, it factors in the bonus RPs, which none of the remaining metrics can do.
  • Average Elo Difference: This metric is found by taking, for each match, the sum of the opponent Elos minus the sum of the partner Elos, averaging that over all matches, and then subtracting the average event Elo (to allow comparison between events). Pretty straightforward.
  • Expected Wins Added: Again using Elo, this metric is the expected percentage of wins the schedule would add to an average team at the event. So a value of 0 indicates that an average team would be expected to win 50% of their matches, while a value of 0.4 (40%) would indicate that an average team is expected to win 90% of their matches.

The following 3 metrics are all found by sorting all of the teams entering the event by their Elo rating, and then only comparing these Elo “ranks”, and not actual Elo values.

  • Average Rank Difference: This metric is found by taking, for each match, the sum of the opponent ranks minus the sum of the partner ranks, averaging that over all matches, then subtracting ((# of teams + 1)/2), and finally dividing by the number of teams in order to compare between events.
  • Weighted Rank Difference: Very similar to the above metric, except this one weights partners more heavily than opponents: it is the sum of all opponent ranks minus (3/2)*(sum of partner ranks), divided by the number of teams at the event. The weighting is there because you will always have 1.5 times as many opponents as partners.
  • Winning Rank Matches: This metric is found by treating each match as a binary event: either it is a “winning” match or a “losing” match, depending on your partner and opponent ranks. If the sum of your partner ranks + ((# of teams + 1)/2) is lower than the sum of the opponent ranks, it is considered a winning match for an average team; otherwise it is a losing match. The number of winning matches is then divided by the number of matches played, and 0.5 is subtracted from the result in order to allow comparison between events. (A rough code sketch of these three rank-based metrics is shown below.)
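To make the rank-based definitions above concrete, here is a minimal Python sketch of all three. The schedule representation, the rank-1-is-strongest ordering, and the per-match averaging in the weighted variant are my own illustrative assumptions rather than the exact implementation; under this ordering, higher values correspond to easier schedules.

```python
def rank_metrics(ranks, schedule, n_teams):
    """Sketch of the three rank-based schedule metrics.

    ranks:    {team: pre-event Elo rank}, where rank 1 is the strongest team
    schedule: list of (partners, opponents) tuples for the team of interest
    n_teams:  number of teams at the event
    Returns (average_rank_diff, weighted_rank_diff, winning_rank_matches);
    under this rank ordering, higher values mean an easier schedule.
    """
    center = (n_teams + 1) / 2  # expected rank of an average team
    avg_diffs, weighted_diffs, wins = [], [], 0

    for partners, opponents in schedule:
        opp = sum(ranks[t] for t in opponents)
        par = sum(ranks[t] for t in partners)

        avg_diffs.append(opp - par)             # opponent ranks minus partner ranks
        weighted_diffs.append(opp - 1.5 * par)  # partners weighted by 3/2
        if par + center < opp:                  # alliance rank sum beats the opponents'
            wins += 1                           # counts as a "winning" match

    n_matches = len(schedule)
    average_rank_diff = (sum(avg_diffs) / n_matches - center) / n_teams
    weighted_rank_diff = (sum(weighted_diffs) / n_matches) / n_teams  # per-match average (assumption)
    winning_rank_matches = wins / n_matches - 0.5
    return average_rank_diff, weighted_rank_diff, winning_rank_matches
```

One nice property of writing it this way is that swapping in event ranks or District Points for the Elo ranks only requires changing the ranks dictionary.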

Results

I have uploaded a file to my Miscellaneous Statistics Projects paper titled “2018_schedule_strengths_v4” which contains these metrics for all FRC teams at all 2018 events. According to 5 out of 7 metrics, 2220 on Archimedes had the best 2018 schedule. There is less agreement between the metrics on the worst schedule, but 2096 on Hopper was in the top 10 worst schedules by all metrics. I posted a brief comparison of 2220’s and 2096’s partners and opponents here. It seems clear to me that we are on the right track based on this validation, as both schedules are clearly extremely “good” and “bad” respectively.

Determining the Best Metric

Now, with the results in hand, let’s determine which of these metrics are best to use going forward. All 7 are forward-looking, so we can’t winnow down any options based on that. They also all at least roughly meet our colloquial definition of “schedule strength” based on the simple validation of the best and worst schedules above. However, we can determine which metrics better meet this criterion by examining the correlations between them. Here is a chart showing the correlation coefficients between each of the metrics:

There is very clearly one of these which stands out from the others, and that is the “Expected Wins Added” metric. This metric is dominated by every other metric: its correlation with any third metric is always lower than the correlation between any alternative metric and that same third metric. This means that “Expected Wins Added” is probably not capturing the colloquial definition of “schedule strength” as well as the other metrics do. Note that its correlations are still well above zero, so this metric is not completely useless, just not as well suited as the alternatives for what we are looking for. In a similar way, “Winning Rank Matches” is clearly a step below all other options, so let’s throw out that metric as well. Removing these two gives us the new correlation chart:

There are no longer any obvious candidates to remove based on their correlations. What we see instead are 3 groups of metrics. Group 1 is “Caleb’s Strength of Schedule” and “Expected Rank Change”. These metrics are understandably very strongly correlated, since they are both direct outputs of my event simulator and factor in bonus RPs and other team attributes not found in the other Elo-based metrics. Group 2 contains “Average Rank Difference” and “Weighted Rank Difference”. These metrics are also understandably very correlated, since their opponent calculations are equivalent and their partner calculations differ only by a factor of 1.5. Group 3 contains “Average Elo Difference”, which has slight attributes of both Group 1 and Group 2, and thus has intermediate correlations with all of the other metrics.
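For anyone who wants to reproduce this kind of correlation table, it is a one-liner once the per-team metric values are in a table. A minimal sketch follows; the column names and the CSV export are placeholders I made up, not the actual headers in the uploaded file.

```python
import pandas as pd

def metric_correlations(df: pd.DataFrame, metric_cols: list[str]) -> pd.DataFrame:
    """Pairwise Pearson correlations between the candidate schedule metrics.

    Expects one row per (team, event) and one column per metric; the column
    names used here are placeholders for whatever the source file calls them.
    """
    return df[metric_cols].corr()

# Hypothetical usage, assuming the workbook has been exported to CSV:
# df = pd.read_csv("2018_schedule_strengths_v4.csv")
# print(metric_correlations(df, ["calebs_sos", "expected_rank_change",
#                                "avg_elo_diff", "avg_rank_diff",
#                                "weighted_rank_diff"]))
```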

So we can’t easily eliminate any of these based on criterion 3, but fortunately we can eliminate some based on criterion 2, year-independence. I personally think my simulator is an incredible tool for this work (but I’m biased :)); however, it is certainly not year-independent. There are a lot of 2018-specific features in it. So both “Caleb’s Strength of Schedule” and “Expected Rank Change” should be thrown out by criterion 2. But you might ask, “why include them at all if you were planning to throw them out from the start?” Well, for one, I wanted to see how they would compare to the others, because I still think they are excellent for finding schedule strengths in 2018, and I think the results have shown that. More importantly though, we can still look at their correlations with the other metrics even if we weren’t planning to use them.

In a similar vein, we should also throw out “Average Elo Difference”. Elo is really cool (again, I’m biased), but it is not widely used or accepted in the broader FRC community relative to things like event ranks or District Points. Either of the latter can easily be substituted for the Elo ranks used in “Average Rank Difference” and “Weighted Rank Difference”, but trying to map them onto something like Elo ratings would get messy very quickly.

So we’re left with just “Average Rank Difference” and “Weighted Rank Difference”. “Average Rank Difference” has the benefit of being simpler to explain and understand. “Weighted Rank Difference” is slightly harder to explain, but it does correlate marginally better with the output of my event simulator. I believe the higher correlation of “Weighted Rank Difference” comes from the fact that individual partners should be weighted higher than individual opponents due to their effect on the bonus RPs. Good opponents can cost you the win, but good partners can both help you win and help you to achieve bonus RPs. Both of these options are good choices and I can understand using either.

Final Thoughts

My personal choice moving forward though will be to use “Average Rank Difference”. Current students have never experienced a game without the bonus RPs, so they might not realize that this ranking structure is actually a recent phenomenon in FRC. I am not yet convinced that the 2RP win + 2 bonus RPs for separate game tasks formula will hold into the future, so I think it makes more sense at this point in time to weight all partners and opponents equally for strength of schedule, and not assume we will continue having bonus RPs indefinitely. If the GDC continues this pattern for a few more years I will re-evaluate, but that is where I stand for now.

That’s all I’ve got for now, but I’m not done yet. It’s one thing to complain about existing schedules, but my next step is to use this metric to actually generate “balanced” schedules. I’d also like to go back in time and apply this metric to all previous 3v3 games so that we can see the best and worst schedules of all time; I want to see in context how awful the 2007 algorithm of death actually was.

 

Update: Fixed the incorrect formula image for “Caleb’s Strength of Schedule”, corrected bad link

Chezy Champs Ranking Projection Comparison

Introduction

Hello everyone, I’m happy to make my inaugural post on my FRC Statistics Blog. Recently, I’ve noticed that much of my work in my Miscellaneous Statistics Projects thread has essentially become a blog. What I mean by that is that I spend a lot of time writing and editing those posts, and I was in a sense breaking the “forum” style of CD since I make well over half the posts in those threads. I also wanted more formatting tools at my disposal, so I decided to start a blog. The plan for now is to use this platform for some of my thoughts which would otherwise go on CD. I appreciate feedback, so if you have any thoughts or suggestions on what I can improve in this format, please let me know. Without further ado, let’s get into it.

Background

A few weeks ago, I hosted the Chezy Champs Ranking Projection Contest. I’ve done match prediction contests before, but with my new ability to project ranks before events start, I thought it might be fun to see how my ranking projections stack up to others’. Ian H won that contest, but Ari was kind enough to share a full probability distribution for each team to achieve each rank, so I wanted to compare my ranking distributions against Ari’s to hopefully identify weaknesses in my event simulator. I’ve uploaded a book titled “CC Ranking Comparison” to CD which does this. In addition to Ari’s predictions and my predictions, I’ve added a baseline set of predictions under the name “Ignoramus”. These predictions simply predict that every team has a 1/42 chance of achieving every rank.

Detour on Error Functions

For each of these distributions, I found the squared error and the log loss error for every prediction. You all may be familiar with squared errors from my previous work, but I think I am going to be transitioning to log-loss for much of my work going forward. Essentially, the log loss error is just -LOG(1 - ABS(predicted probability - actual result)). Here’s a graph that shows the difference between the two error functions (note that I scaled down the log function by a factor of 3 to make the graph more readable):

 

As you can see, the graphs are reasonably similar everywhere except for very poor predictions. You can’t see it because I truncated the y-axis, but the log-loss curve actually goes all the way up to +infinity, meaning that if you ever predict something as 0% and then it occurs, you made an infinitely bad prediction under the log-loss formula. There are lots of choices for error functions, and both of the above choices are reasonable. However, from what I’ve seen, log loss is generally agreed to be a superior error function out in the real world because it better approximates the step function between 0 and 1.
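Here is a small Python sketch of the two error functions as described above, applied to a single probabilistic prediction; the example probability is made up.

```python
import math

def squared_error(p, outcome):
    """Squared error between a predicted probability and a 0/1 outcome."""
    return (p - outcome) ** 2

def log_loss(p, outcome):
    """Log loss as defined above: -log(1 - |p - outcome|).

    Equivalent to -log(p) when the outcome happens and -log(1 - p) when it
    doesn't; it blows up toward infinity for confident wrong predictions.
    """
    return -math.log(1 - abs(p - outcome))

# Example: a team is given a 25% chance of landing at a particular rank...
p = 0.25
print(squared_error(p, 1), log_loss(p, 1))  # ...and it does finish there
print(squared_error(p, 0), log_loss(p, 0))  # ...and it does not
```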

Results

Anyway, I calculated errors with both methods for all three ranking distributions. Here are the top-level results for all predictions (lower is better):

We can see that my predictions came out on top in both methods, Ari got second, and Ignoramus got last, which is a good sign! That means both Ari and I have more predictive power than the blind assumption that all teams are exactly equal. The amount by which my predictions beat Ari’s depends on the formula you use: in the RMSE formula, mine appear drastically better than Ari’s when referenced against Ignoramus, but in the log-loss method, my predictions are only about twice as good as Ari’s compared to Ignoramus’. Don’t mind the magnitude difference between the RMSE and log-loss methods; it doesn’t make sense to directly compare different scoring methods in this way.

At a team level, here were the teams where I made worse predictions than Ignoramus according to the log-loss function:

  1. 696
  2. 1538
  3. 8
  4. 5818
  5. 604
  6. 3476
  7. 4488
  8. 1072
  9. 5012
  10. 3309

And here were the teams where I made worse predictions than Ari according to the log-loss function:

  1. 696
  2. 3310
  3. 8
  4. 4488
  5. 5012
  6. 5818
  7. 3476
  8. 3309
  9. 3647
  10. 604
  11. 5924
  12. 2990
  13. 4388
  14. 2910
  15. 846

I’ll be doing my own private investigation into all of these teams to see if there are any noticeable similarities, but if anyone thinks they see an obvious pattern among them I’d love to hear it.

Thanks

Huge thanks to Ari for putting out his predictions. I’m very happy someone else made something that I could compare myself against, as without that I’m blind to how good or bad my predictions actually are. I’m excited to see what else he puts out in the future.

 

Until next time,

Caleb