Optimal pairings. USCF rules, WinTD, and a program I wrote.

Yes, but do you also make mead? On my way out the door to go to Pizza Nite at Dogfish Head…

Thanks, Bill.

That’s actually what I’m doing at this point. In order to do that effectively, though, I have to be able to make a legitimate comparison, rather than relying on cherry-picked examples. I have to be able to take real data from real tournaments and, using reasonable assumptions, show that the results of using one method really are different from another method. However, to do that, I have to make sure I have a reasonable implementation of both methods. If my program tries to use Standard Swiss/WinTD style pairings, but gets them wrong, the comparison isn’t valid.

Which brings me to the next problem. I added in draws and upsets, and now in my sample data I get to round 3. In round 3, there are 8 players that have one win. I show their ratings and due colors below, in top half/bottom half order.

1305 (B) 863 (B)
1229 (W) 673 (W)
1179 (B) 586 (B)
900 (W) 397 (B)

Each of these players has one win and one loss, and color allocation has been perfect; i.e., if a player is due black, he had black in round 1 and white in round 2. That also means that all swaps will be for “alternating” instead of “equalizing”. None have played each other except 863 and 673, who played each other in round 2, at the edge of the two score groups; 673 got an upset win.

Since there are 5 players due black, someone is not going to get due color. It seems to me that there are two options. If you swap 863 and 673, that’s a 76-point swap based on the rating of their opponents. That leaves one player, 586, without due color. The other option is to swap 673 and 586. That’s a 50-point swap, again based on opponents. Once again, that leaves one player, 863, without due color.

Is there an objective reason, based on some rule or some effect on the tournament, to favor one swap over another? Needless to say, my implementation of Standard Swiss chose one swap. WinTD chose another. Based on the rules, it seems arbitrary to me. I thought that WinTD might always choose the lower swap value, but I found that in round 2 there was a “better” swap available, i.e., a lower difference in rating, than the one it picked. Is there something I’m missing that favors one transposition instead of the other?

(And yes, I do make mead, but just a few gallons now and again as a hobby.)

You’ve got me on this one. I, too, would have thought WinTD would choose the smaller swap – especially since, in this case, one swap is smaller no matter whether it’s based on the players’ or the opponents’ rating difference.

By any chance, do you have anything in the “club”, “team”, or “family” fields for any of these players (or do any of them have the same last name) that might make WinTD reluctant to pair them against each other?

Some TDs, when forced to assign a bad color, prefer to push the bad color down as far as they can into the score group. That’s a bad idea, in my opinion, and I doubt whether WinTD would do that.

Bill Smythe

The rulebook does give a preference to a lower-point-differential swap. See the examples in 29e7.

I’m not certain, but it sounds like WinTD chose the 50-point (lowest) swap in round three and you are questioning its round two choice. Without seeing the details, I cannot comment on the round two choice.

Hmm. Upon re-reading your (Meadmaker’s) post, it does become unclear which round you were talking about:

So, did the discrepancy occur in round 2, and/or round 3?

In any case, I (along with jwiewel) would be interested in seeing exactly what happened in round 2.

Bill Smythe

I see what happened with WinTD in Round 2. Here’s the list of ratings and due colors.

2225 W 1305 W
2114 B 1229 B
2062 W 1179 W
1901 B 900 B

My program starts at the top. There’s a 76-point transposition at the top, then a 179-point transposition, which is legal for equalizing.

However, there’s a 50-point transposition available in the middle. Once you make that, you’re cut off from any further swaps, and it would leave the top and bottom games with a color mismatch. I reprogrammed my system to favor the smallest swaps, and that’s what I ended up with. WinTD, on the other hand, ended up swapping the top and bottom pairs, avoiding the smaller 50-point swap in the middle.

If WinTD favors the smaller swaps, it must also try multiple passes. First, try the 50 point swap, and see that it leaves two equalizing color failures. Then, try starting with the next highest swap, and you can get to a perfect color allocation, so it chooses the better allocation, even though it involves larger swaps. That’s a bit more sophisticated programming. Right now, I don’t retry different starting points.
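That retry-and-compare idea can be sketched in a few lines of Python. This is a rough sketch with my own names, using the round-2 group shown above; the real search in any pairing program is certainly more involved:

```python
from itertools import combinations

def color_failures(top, bottom):
    # count boards where both players are due the same color
    return sum(ca == cb for (_, ca), (_, cb) in zip(top, bottom))

def best_transpositions(top, bottom):
    """Try every set of non-overlapping adjacent transpositions in the
    bottom half; prefer fewest color failures, then smallest total
    rating distance swapped."""
    best = None
    n = len(bottom)
    for r in range(n):
        for combo in combinations(range(n - 1), r):
            # skip sets where two adjacent swaps share a player
            if any(b - a == 1 for a, b in zip(combo, combo[1:])):
                continue
            cand, dist = bottom[:], 0
            for i in combo:
                dist += abs(cand[i][0] - cand[i + 1][0])
                cand[i], cand[i + 1] = cand[i + 1], cand[i]
            key = (color_failures(top, cand), dist)
            if best is None or key < best[0]:
                best = (key, cand)
    return best

# Round 2: (rating, due color), top half vs. bottom half, board by board.
top    = [(2225, 'W'), (2114, 'B'), (2062, 'W'), (1901, 'B')]
bottom = [(1305, 'W'), (1229, 'B'), (1179, 'W'), (900, 'B')]
(fails, dist), paired = best_transpositions(top, bottom)
# The greedy 50-point middle swap alone leaves two boards with bad
# colors; swapping the top and bottom pairs instead fixes all four.
```

Evaluating every candidate set of swaps, rather than stopping at the first legal one, is exactly what distinguishes this from a single greedy pass.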

I spent some of today trying to make sure my program could do the statistical analysis. I haven’t gotten USCF Swiss pairings right yet, so I just did a baseline comparison of the Lame method to totally random, zero-effort pairings. That’s the system that just picks players, and their colors, at random in every round. I programmed it to calculate two metrics. The first was “rank difference”. We know the expected finish order based on the true ratings. After determining the final finish order, using points and then modified median, we compare the two. The “rank difference” is the sum of the squares of the differences between the expected order of finish and the final order of finish. If players finish exactly in the expected order, the rank difference will be zero.
There will always be some rank difference, and there should be, but if the pairing system gives players advantages, the rank difference will be larger. After running my sample data several times, the average rank difference is 76.4. The average rank difference from random pairings was 303.4. Good. Those are the expected results. The random pairings mean some players will play only weak opponents, and will score higher than they deserve.
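A minimal sketch of that rank-difference calculation (Python; the function and variable names are my own, not taken from the actual program):

```python
def rank_difference(true_ratings, final_order):
    """Sum of squared differences between each player's expected rank
    (by true rating, best first) and his actual finishing rank."""
    expected = sorted(true_ratings, key=true_ratings.get, reverse=True)
    expected_rank = {p: i for i, p in enumerate(expected)}
    return sum((expected_rank[p] - i) ** 2
               for i, p in enumerate(final_order))

ratings = {'A': 1500, 'B': 1400, 'C': 1300}
# A perfect finish order gives zero; one adjacent upset gives 1 + 1 = 2.
rank_difference(ratings, ['A', 'B', 'C'])  # 0
rank_difference(ratings, ['B', 'A', 'C'])  # 2
```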

Another metric is the average difference in rating points between opponents in played games. The Lame method had an average rating difference of 776 points. The random method had an average rating difference of 897 points. Once again, good. We would expect any variation of the Swiss system to push similarly rated players together, and so produce more competitive games. Confirmed.

The interesting comparison will come between USCF swiss and the Lame method. If only I can get the bugs out of the USCF implementation.

WinTD does do multiple iterations to find the best pairing it can. One thing you can do is use the check box to activate the logging of pairing logic so that you can review it and see why it chose the pairings it did (a 76-point and 161-point transposition fixing all four games rather than a single 50-point transposition that fixed only half of the games). If the 1901 had been 1801 then it would have done only the 50-point transposition.

On pages 152-155 the rulebook discusses “top-down” vs “look-ahead” pairings. One should always strive to optimize the entire score group, rather than each player one at a time.

I’m sure it does, but I doubt whether that is (or should be) the first step.

WinTD probably first counts total due-whites and due-blacks, to see how many forced bad colors there will be. In this case, the difference is zero, so there should be no bad colors. So it would reject, out of hand, any pairings that leave bad colors. It sees that, therefore, the top pairing must be swapped, so it doesn’t even consider swapping the second and third pairings.
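That first due-color count is a one-liner; here is a sketch of the idea (hypothetical Python, not WinTD’s actual code):

```python
def forced_bad_colors(due_colors):
    """Minimum number of boards that must get a color mismatch, given
    the list of due colors ('W'/'B') in a score group."""
    whites = due_colors.count('W')
    blacks = len(due_colors) - whites
    return abs(whites - blacks) // 2

# Round 2 above: 4 due white, 4 due black, so no bad colors are forced.
forced_bad_colors(['W', 'B', 'W', 'B', 'W', 'B', 'W', 'B'])  # 0
# Round 3: 5 due black, 3 due white, so exactly one board must be bad.
forced_bad_colors(['B', 'B', 'W', 'B', 'B', 'W', 'B', 'W'])  # 1
```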

This thread started off with many of us saying, you’ve got a lot of work ahead of you if you expect your program to make pairings as good as those made by the established commercial programs. I guess now that prediction has come full circle.

Your program’s main contribution, so far, is the discovery of the middle-vs-middle (“double Harkness” or “Lame”) variation in dealing with score groups with an odd number of players. By now I am firmly convinced that, for odd score groups, the Lame variation is:

  • always superior to the Harkness variation, and
  • sometimes superior to the bottom-vs-top main-line rule, and
  • almost always at least as good as bottom-vs-top.

It’s time for all of us, and especially the rules committee, to consider changes for 2013. The first change should be to replace the Harkness variation with the Lame variation as the main alternative to bottom-vs-top in pairing odd score groups.

True, and that would be preferable to what the average inexperienced human TD would do – swap the top two, then notice the 261-point difference at the bottom, and leave the bottom two alone.

And let’s face it – almost all human TDs these days are inexperienced, because they let the programs do it. Pairing is rapidly becoming a lost art.

Bill Smythe

I’m having trouble figuring out how WinTD handles the multiple “float” issue. I fixed the most recent problem I had with looking down multiple paths for improvements, but I then noticed a discrepancy between my pairings and WinTD. It looked like WinTD refused to send one player down, and instead chose a different “odd player”. I couldn’t see any reason to pick the chosen player over the lowest-ranked player, except that the lowest-ranked player in the score group had already been floated before… twice.

So, it has to be ignoring the rule at least in part, because the player had been floated twice. However, it didn’t float him a third time. It could be some sort of ratings-comparison issue, or it could be simply a two-versus-three issue; i.e., maybe it will float a player twice, but not three times.

USCF rules are (as far as I know) silent on the question of downfloating (or upfloating) the same player twice. FIDE rules, however, discourage such multiple floats. It’s entirely possible WinTD defers somewhat to FIDE in this area.

It could also depend on whether the two downfloats occurred in consecutive rounds (which would likely be more strongly discouraged than in non-consecutive rounds). And/or, it could depend on whether the first downfloat vindicated itself (i.e. the downfloated player lost). And/or, whether the player was upfloated between his two downfloats. And/or, how frightfully inconvenient (and distorting) it would be to avoid the second downfloat. And/or, and/or, and/or …

It would be extremely interesting to see the crosstable (including colors, of course).

Bill Smythe

In Editions 1 & 2 that was a variation. I can’t find either 3 or 4 to see when it disappeared.

It seems to have disappeared in the third edition. Section II.6.D is “Rules on the Odd Man,” and II.6.D.1 reads:

There are a couple of interesting observations to be made here. First, there is no mention of “floats” and no provision for avoiding giving the same player consecutive floats in the same direction. Second, there is no provision for choosing a different player to treat as the odd man solely to improve color allocation in the higher score group. However, the last paragraph of II.6.D.2 allows the director to pair the odd man with a lower-rated player in the next-lower score group to improve color allocation:

As often happens when code is thrown together in an amateur fashion, by which I mean a couple of hours here and there in my spare time, my Swiss algorithm just had too many band-aids in it. I had to take a step back, actually look at the process, and get it right, so I am revising my structures to get rid of the bugs.

In the meantime, I was thinking about metrics, and it occurred to me that I might be using bad metrics to determine whether a given pairing system creates an advantage for a player. My assumption was that the difference between a player’s rightful rank, based on true rating, and the rank he actually achieved in the tournament is due to beneficial or detrimental pairings. Of course, that is not true in any given tournament, but it should be true over the course of 1000 tournaments using the same players.

However, what occurred to me was that some pairing systems force players of similar rating together more than others. If players are close in ratings, there will be more upsets, and more variation in the finish order. It’s not necessarily true that one player had a benefit. It’s just that there was more opportunity for upsets that forced the final result away from the expected result.

I’m trying a different method of measuring pairing benefit to players. One of the motivations for the Swiss system is to force players to play people of similar abilities. One consequence of this is that at the end of the tournament, the winners would ideally also be the players that played the most difficult opponents. So, at the end of the tournament, I will look at each player’s opponent list. Using the opponents’ true ratings, I will calculate what true rating would be required to achieve a score equal to 1/2 of the number of games played, and then use that calculated rating to rank players. In other words, I will be ranking the players based on how difficult it would be to beat the opponents that they played. Ideally, this rank would also be equal to the finish order. By measuring the difference between the finish order and the opponent-difficulty rank, a measure of player benefit from favorable pairings can be estimated.
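Assuming the standard Elo expected-score formula, that break-even rating can be found by bisection, since expected score is monotone in rating. A rough sketch (Python, with my own function names):

```python
def expected_score(rating, opp_rating):
    # standard Elo expectation for one game
    return 1.0 / (1.0 + 10 ** ((opp_rating - rating) / 400.0))

def even_score_rating(opponents, lo=0.0, hi=3000.0):
    """Rating at which the expected total score against `opponents`
    equals half the number of games, found by bisection."""
    target = len(opponents) / 2.0
    for _ in range(60):
        mid = (lo + hi) / 2.0
        if sum(expected_score(mid, o) for o in opponents) < target:
            lo = mid  # too weak to reach an even score
        else:
            hi = mid
    return (lo + hi) / 2.0

# By symmetry, opponents at 1300 and 1500 require exactly 1400.
even_score_rating([1300, 1500])  # ~1400.0
```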

It also occurred to me that a difference of a few places in the middle of the standings is not as significant as at either end. In other words, if there are 16 players in the tournament, there is hardly any difference between 7th place and 9th place, but there is a big difference between 1st place and 3rd place. So far, my metrics haven’t dealt with that, so I’ll be trying to come up with one that will.
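One hypothetical shape such a metric could take (purely my own illustration, not anything settled on in the thread): weight each squared rank error by its distance from mid-field, so errors near either end count more.

```python
def weighted_rank_difference(expected_order, final_order):
    """Like plain rank difference, but a misplacement near 1st or last
    place is weighted more heavily than one in mid-table."""
    n = len(expected_order)
    expected_rank = {p: i for i, p in enumerate(expected_order)}
    total = 0.0
    for i, p in enumerate(final_order):
        weight = 1.0 + abs(i - (n - 1) / 2.0)  # grows toward both ends
        total += weight * (expected_rank[p] - i) ** 2
    return total

order = ['A', 'B', 'C', 'D']
# Swapping 1st/2nd costs more than swapping the two middle places.
weighted_rank_difference(order, ['B', 'A', 'C', 'D'])  # 4.0
weighted_rank_difference(order, ['A', 'C', 'B', 'D'])  # 3.0
```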

That’s not how it actually does it, though the effect is the same in most cases. WinTD assigns a pairing “score” to a set of pairings, with 0 being ideal: everyone is playing in order, top half vs. bottom half, in each score group, and all the colors work. Every deviation from ideal is penalized an amount based upon its severity. A rematch of an earlier pairing is effectively infinite, while a flip-flop of < 80 points (or whatever the setting is for the alternation limit) is small, small enough that having every board in the group changed by less than 80 gives a lower penalty than just one alternation error. There are also penalties for changes between 80 and 200 points, equalization errors, changes over 200 points, top-bottom interchanges, and playing out of score group. It then tries a whole bunch of pairing swaps to see if it can reduce the penalty score.
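As a toy illustration of that penalty-score idea (the weights below are invented; only their relative ordering follows the behavior described, and WinTD’s real values are not public here):

```python
import math

# Hypothetical penalty weights; rematch is effectively infinite, and a
# swap under the alternation limit is cheap enough that four of them
# (4 * 1 = 4) cost less than one alternation error (6).
PENALTY = {
    'rematch':            math.inf,
    'swap_under_limit':   1,   # transposition below the alternation limit
    'swap_80_200':        5,
    'swap_over_200':      20,
    'alternation_error':  6,
    'equalization_error': 12,
    'interchange':        15,
    'out_of_group':       25,
}

def pairing_score(features):
    # a candidate pairing set is summarized as a list of its deviations
    return sum(PENALTY[f] for f in features)

# Round 2 above: two swaps that fix every color beat one small swap
# that leaves two boards with equalization errors.
two_swaps = pairing_score(['swap_under_limit', 'swap_80_200'])  # 6
one_swap  = pairing_score(['swap_under_limit',
                           'equalization_error',
                           'equalization_error'])               # 25
two_swaps < one_swap  # True
```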

Interesting. This idea is reminiscent of an older discussion, Point-Count Pairings.

Thank you also for reviving the Transposition Limits thread.

Bill Smythe

I think your program has gotten far away from the original intent and has gotten bogged down in minutiae.

To some extent, yes, but right now, those “minutiae” are the USCF rules. I’m trying to figure out in which cases I am violating them, and in which cases there’s just vagueness in the rules.

Meanwhile, I think the effort to quantify the effects of different pairing systems is worthwhile. Based on my studies so far, the Lame method appears to result in better results than standard USCF pairings. It seems less likely to give players unearned advantages as a consequence of favorable pairings. That’s a fairly significant finding, and worth paying attention to. However, before I make such a claim in a wider forum than this thread, I want to have a more solid basis in evidence. I have to be sure that it isn’t due to a bug in the program, a flaw in my competition model, or a fluke of the random sample data. Right now, I haven’t done enough analysis to rule out those possibilities, so for the moment these remain interesting preliminary findings, worthy of further study.

Definitely, and this is your biggest contribution (rather than the program itself) so far.

You need to be careful here, because you can’t really say what’s “better” until you define your objectives.

In this thread you have done a decent job of defining your objectives. If, however, others disagree with your objectives, they are likely also to disagree as to what is better.

The part I like best is your handling of odd score groups. Middle-vs-middle seems superior to Harkness (middle-vs-top), and at least as good (in most cases) as bottom-vs-top.

Bill Smythe

Absolutely. My previous post erred in not defining “better”. In that case, what I was referring to was that the Lame method was more likely to end up with final results that were close to the results that would be achieved by ranking players based on their true abilities. I kind of hinted at that in my previous post, but wasn’t sufficiently explicit.

And note that while a real TD doesn’t have access to “true abilities”, in my simulator, I do. If I play enough rounds, I can know exactly what order my simulated players will finish in, because I have access to the “true rating”.

On the other hand, the Lame method seems to accomplish this goal by providing easier pairings for the favored players. At the end of the tournament, there were more discrepancies between the final rank and the difficulty of opponents than seen with standard Swiss. This would tend to not be considered “better” by many people. There will undoubtedly be tradeoffs in the use of different methods.

Do you intend to compare your program to SwissSys pairings? I compared SwissSys to WinTD many years ago. All I could determine was that sometimes they created different sets of legal pairings. “Best” was a matter of opinion in that test.