Bridge Rating Study - Follow Up

In my previous post, I presented a rating study based on ACBL tournament data. It generated a lot of interest and comments. One player pointed out to me that a plot on page 4, showing actual score vs. predicted score, had a slope of 2. Since then I have studied the data further to understand it and to see if there is any room for improvement. Here are the updated results.

To explain the details I have to go a little deeper into the math model. The Elo methodology uses an expectation function to calculate an expected score, compares it against the actual score, and creates a rating adjustment from the difference. The technical document referenced in my original post has the details. Here I just want to mention that a K factor determines how large this adjustment is. The K factor has a component determined by the Number of Effective Boards (NEB): the ratio of NEB to the number of boards played in the most recent game determines the K factor.

The expectation function I chose is:

F(a-b) = htan((a-b)/2000)

where htan is the hyperbolic tangent function (tanh), and a and b are the ratings of the two pairs at a table. The hyperbolic tangent ranges from -1 to 1, which corresponds to a loss or a win on a board. The scale factor 2000 affects the absolute value of the ratings but does not change the characteristics of the model.
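As a minimal sketch, the expectation function above could be written in Python as follows. The function name `expectation` and the sample ratings are only illustrative; the formula and the scale factor of 2000 are the ones given above.

```python
import math

SCALE = 2000.0  # scale factor from the formula above


def expectation(a: float, b: float, scale: float = SCALE) -> float:
    """F(a-b) = tanh((a-b)/scale): expected result, strictly between -1 and 1."""
    return math.tanh((a - b) / scale)


if __name__ == "__main__":
    # Illustrative ratings only, not taken from the study's data.
    for diff in (0, 250, 500, 1000, 2000):
        print(f"rating difference {diff:5d} -> F = {expectation(1500 + diff, 1500):.3f}")
```

Equal ratings give F = 0 (an average result), and swapping a and b only flips the sign.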

I used a subset of the data to study changes in the scale factor, NEB, and K factor, and found that such changes do not make any significant difference in the results.

Next, there is a factor unique to bridge: matchpoints are calculated by comparing scores between different tables, so the expected score needs data from at least two tables, or four pairs. My initial choice of formula for this score was:

S = (F(a-b) - F(c-d))/2

where F(a-b) is the expectation function for table 1 and F(c-d) is the one for table 2; a and b are the NS and EW pair ratings at table 1, and c and d are those at table 2. Dividing by 2 normalizes the result to the range -1 to 1. It was done this way because a lot of the initial work was done manually, and this approach only requires calculating the expectation function once per table. The score for any combination of two tables can then be derived from those values.
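The "once per table" idea can be sketched like this: compute F for every table first, then combine the stored values for any ordered pair of tables. The function names, the table numbering, and the ratings are hypothetical and only serve the illustration.

```python
import math
from itertools import permutations


def expectation(diff: float, scale: float = 2000.0) -> float:
    """F(diff) = tanh(diff/scale)."""
    return math.tanh(diff / scale)


def table_expectations(tables):
    """tables maps table id -> (NS rating, EW rating); F is computed once per table."""
    return {tid: expectation(ns - ew) for tid, (ns, ew) in tables.items()}


def pairwise_scores(tables):
    """S = (F(a-b) - F(c-d)) / 2 for every ordered pair of different tables."""
    f = table_expectations(tables)
    return {(t1, t2): (f[t1] - f[t2]) / 2 for t1, t2 in permutations(f, 2)}


if __name__ == "__main__":
    # Illustrative ratings only.
    tables = {1: (1600, 1400), 2: (1500, 1500), 3: (1350, 1650)}
    for (t1, t2), s in pairwise_scores(tables).items():
        print(f"NS at table {t1} compared with table {t2}: S = {s:+.3f}")
```

Because each F lies between -1 and 1, the halved difference also stays between -1 and 1.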

There is another way to construct this score:

S = F((a-b)-(c-d))

In this formula, instead of calculating the expectation function from the rating difference between the two pairs at each table, the expectation function is applied once to the difference between the two tables' rating differences. Readers who are familiar with the math could run a Taylor expansion on the two formulas and evaluate the difference.
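For readers who want to see that expansion, here is a sketch. The labels S_1 (the first formula), S_2 (the second formula), x and y are introduced only for this comparison, and O(5) stands for terms of fifth order and higher in x and y.

```latex
\begin{align*}
x &= \frac{a-b}{2000}, \qquad y = \frac{c-d}{2000} \\
S_1 &= \tfrac{1}{2}\bigl(\tanh x - \tanh y\bigr)
     = \tfrac{1}{2}(x-y) - \tfrac{1}{6}\bigl(x^3 - y^3\bigr) + O(5) \\
S_2 &= \tanh(x-y)
     = (x-y) - \tfrac{1}{3}(x-y)^3 + O(5) \\
S_2 - 2S_1 &= \tfrac{1}{3}\bigl[(x^3 - y^3) - (x-y)^3\bigr] + O(5)
            = xy\,(x-y) + O(5) \\
S_2 &= \frac{\tanh x - \tanh y}{1 - \tanh x \tanh y}
     = \frac{2\,S_1}{1 - \tanh x \tanh y} \qquad \text{(exact)}
\end{align*}
```

To leading order the second formula is twice the first, so for the same four ratings it predicts a score further from average. Beyond that factor of 2, the two formulas differ only from third order on, by roughly xy(x-y).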

On the test sample, changing to this formula showed some improvement in the results. So I reprocessed my library of ACBL tournament data using the new formula, keeping all other parameters unchanged. The results are shown in the following pages.

The data set used in the following diagrams is from the most recent 12 months of tournament data (11/2018 - 10/2019). This should be the set with the best quality, and it is large enough to provide statistically valid conclusions. A filter required pairs to have a regular rating prior to the game (>200 boards). This is the same selection used in the previous study.

The distribution of actual score minus predicted score is shown in the following histogram. This distribution has a mean of 0.12% and a sigma of 5.89%; the previous study had a mean of 0.38% and a sigma of 5.93%. The 2D contour plot is shown in the next graph. Compared with the same figure from the previous study, the slope has changed a little and is still > 1. However, this slope only follows the peak values. Each actual score has a corresponding predicted value. To evaluate how good the prediction is, we need to check all actual values for a given predicted score. This means that for each bin on the X axis (a predicted score) we should check the distribution of actual scores (Y values).
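The mean and sigma quoted above are summary statistics of the residual, actual minus predicted. A minimal sketch of that calculation is below. The function name and the toy numbers are illustrative, and using the population standard deviation is an assumption, since the text does not say which definition of sigma was used.

```python
from statistics import fmean, pstdev


def residual_summary(actual, predicted):
    """Return (mean, sigma) of actual - predicted, in the same units as the inputs."""
    residuals = [a - p for a, p in zip(actual, predicted)]
    # pstdev is the population standard deviation (an assumption, see lead-in).
    return fmean(residuals), pstdev(residuals)


if __name__ == "__main__":
    # Toy percentages only, not the study's data.
    actual = [52.0, 48.0, 55.0, 61.0, 44.0]
    predicted = [50.0, 50.0, 53.0, 57.0, 47.0]
    mean, sigma = residual_summary(actual, predicted)
    print(f"mean = {mean:.2f}%, sigma = {sigma:.2f}%")
```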

The following figure shows the data points selected with a predicted score of 57%, that is, the distribution of actual scores for all data points with expected S = 57%. This is a typical distribution for a prediction greater than 50%: the actual score distribution has a longer tail on the left side than on the right. The mean of this distribution is 56.5%. The data selection for the plot had an upper limit of 70%, so Excel treated anything higher than 70% as 70%; the actual mean might therefore be even higher.

The next plot shows the mean actual score vs. the predicted score for the entire data set. There is a filter requiring each predicted score to have >100 data points, to remove points with low statistics. The relationship is linear with a slope very close to 1, especially in the central region, which has a lot of samples. At each end, the divergence from linear could be attributed to low sample size.
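The procedure behind this plot can be sketched as: group the data by predicted score, drop bins with 100 or fewer points, take the mean actual score in each remaining bin, and look at the slope. The bin width of 1%, the function names, and the unweighted least-squares fit are assumptions made for the sketch, and the numbers in the demo are made up.

```python
from collections import defaultdict
from math import floor
from statistics import fmean


def mean_actual_by_bin(records, bin_width=1.0, min_count=100):
    """Group (predicted, actual) pairs by predicted-score bin.

    Returns {bin lower edge: mean actual score} for bins holding
    more than min_count data points.
    """
    bins = defaultdict(list)
    for predicted, actual in records:
        lower_edge = floor(predicted / bin_width) * bin_width
        bins[lower_edge].append(actual)
    return {edge: fmean(vals) for edge, vals in sorted(bins.items())
            if len(vals) > min_count}


def least_squares_slope(points):
    """Unweighted least-squares slope of y on x for a {x: y} mapping."""
    xs, ys = list(points), list(points.values())
    mx, my = fmean(xs), fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sxx


if __name__ == "__main__":
    # Made-up data: two well-populated bins and one sparse bin that is dropped.
    records = ([(50.3, 50.0)] * 150 + [(57.2, 56.0)] * 101
               + [(57.9, 57.0)] * 101 + [(70.0, 80.0)] * 5)
    means = mean_actual_by_bin(records)
    print(means)
    print("slope:", least_squares_slope(means))
```

An unweighted fit gives every surviving bin the same influence regardless of how many points it holds. Weighting by bin count would emphasize the central region more.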

The above results show that the model's prediction is in reasonably good agreement with the actual data. However, there is an intrinsic spread in actual scores, with a sigma of ~6%, due to the limited number of boards per game (usually 24-27).
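A back-of-the-envelope sketch shows why a spread of this size is plausible. It rests on two simplifying assumptions that are not taken from the data: board results are independent, and a pair's matchpoint percentage on a single board is roughly uniform between 0% and 100%, which has a standard deviation of 1/sqrt(12), about 28.9%. The average over n boards then has a sigma of that value divided by sqrt(n).

```python
import math

# Standard deviation of a uniform distribution on [0, 1] (assumed per-board spread).
UNIFORM_SIGMA = 1.0 / math.sqrt(12.0)


def game_sigma(n_boards: int, per_board_sigma: float = UNIFORM_SIGMA) -> float:
    """Sigma of the average of n independent board scores."""
    return per_board_sigma / math.sqrt(n_boards)


if __name__ == "__main__":
    for n in (24, 25, 26, 27):
        print(f"{n} boards: sigma = {100 * game_sigma(n):.1f}%")
```

Under these assumptions the sigma comes out between roughly 5.6% (27 boards) and 5.9% (24 boards), the same size as the ~6% seen in the data.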

Another issue raised by players was survivorship bias. The above data only used results from pairs with more than 200 boards. These tend to be successful tournament players with a stable partnership. One could speculate that this group of pairs is better than those not in the group, which might bias the pairs in this study towards scoring better than their prediction.

It might be interesting to see how pairs with 200 or fewer boards perform. These plots are shown on the next couple of pages. The distribution is skewed relative to a regular Gaussian distribution. The mean of actual score minus prediction is about -0.18%, with a sigma of 6.23%. Since this set of data is from pairs without an established rating, odd behaviors are somewhat expected.

For reference, the combined data (N > 200 plus N <= 200) had a mean of -0.001% with a sigma of 6.03%.

The 2D contour plot is shown next. The data is more concentrated in the <50% region for both predicted and actual scores, an indication that it comes from pairs who are less successful in tournaments.

Finally, the mean actual scores were compared against the predicted scores; the results are in the following plot. The result is similar to that for pairs with an established rating, but both ends diverge more.
