Lior, I understood your point that an IMP score implies a comparison between different players.

However, when the score is calculated not at board level but as a session average of 0.2 IMP/board, a lot of information becomes indistinguishable. For example, a 12 IMP swing on a single board gets averaged out when you look only at the session score. In addition, you don't have a single opponent to compare against, so the opponents' effects are averaged together.

At session level, an average-plus game would show 0.2 IMP/board and 53%, but the two can look quite different at board level. I did calculate both an IMP rating and an MP rating. They are correlated, but NOT strongly correlated. I'm not surprised that the EBU found their session-level results strongly correlated, because they lack the data granularity to see the difference. The problem I have now is that I don't have enough IMP game data; most of my study uses MP data converted to compute an IMP rating. Ideally you would compute the IMP rating from IMP game data and compare it against an MP rating computed from MP game data.

Wei-Bung, you are mixing probability with fact. Before an event happens, it has a probability; once it has happened, it is a fact, and a fact is 100%. Before East played the ♥2, it could have been in either East's hand or West's hand, and you could calculate the probability of each. Once he played the ♥2, it is a fact: it is 100%. You cannot take that card back and put it in West's hand. So no matter how you calculate, the probability of that card being in West's hand is ZERO. If your calculation still counts layouts where West holds this card, it is wrong. Both the 1/6 and the 1/4 become 1 after the card is played.

Charles, you just made a self-contradictory statement. First you said “every card played may change the odds, yes”, and then you said the probability (after playing one round of hearts) “must apply in advance”. If I read it correctly, you mean the odds should not change.

I explained in a previous response why the odds change after one round of hearts. You have seen two cards, and that eliminates the layouts with those cards in the opposite hands. If you would rather think in terms of the original position, you now know the position of 2 of the 6 cards; the unknowns are the 4 cards left. It does not matter how many choices the opponents had for the first card: once they played whichever card they selected, that play is 100%, and the other choices no longer exist.

Lior, as a former particle physicist, I concur with your comments about statistical models and variance. I disagree with the last part of your comments, because bridge is different from chess.

In chess, both players start from the same position (you could argue that White and Black make a difference, but the chance of drawing White or Black is equal). A chess game's result is definitive: win, loss, or draw. So you can compute a chess rating from a single game result, because both players start equal and the result can be compared with the rating's prediction.

Bridge is different: every board is a different hand. If you only know a score of 170, you cannot judge whether it is good or bad; it has to be compared against others who played the same board. To calculate a rating, I chose to compare at board level, where the players have exactly the same position (the same hands). The only difference is that they face different opponents, but this is factored into the calculation. If you calculate with session-aggregated results like the EBU, you introduce another systematic uncertainty that increases the variance. Finally, MP and IMP call for different strategies, and I don't think one can be converted to the other in a definitive way; putting the two types of games into one rating introduces yet another systematic uncertainty. When the variance due to these systematic uncertainties is large enough, it makes the rating less meaningful.

The probability definitely changes after you play one round of hearts, because it eliminates many layouts that were possible in the initial position but become impossible once those four cards have been played.

For example, with an initial 4-2 distribution, there are 5 possible holdings with the J in the 2-card hand and 10 with it in the 4-card hand. Once one round of hearts has been played, each defender has shown a card, and the possibility that those cards were in the opposite hands is eliminated. You now know the position of 2 of the original 6 cards, so the remaining possibilities change: there is only 1 holding with the J on the short side, and 3 with it on the long side.
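
This counting can be checked by brute-force enumeration. A minimal sketch, assuming six outstanding hearts with the jack among them ("J" below; the other names are arbitrary spot cards):

```python
from itertools import combinations

# Six outstanding hearts; "J" is the jack, the rest are spot cards.
cards = ["J", "a", "b", "c", "d", "e"]

# Initial position: every way the short (2-card) hand can be dealt.
short_hands = list(combinations(cards, 2))
j_short = sum(1 for h in short_hands if "J" in h)
j_long = len(short_hands) - j_short
print(j_short, j_long)  # 5 holdings with J in the 2-card hand, 10 in the 4-card hand

# After one round of hearts: each defender has shown one spot card
# (say "a" from the short hand, "b" from the long hand), so 4 unknown
# cards remain, split 1-3 between the two hands.
remaining = ["J", "c", "d", "e"]
short_rest = list(combinations(remaining, 1))
j_short_after = sum(1 for h in short_rest if "J" in h)
j_long_after = len(short_rest) - j_short_after
print(j_short_after, j_long_after)  # 1 holding with J on the short side, 3 on the long side
```

These are raw holding counts, as in the comment above; how to weight each holding by the way it was played is the separate question being debated.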

I'm not sure this is a case where you can use the normal probability calculation, because of the discards. If the opponents' discards are random, and we assume we know East has the ♣Q, then in this 3-card ending the odds that West has the other ♣ are 3:2. In addition, there is a 25% chance East started with the ♥J, so it favors playing for the drop.

However, if you read West's discard of the ♣9 as his last ♣ (why not discard the ♣2 if he could?), then hearts are 4-2, the odds that West has the ♥J are 3:1, and you should take the finesse.

What if West discards a ♥ instead of the ♣9? If he started with ♥Jxxxx, by restricted choice he has to discard a ♥, which seems to be even stronger evidence for the finesse.

Tim, as a matter of fact, the USCF requires 26 games for a regular rating; before that a rating is considered “provisional”. Let's assume that after their 26 games one player is rated 1220 and the other 1180, but they are really at the same level. As long as they continue to play, their ratings will keep changing and fluctuate around 1200. As Robin pointed out, there is an inherent measurement “error”. Everyone can have a good game or a bad game, in chess or in bridge. So when we use game results to adjust ratings, there is a built-in uncertainty that makes a rating fluctuate around the “true rating”.

The link below shows one section's ratings from last year's chess World Open. http://www.uschess.org/msa/XtblMain.php?201407068692.7 As you can see, a rating change of ±20 is very normal. For players who performed very well, the rating could go up by over 200 points (they probably did improve; if you click on a player's name, you can see his rating history).

Tim, it is true that a rating should measure current ability. The problem is how to measure it. All we can see is how well you did in a game: you scored 420 on board 1, -100 on board 2, and so on. You might have a good game one day and a bad game the next. Even if we average over a large number of games, there will still be some uncertainty.

There is a concept of a “current performance rating”. You can think of it as what an unrated player would get after his first game. However, it has a very large uncertainty. Rating algorithms that count only a certain number of recent games will have a player's rating fluctuate widely.

What I do in my system is calculate an expected score/handicap for every board, based on the player's pre-game rating, his opponent's rating, and the ratings of every other player who played the same board. Then I compare the actual score with this expected score/handicap. If the player's score is better, he gets a positive adjustment; if it is worse, a negative one.

The idea is that the rating should reflect the player's “current ability”, and we use his most recent games to validate it. If he performs exactly as expected, there is no adjustment. If his performance is better than his rating predicts, his rating increases; if it is worse, it decreases. So the rating should eventually converge on the player's “ability”.
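
A minimal sketch of that adjustment rule, simplified to a single opponent (the real system also folds in the ratings of every other pair that played the board; the linear model and the K value here are hypothetical placeholders, not the system's actual parameters):

```python
def expected_mp(rating_pair, rating_opp):
    """Expected matchpoint fraction against this opponent.
    Hypothetical linear model: a 400-point edge means 55% vs 45%,
    clamped to [0, 1]."""
    edge = 0.05 * (rating_pair - rating_opp) / 400.0
    return max(0.0, min(1.0, 0.5 + edge))

def adjust(rating_pair, rating_opp, actual_mp, k=16):
    """Per-board update: positive when the pair beats its expected
    score, negative when it falls short (Elo-style, hypothetical K)."""
    return rating_pair + k * (actual_mp - expected_mp(rating_pair, rating_opp))

# A 1200 pair scoring 60% on a board against an equal-rated pair:
# 1200 + 16 * (0.60 - 0.50) = 1201.6, a small positive adjustment.
new_rating = adjust(1200, 1200, 0.60)
```

A pair that performs exactly at its expected score gets an adjustment of zero, which is the convergence property described above.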

This depends on what the rating is meant to measure. I think a rating should measure playing ability. Suppose a player with an established rating stops going to tournaments for 3-5 years and then resumes playing. Has his playing ability changed? Should he/she be treated as a “new” player? I think the reasonable assumption is that his/her ability did not change.

For players who continue to play, their most recent game results are used to adjust their rating, so older games automatically get less and less weight. There is also a formula that produces different values for high-rated players than for low-rated ones: it allows a high-rated player's rating to change less than a low-rated player's.

As Robin pointed out, part of a rating is error/uncertainty. One good or bad game produces a rating change even though the player's ability did not change much; the next game it changes back. This variance is the “error” and can be controlled by the K factor.

Robin, I agree with you in general that a rating consists of two parts: a “true” rating and an error/uncertainty. However, handling historical games is a different issue.

In the EBU scheme, historical games have a weight that automatically decays with time. In my system, as in chess rating systems, a player's rating stays where it is if he stops playing. So if you stop playing for a year, your rating stays the same.

Once a player (in my system, a pair) has an established rating, each new game has less weight, especially for higher-rated players, so the rating tends to be stable. Only when players consistently perform above or below their expected level does their rating move in one direction. If you are interested, I could e-mail you my document with the details. How much the rating changes from the most recent game is determined by a K factor. I think I mentioned that the US Chess Federation adjusted their formula two years ago. They have had their rating system for 70 years and are still changing the parameters. So even with a rating system in place, we could still fine-tune some parameters.

In chess ratings and in my rating system, the time-based weighting of a game is implicit. All historical game effects are already folded into the current rating. The most recent game produces an adjustment, so it carries more weight than historical games, but we never go back and re-rate the historical games.

The EBU explicitly rates historical games at different rates. Other rating methods may count only a certain number of the most recent games.

When I said all games are counted equally, I meant that games from different events (whether an NABC or a club game) are treated the same in the calculation. A board played at 3 tables and one played at 100 tables do have different weights, because the calculation compares each score against all other tables: 3 tables yield only 2 comparisons, while 100 tables yield 99, so the latter automatically carries more weight. This comes purely from how many times a board is played, regardless of whether it is a national or a club event. So a worldwide game using the same duplicated boards could carry the most weight. However, I also have a normalization formula that limits how much weight a single board can have, so it does not skew the results.
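
The comparison-count weighting with a cap might look something like this (the cap value is an invented placeholder, not the actual normalization formula):

```python
def board_weight(num_tables, cap=25):
    """Each result on a board is compared against every other table,
    so a board played N times contributes N - 1 comparisons per result.
    A cap (hypothetical value here) keeps one widely played board from
    skewing the rating."""
    comparisons = max(num_tables - 1, 0)
    return min(comparisons, cap)

print(board_weight(3))    # 2 comparisons
print(board_weight(100))  # 99 comparisons, capped to 25
```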

A rating should have some predictive value. In chess, a player rated 400 points above his opponent is expected to win about 90% of games (if I remember correctly). In my system, I chose a 400-point difference to give the higher-rated pair a handicap of 1 IMP per board, or 10% in matchpoints (55% vs 45%). These are parameters used in the calculation and could be adjusted during a study period. A valid study would be to rate a universe of pairs and watch how their ratings change over time. The assumption is that a pair's playing ability does not change, so if they are rated correctly their ratings should never change. In practice this may not hold, because some pairs improve over time.
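
For comparison, the standard Elo expectation and a linear reading of the 400-points-per-IMP rule above (both functions are illustrative sketches; the "about 90%" chess figure comes out as roughly 91% under Elo):

```python
def elo_expected(diff):
    """Standard Elo expected score for a player rated `diff` points
    above the opponent."""
    return 1.0 / (1.0 + 10.0 ** (-diff / 400.0))

def imp_handicap(diff):
    """Hypothetical linear reading of '400 points = 1 IMP per board'."""
    return diff / 400.0

print(round(elo_expected(400), 3))  # 0.909, i.e. about the 90% chess figure
print(imp_handicap(400))            # 1.0 IMP per board
```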

As for performance variance, it should average out statistically. This is one of the reasons I calculate results at board level. If you have only one score per game, you have only one data point, such as 55%; at board level there are 20-30 data points. Even so there will be some variation. A new pair with a provisional rating is allowed to change its rating by a large amount; this is determined by the K factor. Once they have an established rating, this factor is reduced, and it is also reduced for higher-rated pairs. The assumption is that strong, established pairs are more stable and their ability changes slowly, so one bad or good game is more likely a statistical anomaly than a real change in ability.

Brian, I have to disagree with you on this one. A board played in 4♠+1 vulnerable scores 650. This is “objective data”. The only subjective data are adjusted scores, which I exclude from the rating calculation. It is like taking a measurement in physics to determine how fast a ball drops from 10 ft to the ground: measurements taken by different people should yield the same result.

If you are talking about subjective assumptions in the data analysis, I agree. The basic assumption is that one can measure a player's ability from his game results. Beyond that, I don't believe we should adjust the data: we should not weight certain games more than others, or one player more than another, and it is not a good idea to adjust the scores by other factors. Let the game results speak for themselves.

It is my belief that a rating system should be based on objective data, not subjective assumptions. It should rate everyone the same way, not do something different just because the pair is Meckwell.

This does not mean I object to an individual player rating. I simply say I could not find an objective way to define one. As with the story I told about BBO, some players believe their masterpoints are their individual rating.

I have tested my rating with NABC game data as well as club game data. At club level, I tested with 18 months of games from 50 local clubs, covering about 13,000 pairs. The players I know were rated in the ranges I expected. I presented this system only because I think it is an objective rating based purely on game data, and I have done enough study to convince myself it could work. It is not perfect yet; it could get better if I had more data to tune some of the parameters.

I found a link about the Lehman rating. It is a simplified version, and it also links to the detailed version. http://www-personal.umich.edu/~bpl/oksimple.html As you can see in the detailed version, there are a lot of mathematical assumptions, and I could not agree with all of them. I would not say it is completely wrong, but some of them are questionable. For example, if I were to attribute the result of a board between two players, I could give more weight to the declarer than to the dummy; on the defending side, I would give slightly more weight to the opening leader.

I think this effort to divide a rating between players introduces a lot of subjective factors.
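
A sketch of what such a split would look like; every number in the weight table is invented for illustration, which is exactly the kind of subjective factor being objected to:

```python
# Hypothetical role weights -- any concrete choice here is a
# subjective assumption, not part of any published scheme.
WEIGHTS = {"declarer": 0.6, "dummy": 0.4, "leader": 0.55, "defender": 0.45}

def split_board_result(adjustment, role_a, role_b):
    """Divide one pair's board adjustment between its two players
    in proportion to their role weights."""
    total = WEIGHTS[role_a] + WEIGHTS[role_b]
    return (adjustment * WEIGHTS[role_a] / total,
            adjustment * WEIGHTS[role_b] / total)

# The declarer gets the larger share of a +2.0 adjustment:
print(split_board_result(2.0, "declarer", "dummy"))   # (1.2, 0.8)
print(split_board_result(2.0, "leader", "defender"))  # (1.1, 0.9)
```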


You assumed a 50% chance that the ♣9 was discarded by restricted choice. You also assumed a 50-50 chance of East having the ♣2.

Ping Hu
