
My understanding is that some of the Swedes took a look at the Polish light opening systems and started developing their own modifications. This led to the development of a number of systems, starting with Carrot Club, which later spawned Carrotti and O'Carrot.

When the powers that be started cracking down on the forcing pass systems, Carrotti mutated into Magik Diamond.


If a charity is unable to raise and disburse funds efficiently, should it really be operating? I understand that people want to do good things for <whatever>, but at a certain point you really need to ask whether the charity is serving a true cause or the folks administering it.


These numbers are absolutely horrifying.

World-class charities spend $15 on expenses for every $100 that they give out. I understand that the educational foundation is small, that there are lots of fixed costs, and that this reduces the efficiency of the charity. But this deserves to be shut down. It appears to be operating largely for the benefit of its administrators and support orgs, not whatever it is ostensibly supporting.


> I think individual power ratings are more useful to the game long term.

You may very well be right. However, the reason that I (as well as several well-known authorities on this subject, like Mark Glickman) advocate starting with a pairs rating system is that it is a very useful first step on the way to generating accurate individual ratings.

Developing a pairs rating scheme is a much simpler problem.

If you are unable to solve it, then you probably can't develop a good individual rating scheme either. Conversely, if you can solve this, you can use the pairs ratings as a tool for estimating individual ratings for the different members of the partnerships.
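To make that last step concrete, here is a toy sketch (Python). The pair ratings, the player names, and the assumption that a pair's rating is simply the average of its two members' individual ratings are all illustrative, not taken from any real rating scheme:

```python
import numpy as np

# Hypothetical pair ratings (MP percentages, made up) under the
# assumption that a pair's rating is the average of its members'.
players = ["A", "B", "C", "D"]
pairs = [("A", "B"), ("A", "C"), ("B", "D"), ("C", "D"), ("A", "D")]
pair_ratings = np.array([55.0, 57.0, 49.0, 51.0, 52.0])

# Design matrix: one row per pair, one column per player, with 0.5
# where the player is a member (average-of-members model).
X = np.zeros((len(pairs), len(players)))
for i, (p1, p2) in enumerate(pairs):
    X[i, players.index(p1)] = 0.5
    X[i, players.index(p2)] = 0.5

# Solve for individual ratings by least squares.
individual, *_ = np.linalg.lstsq(X, pair_ratings, rcond=None)
for name, r in zip(players, individual):
    print(f"{name}: {r:.1f}")
```

With enough distinct partnerships per player the system is overdetermined and the individual ratings fall out directly; with too few, some players' ratings are unidentifiable, which is exactly why the pairs problem is the natural first step.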

> I can hardly be the only one where masterpoints are a bad indicator.

Perhaps the real problem is that bridge is a partnership game and you'd be better served establishing some serious partnerships rather than complaining that the partnership desk isn't serving you well?

> And, even for a new partnership, by the time we have enough events for a rating, I already have a good sense of how good we are. So again, not so useful to me.


> Also, would you plan to start with a “small” series of tournaments, to give you a core of gradings that is then gradually extrapolated outwards to involve the entire bridge community?

I would start with a small number of tournaments with a relatively small number of pairs because I need to run trials. If I can't get this to work, then doing anything larger / more complicated seems highly problematic.

In a perfect world, I'll learn useful things that can be applied to the more complicated case.

FWIW, it is very tempting to do all of this using a set of bots.

In theory, I could create different “versions” of Jack whose skills differ based on the number of hands that they are able to use to perform their Monte Carlo simulations.

If I used this as a starting point, I could cheaply generate a large amount of data and I'd have some objective standard to use when I look at my results.
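A minimal sketch of what that objective standard buys you (Python; the bot names, the linear skill-difference model, and the noise level are all invented for illustration). We give each bot a known true skill, simulate noisy board results, and check that the estimated ranking recovers the known ordering:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical bot pairs whose true skill is known in advance -- the
# "objective standard" that a rating scheme should recover.
true_skill = {"bot_40": 0.40, "bot_50": 0.50, "bot_60": 0.60}
names = list(true_skill)

def play_board(skill_a, skill_b):
    """Pair A's matchpoint fraction on one simulated board (toy model:
    expected score is 0.5 plus the skill difference, plus noise)."""
    expected = 0.5 + (skill_a - skill_b)
    return np.clip(expected + rng.normal(0, 0.15), 0.0, 1.0)

# Round robin over many boards, then estimate skill from average score.
scores = {n: [] for n in names}
for _ in range(2000):
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            s = play_board(true_skill[a], true_skill[b])
            scores[a].append(s)
            scores[b].append(1.0 - s)

estimated = {n: np.mean(v) for n, v in scores.items()}
ranking = sorted(names, key=estimated.get, reverse=True)
print(ranking)
```

Because the true skills are known by construction, any candidate rating algorithm can be scored on how quickly and how accurately it recovers them.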


> I also wondered if using Match Points took away the need for calculating separate standard deviations in IMPs for every separate board.

Converting to MP centers and scales the data and removes the need to do this step. However, the inherent variance in the results should still be very important.

If a board was dead flat, it probably says more about the characteristics of that board than about the strength of the pair that you competed against.
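To make the flat-board point concrete, here is a toy sketch (Python; the raw scores are invented) that converts one board's raw scores to matchpoint fractions and computes the spread across the field:

```python
import numpy as np

def matchpoint_fractions(raw_scores):
    """Convert one board's raw scores into matchpoint fractions (0..1):
    one point per result beaten, half per tie (excluding yourself),
    scaled by the top available score, n - 1."""
    raw = np.asarray(raw_scores, dtype=float)
    mp = np.array([np.sum(s > raw) + 0.5 * (np.sum(s == raw) - 1) for s in raw])
    return mp / (len(raw) - 1)

flat = matchpoint_fractions([620, 620, 620, 620])       # every table identical
swingy = matchpoint_fractions([1430, 680, 650, -100])   # big swings

print(flat, flat.std())      # all 0.5, sd = 0.0
print(swingy, swingy.std())  # spread out, sd much larger
```

The per-board standard deviation of these fractions is exactly the kind of covariate being discussed: a dead-flat board contributes sd = 0 and tells you almost nothing about the pair you faced.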


> I am puzzled at the reason for the reduction of results that compare your partnership against an entire field to a comparison with another individual pair.

Couple comments:

1. I like to start with simple examples and move to more complicated ones.

A model in which I am trying to understand how well pair A does versus pair B given some (reasonable) number of boards is as simple as it gets. I am adding in the standard deviation of the MP results to try to incorporate some information regarding the variance in the set of the field results.

2. Overfitting is going to be a big concern.

If I want to add information about “the field”, more specifically, who is playing whom, then I need to add a whole bunch of categorical variables. And, ultimately these categorical variables are going to get expanded into enormous design matrices with lots of 1's and 0's.

So, let's assume that my local club has a total of 6 tables. (12 pairs in total)

I need on the order of C(12,2) = 66 different indicator variables to code all the different combinations of pairs that might plausibly be competing against each other on a given board. (On any given week, with directions fixed, I could probably get away with 6 × 6 = 36; however, different people sit in different directions week by week, so…)

With this many explanatory variables and relatively few board results any resulting model really runs the risk of overfitting like mad. And, in turn, this means that a whole bunch of different variables are going to get dropped during the modelling process.
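The dummy-variable explosion can be sketched in a few lines (Python; the 6-table setup is the one assumed above, and pairs are identified by index 0..11 purely for illustration):

```python
import itertools

import numpy as np

# A 6-table club game: 12 pairs, one indicator column per possible
# matchup between two pairs.
n_pairs = 12
matchups = list(itertools.combinations(range(n_pairs), 2))
print(len(matchups))  # prints 66: C(12, 2) columns before a board is played

def encode_board(a, b):
    """One-hot design-matrix row for a single board between pairs a and b."""
    x = np.zeros(len(matchups))
    x[matchups.index(tuple(sorted((a, b))))] = 1.0
    return x

row = encode_board(3, 10)  # a single board contributes a single 1
```

With 66 mostly-zero columns and perhaps 24 board results per week, the design matrix is far wider than the data is deep, which is exactly the overfitting risk described above.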


I spent some time this weekend playing around with some different approaches to rating the performance of pairs of bridge players. The primary motivation was to get a better idea of how to structure the sets of independent variables and arrange the data so I could feed it into some kind of recurrent neural network. I ran into a number of interesting design choices, many of which required making assumptions about how complex a model I wanted to use. I thought that some of this might be of wider interest, and I am hoping that folks might have opinions on some of these topics.

(Please note: I was working with Match Points. A whole bunch of this analysis is a lot easier if we’re looking at large teams events, especially if they are using a KO-type format. It is somewhat tempting to start by developing a model that would look at the Bermuda Bowl in year foo or some such and try to model how strong each of the pairs / teams were in that year.)

I have a time series that shows the weekly board results at a bridge club. My data set includes the following pieces of information.

1. The board number (which codes information about vulnerability)

2. The set of scores that occurred when this board was played

3. The set of pairs who competed in each event

4. Which pairs competed against one another on this hand

5. The direction that each pair was sitting on a given hand

(I also have the specific hands that were dealt for each board, but I am already incredibly concerned about overfitting, so I’m not going to bring that into the analysis at this point in time.)

At the most basic level, the goal here is to create a predictive model that can look at a set of past scores and accurately predict the results of a future event. What makes this problem interesting is the amount of feedback between different parts of this system. For example, we might consider creating a naïve model in which we assume that each pair has some “objective” degree of skill. Each time pair foo plays against pair bar there is an expected value for this result, along with some degree of variance. Look at enough hands and we can estimate how strong pair foo is relative to pair bar. (All well and good.)

However, life is a bit more complicated than this. Bridge pairs play boards against more than one opposing pair. We want to know how strong a given pair is against a field of potential opponents AND to make matters more complicated, we can’t just develop a whole bunch of pairwise comparisons because we want to estimate how well a pair might perform if they competed against some other pair that they have never faced before.

And, of course, we could certainly take things a bit further. In reality, some boards are going to be naturally flat while other boards are going to be swingy. We could incorporate this information into our modelling process. For example, we could calculate the standard deviation of the MP scores that occurred on a given board and include this as an independent variable. (Imagine a model in which a player’s expected score was a function of the pair that they played against and the standard deviation of the results on this board across the field.)

If we wanted to get even more complicated, the set of board results are, themselves, a function of the set of pairs that competed on your board. So, in this case what we actually have is a large set of simultaneous equations in which all the pair ratings need to be calculated across the full set of pairings.
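One way to see the "simultaneous equations" framing in code (Python; the number of pairs, the hidden ratings, and the noise level are invented): model each board result as the rating difference between the two pairs plus noise, stack every pairing into one linear system, and solve all ratings at once by least squares:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy setup: 5 pairs with hidden "true" ratings, expressed as the
# expected matchpoint deviation from 50% (illustrative numbers only).
true = np.array([0.10, 0.05, 0.0, -0.05, -0.10])
n = len(true)

# Each simulated board: two pairs meet; the observed result is the
# rating difference plus noise.
rows, y = [], []
for _ in range(500):
    a, b = rng.choice(n, size=2, replace=False)
    x = np.zeros(n)
    x[a], x[b] = 1.0, -1.0
    rows.append(x)
    y.append(true[a] - true[b] + rng.normal(0, 0.1))

X, y = np.array(rows), np.array(y)

# Ratings are only identified up to an additive constant, so append a
# soft constraint that they sum to zero, then solve everything
# simultaneously.
X = np.vstack([X, np.ones(n)])
y = np.append(y, 0.0)
est, *_ = np.linalg.lstsq(X, y, rcond=None)
print(est.round(3))  # roughly recovers the hidden ratings
```

Crucially, every pair's rating depends on every other pair's through the shared system, which is the feedback described above: you cannot rate one pairing in isolation.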


Incorrect

Throughout the discussions about ratings systems I have been pretty consistent in

1. Stating that people need to decide on why they want a rating system

2. Stating that if you want an accurate rating system, here's how to proceed

I will readily admit that I don't consider

1. Ratings systems as marketing scams

2. Ratings systems that aren't particularly accurate

to be interesting topics for discussion.

As for unintended consequences… While there is certainly the possibility of unintended consequences from some given action, the same holds true for failing to take action. That's why you need some kind of measurement tool.


If the masterpoint committee is seriously interested in this sort of work, they should hold a public contest and solicit multiple entries.

From my own perspective, if the Masterpoint committee really wants to overhaul MP allocations, adding in strength-of-field considerations should be secondary to compensating for the intrinsic variance in events; by which I mean that they should start bootstrapping results and using confidence bounds to adjust MP allocations.

MUCH easier to implement.

I'll also note that strength of field adjustments seem most necessary when you have poor mixing across player populations and, regretfully, this is also the situation where your ratings schemes are going to run into the greatest degree of difficulty.
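The bootstrapping idea can be sketched very simply (Python; the session scores are made-up matchpoint fractions, and the idea of keying MP awards to the confidence interval is the proposal above, not an existing procedure):

```python
import numpy as np

rng = np.random.default_rng(2)

# One pair's matchpoint fractions over a 24-board session (invented).
boards = np.array([0.9, 0.4, 0.6, 0.7, 0.2, 0.8, 0.5, 0.5, 1.0, 0.3,
                   0.6, 0.7, 0.4, 0.9, 0.1, 0.6, 0.5, 0.8, 0.7, 0.4,
                   0.6, 0.3, 0.9, 0.5])

# Bootstrap: resample boards with replacement to see how much the
# session percentage moves under the event's intrinsic variance.
resampled = rng.choice(boards, size=(10_000, len(boards)), replace=True)
session_means = resampled.mean(axis=1)
lo, hi = np.percentile(session_means, [2.5, 97.5])
print(f"session: {boards.mean():.3f}  95% CI: [{lo:.3f}, {hi:.3f}]")
```

An allocation rule could then, for example, scale the award by how far the lower confidence bound clears 50%, which penalizes wins that are indistinguishable from noise. Much easier than a full rating system, as noted.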


> And it seems to me as a layman it would be a good basis for a universal system.

As a non-layman, let me try to explain why you are wrong:

1. There is good reason to believe the accuracy of the NGS system can be improved.

At the most basic level, the NGS grading system ignores information that almost certainly has a significant impact on players' performance. Most notably:

The NGS does not typically calculate results on a board-by-board basis. While it has a notion of the overall strength of the field, it doesn't usually consider whether you played against strong pairs or weak pairs competing in that event.

In a similar vein, the NGS doesn't factor in whether you played a flat board against a strong pair or a weak pair.

There are a lot of ways in which this system can plausibly be improved.

2. No one is saying that the NGS shouldn't be used. Rather, people are suggesting that a variety of different approaches should be evaluated and tested so that we can make an informed decision about what the best approach might be.


Spent some time thinking about this on the commute into work:

From my perspective, the most important issue to focus on is the sources of variance in players’ results. I am going to (superficially) divide the sources of variance into three large buckets:

1. Players’ skills change over time. Beginners can improve. Established players can fall out of practice. Perhaps your partnership switched bidding systems, which causes a short-term decrease in skill but leads to a long-term improvement.

2. Players can have a “good” or a “bad” day. This can incorporate a wide variety of noise sources including: Did a player get a good night’s sleep the night before? Were the boards such that an inferior line of play worked whilst a superior one failed…

3. Issues specific to an individual event. For example: Which pairs were sitting in my direction versus the opposite direction? Which boards did I play against which pairs? (If I play a very flat board against a top pair my score is likely to improve. If I play the same board against a very weak pair this is going to hurt my performance).

While I like the way in which you are conducting your analysis, I think that you are focusing too much on the first two issues while ignoring / accepting the third. More specifically, you state that the results of bridge matches are going to be intrinsically noisy without recognizing that a modelling technique that explicitly accounts for what might otherwise look like noise can decrease the variance.

Taking this back to the realm of designing a rating system: A ratings system that operates on a board by board basis and incorporates information such as

1. The complete set of pairings across boards

2. Whether or not the scores on a given board were relatively flat or swingy

has the potential to improve on the system that you are currently characterizing. The art of predictive modeling is incorporating the right set of explanatory variables such that you are accurately capturing the signal without incorporating the noise component.
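One possible shape for such a board-by-board scheme (Python; the Elo-style update, the weighting function, and all numbers are modelling assumptions for illustration, not an established method): scale each rating step by how flat the board was, on the theory that a swingy board is noisier evidence about the pairs involved:

```python
def update_rating(rating_a, rating_b, score_a, board_sd, k=8.0):
    """One Elo-style update from a single board's matchpoint fraction.

    score_a is pair A's matchpoint fraction (0..1) on the board, and
    board_sd the standard deviation of the field's fractions on it.
    Swingier boards are treated as noisier evidence and get a smaller
    step -- an assumption, and the opposite choice is also defensible.
    """
    expected = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    weight = 1.0 / (1.0 + board_sd)          # down-weight swingy boards
    delta = k * weight * (score_a - expected)
    return rating_a + delta, rating_b - delta

# The same 75% result moves ratings more on a flat board than a swingy one.
flat = update_rating(1500, 1500, 0.75, board_sd=0.05)
swingy = update_rating(1500, 1500, 0.75, board_sd=0.45)
print(flat, swingy)
```

The point is not this particular formula but that board flatness enters the update at all, which is precisely the information the system being characterized ignores.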

I’d like to close with a (potentially) extreme example. Back in the weird old days when I was playing MOSCITO in a regular partnership, I started to notice that our board results varied dramatically across our opening bids. While I was quite happy with our overall results, I couldn’t help but notice that we tended to generate much better scores opposite our constructive limited openings and our preempts than we did when we held a strong club opening. (Our weak NT was pretty middling.) These effects were significant enough that they were going to skew where we placed. If we were lucky and got dealt a whole bunch of 1D/1H/1S/2D/2H/2S openings we’d end up with a rock crusher. Conversely, if we got unlucky and got dealt a whole bunch of strong club openings in 1st/2nd seat we might have a fairly mediocre game.

[Note: I am not necessarily arguing that board characteristics such as the actual hands should be built into a rating system. Arguably, choice of bidding system is part of your skill at playing bridge. I am simply pointing out that it is possible to go quite a bit further in capturing sources of variance.]


Couple immediate questions:

1. What algorithm are you using to calculate ratings?

2. How do you use this rating to generate a predicted score for a pair?

3. When you are generating your predicted versus actual chart, how did you generate this data? (How many pairs / events?)

And an immediate reaction:

I am gratified that you are backtesting. I think that this sort of empirical work is critical to any serious evaluation.


Richard Willey

> I indicated which you didn't answer.

de gustibus non disputandum est

Richard Willey

> gone unanswered. Has one been published or even considered?

I doubt it

Richard Willey


> And, even for a new partnership, by the time we have enough events for a rating, I already have a good sense of how good we are. So again, not so useful to me.

This isn't all about you…




Richard Willey

> systems will affect the amount of duplicate bridge players choose to play?

Do you honestly believe that this can be done with any degree of accuracy?

I think that you're disguising a complaint as a request for (essentially meaningless) analysis.


Richard Willey

Need to spend a bit of time re-reading and thinking about it.

Richard Willey

Declarer can look at their hand and dummy and decide whether to play

1NT

two of either major

three of either minor

Any game

Any slam

And will get the appropriate plus score if successful
