Performance Ratings (PR) & ELO: Measuring Backgammon Skill
There are two skill-measurement systems in serious competitive backgammon. ELO is the rating-based system inherited from chess, refined for backgammon's higher variance. Performance Rating (PR) is the error-rate-based system that emerged in the late 1990s, once neural-network engines became strong enough to serve as evaluation oracles. Both are in active use, and a serious competitive profile carries both: an ELO rating for matches played, and a PR for the quality of play across them.
This page covers the formal definition of each, the modern reference benchmarks, the historical PR/ELO profile of the original GamesGrid bot family (GG Forever, GG Raccoon, GG Otter, GG Weasel, GG Chipmunk, MrHyperBot), and the framework the 2026 platform uses to surface these metrics in player profiles and bot opponent selection.
For the underlying engines and lineage, see Bots & AI. For the math that PR depends on (equity, match equity, gammon prices), see the mathematics pillar.
1. Performance Rating (PR)
Performance Rating measures the quality of play across a match (or sequence of matches) by comparing each move and each cube decision against a reference bot's evaluation. The metric is expressed in millipoints lost per move, abbreviated mEMG (milli-Equivalent-Money-Game), and lower is better: a PR of zero is perfect play, matching the reference bot's recommendations on every decision.
1.1 The PR Calculation
Formally, for a match of $n$ non-forced moves:

$$\mathrm{PR} = \frac{1000}{n} \sum_{i=1}^{n} \left( E_i^{*} - E_i \right)$$

where $E_i^{*}$ is the equity (in money-game points) of the best move available on turn $i$, and $E_i$ is the equity of the move the player actually made. The factor of 1000 converts points to millipoints.
Cube decisions are folded into the same formula by treating them as moves with their own equity loss: an incorrect take is the equity loss compared to a correct drop (or vice versa), divided across the move count of the surrounding game in the standard convention.
The reference bot for the modern PR standard is eXtreme Gammon (XG2) at 4-ply evaluation, often with truncated rollouts for high-leverage positions. GNU Backgammon at 2-ply or 3-ply is the open-source equivalent and is used by most amateur analysis tools. The two reference oracles produce PRs within roughly 0.5 mEMG of each other across typical matches.
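The per-decision accumulation described above can be sketched in a few lines of Python. This is a minimal illustration of the arithmetic, not any engine's actual implementation; the decision list and equity values in the example are invented for demonstration:

```python
def performance_rating(decisions):
    """Compute PR in mEMG from (best_equity, played_equity) pairs.

    Each pair covers one non-forced checker play or cube decision;
    equities are in money-game points. Lower PR is better; 0 is
    perfect play relative to the reference evaluation.
    """
    if not decisions:
        raise ValueError("no decisions to rate")
    total_loss = sum(best - played for best, played in decisions)
    return 1000.0 * total_loss / len(decisions)

# Four decisions: two perfect, one 0.020-point error, one 0.005-point error.
decisions = [(0.512, 0.512), (0.130, 0.110), (-0.240, -0.245), (0.060, 0.060)]
print(round(performance_rating(decisions), 2))  # 6.25 (mEMG)
```

In a real analysis pipeline the equity pairs would come from the reference bot's evaluation of each position; the formula itself is just this average loss scaled to millipoints.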
1.2 Reference Benchmarks
The following PR benchmarks are widely cited in modern competitive backgammon. They describe the typical PR of players in each skill band when their matches are analysed against XG2:
| Skill band | PR (mEMG) | Examples |
|---|---|---|
| World-class | < 3.0 | Top tournament finishers, World Championship contenders. |
| Expert | 3.0–5.0 | Strong tournament regulars; PR 4.0 is the loose threshold for "expert." |
| Strong club player | 5.0–8.0 | Active competitive player with significant study background. |
| Intermediate | 8.0–12.0 | Solid recreational player with structured study. |
| Casual / beginner | 12.0+ | Less-experienced player; substantial scope for improvement. |
These are not strict cutoffs. PR varies match-to-match: a strong player can play a "PR 8" match through poor concentration, and an intermediate player can play a "PR 4" match by getting lucky with simple positions. The signal becomes reliable across 50+ matches.
1.3 PR Variance and Sample Size
A single match has a PR standard deviation of roughly 2.0 mEMG for typical match lengths (5 to 9 points), driven primarily by the random selection of positions encountered. The standard error of an aggregate PR across $n$ matches falls as $1/\sqrt{n}$, so:
| Sample size (matches) | PR standard error |
|---|---|
| 1 | ~2.0 mEMG |
| 10 | ~0.63 mEMG |
| 50 | ~0.28 mEMG |
| 100 | ~0.20 mEMG |
| 500 | ~0.09 mEMG |
A reliable PR comparison between two players requires roughly 100+ matches against comparable opponent pools. This is why serious skill-assessment tournaments report a rolling PR over the player's most recent matches rather than a single-match PR: a single-match number is statistically noisy enough to be misleading.
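The standard-error table follows directly from dividing the single-match standard deviation by the square root of the sample size. A minimal check, assuming the ~2.0 mEMG single-match SD quoted above:

```python
import math

def pr_standard_error(n_matches, single_match_sd=2.0):
    """Standard error of an aggregate PR over n independent matches.

    Assumes match-level PRs are roughly independent with a common SD
    (~2.0 mEMG for typical 5- to 9-point matches, per the text).
    """
    return single_match_sd / math.sqrt(n_matches)

for n in (1, 10, 50, 100, 500):
    print(f"{n:>4} matches: +/-{pr_standard_error(n):.2f} mEMG")
```

Running this reproduces the table: ~0.63 mEMG at 10 matches, ~0.28 at 50, ~0.20 at 100, ~0.09 at 500.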
2. ELO
The ELO rating system was developed for chess by Arpad Elo in 1960 and adapted for backgammon's higher variance by Kevin Bastian and others in the early FIBS days (1992–1995). FIBS used the modified formula that became the de facto standard across online backgammon servers, including GamesGrid.
2.1 The FIBS ELO Formula
Backgammon ELO adjusts the standard chess formula by weighting rating changes by match length: a longer match is a more reliable skill measurement and therefore moves the ratings more. The expected match-winning probability is

$$P = \frac{1}{1 + 10^{\,(R_o - R_p)\sqrt{L}/2000}}$$

where $P$ is the player's expected match-winning probability, $R_o$ is the opponent's rating, $R_p$ is the player's rating, and $L$ is the match length in points.
The rating change after a match is

$$\Delta R = K \sqrt{L}\,(\mathrm{actual} - P)$$

where actual is 1 for a win and 0 for a loss, and $K$ is a constant, typically 4 for established players and higher for new players (so their ratings stabilise faster).
Under this formula, a 100-point rating differential corresponds to only about a 53% expected win probability in a single 1-point match; the match-length weighting means the same 100-point differential in a 13-point match implies roughly a 60% expected win probability. The chess calibration, in which 100 points corresponds to roughly a 64% expected score, does not carry over directly because of backgammon's dice variance.
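The FIBS-style update can be written compactly as follows. The function names are illustrative, and the 1500/1600 ratings and 7-point match in the example are arbitrary:

```python
import math

def expected_win_prob(rating, opp_rating, match_length):
    """FIBS-style expected probability that `rating` wins the match."""
    diff = opp_rating - rating
    return 1.0 / (1.0 + 10.0 ** (diff * math.sqrt(match_length) / 2000.0))

def rating_change(rating, opp_rating, match_length, won, k=4.0):
    """Rating delta after one match; k = 4 for established players,
    higher for new players so their ratings stabilise faster."""
    p = expected_win_prob(rating, opp_rating, match_length)
    actual = 1.0 if won else 0.0
    return k * math.sqrt(match_length) * (actual - p)

# A 1500-rated player upsets a 1600-rated opponent in a 7-point match:
p = expected_win_prob(1500, 1600, 7)         # ~0.42
delta = rating_change(1500, 1600, 7, won=True)  # ~+6.1 points
```

Note the underdog gains more from a win than the favourite would, and longer matches move both ratings further, exactly the match-length weighting described above.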
2.2 ELO Benchmarks
GamesGrid (1996–2008) used a base rating of 1500 for new accounts, matching the FIBS convention. The original published rating bands:
| ELO band | Skill bracket |
|---|---|
| Below 1300 | Novice |
| 1300 – 1500 | Beginner |
| 1500 – 1700 | Intermediate |
| 1700 – 1900 | Strong / Expert |
| 1900 – 2100 | Master / Tournament regular |
| 2100+ | Top-tier / international competitor |
These bands have remained relatively stable across the 1996–2026 period. The 2026 platform uses the same base rating of 1500 with the FIBS-style match-length-weighted ELO formula, ensuring rating continuity for returning veterans of the original server.
3. The Original GamesGrid Bot Profiles
The 1996–2008 GamesGrid bot family is documented in detail on the Bots & AI page. The bots were graded by induced error frequency applied to a common GNUbg-based evaluation engine, producing a controlled skill ladder. Their published ELO ratings (low/average/high across many thousands of matches) are quoted verbatim from the archived 2003–2004 GamesGrid FAQ and are authentic historical data:
| Bot | Evaluation depth | ELO low | ELO avg | ELO high | Model-derived PR estimate |
|---|---|---|---|---|---|
| GG Forever | 2-ply, Life Members only | 1850 | 1920 | 2114 | ~4–5 |
| GG Raccoon | 0-ply (no lookahead) | 1850 | 1920 | 2114 | ~5–6 |
| GG Otter | 0-ply + induced errors | 1543 | 1701 | 1827 | ~9–11 |
| GG Weasel | 0-ply + frequent induced errors | 1410 | 1516 | 1652 | ~13–16 |
| GG Chipmunk | 0-ply + high-frequency induced errors | 1171 | 1275 | 1487 | ~18–22 |
| MrHyperBot | Hypergammon database (exact) | 1797 | 1953 | 2109 | Perfect play in the hypergammon variant |
Caveat on the PR column. The PR-equivalent values are model-derived estimates inferred from the bots' published ELO ranges and from the typical relationship between ELO and PR observed in modern competitive play. The original GamesGrid match logs are not currently accessible for direct XG2 re-evaluation; if those logs become available, the column will be replaced with measured values. Treat the published PR estimates as rough banding rather than precise ratings.
A few observations from the table:
- GG Forever and GG Raccoon were the same underlying engine, but Forever used 2-ply lookahead while Raccoon used 0-ply. The 2-ply lookahead produces a meaningful PR improvement (roughly 0.5–1.0 mEMG in modern terms) but did not produce dramatically different ELO outcomes against the typical 1996–2008 player pool, because most opponents were not skilled enough to systematically exploit the 0-ply weaknesses.
- MrHyperBot played a fully solved variant (hypergammon, a 3-checker game) from an exact-position database. Its PR within the variant is, in principle, zero, since every move was the proven best; but in the smaller hypergammon state space the dice contribute proportionally more to outcomes, so its ELO ratings still varied by 312 points over its operational lifetime purely from variance.
- GG Chipmunk at the bottom of the ladder was designed to be beatable by recreational players. Its 0-ply + high-error mode produced highly entertaining games (more blunders, more contact, more action) at the cost of game quality. This was a deliberate design choice for the bot's role.
Standard Deviation of Error Rates (model-derived)
The original bots' induced-error patterns were stochastic, not deterministic: a fixed proportion of moves were forced to suboptimal alternatives chosen at random from positions where the equity loss fell within a target range. The expected per-match standard deviation of the resulting error rate (in mEMG) is governed by the bot's mean error rate and the typical SD/mean ratio for stochastic induced-error systems; the model-derived estimates are:
| Bot | Mean error rate (estimate) | SD per match (estimate, mEMG) |
|---|---|---|
| GG Forever (2-ply) | ~4.5 | ~1.0 |
| GG Raccoon (0-ply) | ~5.0 | ~1.2 |
| GG Otter | ~10.0 | ~2.0 |
| GG Weasel | ~15.0 | ~2.5 |
| GG Chipmunk | ~20.0 | ~3.0 |
Caveat. These are model-derived estimates based on the typical SD/mean ratio for stochastic induced-error systems and the bots' published ELO ranges, not direct measurements. The higher-error bots are expected to show higher match-to-match variance, which is consistent with the substantial "rating swing" behaviour documented in the bots' operational ELO histories.
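The stochastic induced-error mechanism described above can be sketched as follows. This is a hypothetical reconstruction, not the original GamesGrid code: the function name, the default error rate, the loss band, and the candidate-move representation are all illustrative assumptions:

```python
import random

def pick_move(candidates, error_rate=0.15, loss_band=(0.02, 0.12), rng=random):
    """Hypothetical induced-error move selection.

    `candidates` is a list of (move, equity) pairs sorted best-first.
    With probability `error_rate`, play a random suboptimal move whose
    equity loss falls inside `loss_band` (so errors are plausible
    blunders, not absurd throwaways); otherwise play the best move.
    """
    best_move, best_eq = candidates[0]
    if rng.random() < error_rate:
        pool = [move for move, eq in candidates[1:]
                if loss_band[0] <= best_eq - eq <= loss_band[1]]
        if pool:
            return rng.choice(pool)
    return best_move
```

Because the error trigger and the alternative chosen are both random, the realised per-match error rate fluctuates around its mean, which is exactly why the higher-error bots in the table above are expected to show larger match-to-match swings.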
4. The 2026 Bot Framework
The new GamesGrid retains the design philosophy of the original bot family: a graded skill ladder, named characters, induced-error mechanisms for the weaker bots, and full-strength NN play for the top of the ladder. The original GG Forever, GG Raccoon, GG Otter, GG Weasel, GG Chipmunk, and MrHyperBot are all reconstituted on the new engine, calibrated where possible to match their historical playing fingerprints. They are joined by a new generation of named bot opponents that fill out the full skill ladder, with character profiles spanning beginner through world-championship calibre.
The specific bot roster, the structure of the Career Mode bot leagues that they populate, and the precise PR profile of the strongest 2026 bots will be published closer to launch.
Two commitments hold publicly today:
- PR is reported, not hidden. Player profile pages display rolling PR alongside ELO, calculated against the same reference oracle the wider community uses.
- No bot is calibrated to lose. The graded weak bots play their published level: they make errors at their published rate, not more, not less. There is no algorithmic "adjustment" of bot strength based on player frustration or session-retention metrics. The dice are uniform and the bot's evaluation is its published evaluation, every game, for every player.
See Also
- Bots & AI – engine lineage and RNG integrity.
- Mathematics: Match Equity – the MET that PR calculations consume.
- History – the 1996–2008 era and the bot family's original deployment.
- Glossary – formal definitions for PR, ELO, mEMG, n-ply.