Performance Ratings (PR) & ELO: Measuring Backgammon Skill

There are two skill-measurement systems in serious competitive backgammon. ELO is the rating-based system inherited from chess, refined for backgammon's higher variance. Performance Rating (PR) is the error-rate-based system that emerged in the late 1990s once neural-network engines became strong enough to serve as evaluation oracles. Both are in active use, and a serious competitive profile carries both: an ELO rating for matches played, and a PR for the quality of play across them.

This page covers the formal definition of each, the modern reference benchmarks, the historical PR/ELO profile of the original GamesGrid bot family (GG Forever, GG Raccoon, GG Otter, GG Weasel, GG Chipmunk, MrHyperBot), and the framework the 2026 platform uses to surface these metrics in player profiles and bot opponent selection.

For the underlying engines and lineage, see Bots & AI. For the math that PR depends on (equity, match equity, gammon prices), see the mathematics pillar.


1. Performance Rating (PR)

Performance Rating measures the quality of play across a match (or sequence of matches) by comparing each move and each cube decision against a reference bot's evaluation. The metric is expressed in millipoints lost per move, abbreviated mEMG (milli-Equivalent-Money-Game), and lower is better. A PR of zero is perfect play: the player matched the reference bot's recommendation on every decision.

1.1 The PR Calculation

Formally, for a match of $N$ non-forced moves:

$$\text{PR} = \frac{1000}{N} \sum_{i=1}^{N} \left( E_i^{\text{best}} - E_i^{\text{actual}} \right)$$

where $E_i^{\text{best}}$ is the equity (in money-game points) of the best move available on turn $i$, and $E_i^{\text{actual}}$ is the equity of the move the player actually made. The factor of 1000 converts points to millipoints.

Cube decisions are folded into the same formula by treating them as moves with their own equity loss: an incorrect take is the equity loss compared to a correct drop (or vice versa), divided across the move count of the surrounding game in the standard convention.
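A minimal sketch of the calculation above; the function name and the decision-list representation are illustrative, and a real analyser would pull the equities from the reference bot's evaluation:

```python
def performance_rating(decisions):
    """PR in mEMG over a list of (best_equity, actual_equity) pairs.

    Each pair covers one non-forced checker play or cube decision,
    with equities in money-game points from the reference evaluation.
    """
    if not decisions:
        raise ValueError("need at least one non-forced decision")
    total_loss = sum(best - actual for best, actual in decisions)
    return 1000.0 * total_loss / len(decisions)  # points -> millipoints

# One 60-millipoint blunder over three decisions:
pr = performance_rating([(0.512, 0.452), (0.130, 0.130), (-0.220, -0.220)])
print(round(pr, 3))  # 20.0 mEMG
```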

The reference bot for the modern PR standard is eXtreme Gammon (XG2) at 4-ply evaluation, often with truncated rollouts for high-leverage positions. GNU Backgammon at 2-ply or 3-ply is the open-source equivalent and is used by most amateur analysis tools. The two reference oracles produce PRs within roughly 0.5 mEMG of each other across typical matches.

1.2 Reference Benchmarks

The following PR benchmarks are widely cited in modern competitive backgammon. They describe the typical PR of players in each skill band when their matches are analysed against XG2:

| Skill band | PR (mEMG) | Examples |
| --- | --- | --- |
| World-class | < 3.0 | Top tournament finishers, World Championship contenders. |
| Expert | 3.0 – 5.0 | Strong tournament regulars; PR 4.0 is the loose threshold for "expert". |
| Strong club player | 5.0 – 8.0 | Active competitive player with significant study background. |
| Intermediate | 8.0 – 12.0 | Solid recreational player with structured study. |
| Casual / beginner | 12.0+ | Less-experienced player; substantial scope for improvement. |

These are not strict cutoffs. PR varies match-to-match: a strong player can turn in a "PR 8" match through poor concentration, and an intermediate player can produce a "PR 4" match by getting lucky with simple positions. The signal becomes reliable across 50+ matches.

1.3 PR Variance and Sample Size

A single match has a PR standard deviation of roughly 2.0 mEMG for typical match lengths (5 to 9 points), driven primarily by the random selection of positions encountered. The standard error of an aggregate PR across $n$ matches falls as $1/\sqrt{n}$, so:

| Sample size (matches) | PR standard error |
| --- | --- |
| 1 | ~2.0 mEMG |
| 10 | ~0.63 mEMG |
| 50 | ~0.28 mEMG |
| 100 | ~0.20 mEMG |
| 500 | ~0.09 mEMG |
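
The $1/\sqrt{n}$ scaling is straightforward to reproduce; the 2.0 mEMG single-match constant is the estimate quoted above, and the function name is illustrative:

```python
import math

SINGLE_MATCH_SD = 2.0  # mEMG, typical for 5-9 point matches

def pr_standard_error(n_matches):
    """Standard error of an aggregate PR over n independent matches."""
    return SINGLE_MATCH_SD / math.sqrt(n_matches)

for n in (1, 10, 50, 100, 500):
    print(f"{n:>4} matches: ~{pr_standard_error(n):.2f} mEMG")
```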

A reliable PR comparison between two players requires roughly 100+ matches against comparable opponent pools. This is why serious skill-assessment tournaments report rolling PR over the player's last $N$ matches rather than a single-match PR: a single-match number is statistically noisy enough to be misleading.


2. ELO

The ELO rating system was developed for chess by Arpad Elo in 1960 and adapted for backgammon's higher variance by Kevin Bastian and others in the early FIBS days (1992–1995). FIBS used the modified formula that became the de facto standard across online backgammon servers, including GamesGrid.

2.1 The FIBS ELO Formula

Backgammon ELO adjusts the standard chess formula by weighting rating changes by match length: a longer match is a more reliable skill measurement and therefore moves the ratings more.

$$E = \frac{1}{1 + 10^{(R_o - R_p)\sqrt{M}/2000}}$$

where $E$ is the player's expected match-winning probability, $R_o$ is the opponent's rating, $R_p$ is the player's rating, and $M$ is the match length in points.

The rating change after a match:

$$\Delta R = K \cdot \sqrt{M} \cdot (\text{actual} - E)$$

where actual is 1 for a win and 0 for a loss, and $K$ is a constant: typically 4 for established players and higher for new players, so that new ratings stabilise faster.

Under this formula, a 100-point rating differential corresponds to roughly a 53% expected win probability in a single 1-point match; the match-length weighting means the same differential in a 13-point match implies a substantially larger (~60%) expected win probability for the favourite.
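The two formulas translate directly into code; the function names and the established-player default $K = 4$ are drawn from the description above:

```python
import math

def expected_win_prob(r_player, r_opponent, match_length):
    """FIBS-style expected match-win probability for the player."""
    diff = r_opponent - r_player
    return 1.0 / (1.0 + 10.0 ** (diff * math.sqrt(match_length) / 2000.0))

def rating_change(r_player, r_opponent, match_length, won, k=4.0):
    """Rating delta after one match (won: True/False); K=4 for established players."""
    e = expected_win_prob(r_player, r_opponent, match_length)
    return k * math.sqrt(match_length) * ((1.0 if won else 0.0) - e)

# A 100-point favourite's edge grows with match length:
print(round(expected_win_prob(1600, 1500, 1), 3))   # 0.529
print(round(expected_win_prob(1600, 1500, 13), 3))  # 0.602
```

Note that the loser drops exactly what the winner gains, so the rating pool is zero-sum.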

2.2 ELO Benchmarks

GamesGrid (1996–2008) used a base rating of 1500 for new accounts, matching the FIBS convention. The original published rating bands:

| ELO band | Skill bracket |
| --- | --- |
| Below 1300 | Novice |
| 1300 – 1500 | Beginner |
| 1500 – 1700 | Intermediate |
| 1700 – 1900 | Strong / Expert |
| 1900 – 2100 | Master / Tournament regular |
| 2100+ | Top-tier / international competitor |

These bands have remained relatively stable across the 1996–2026 period. The 2026 platform uses the same base rating of 1500 with the FIBS-style match-length-weighted ELO formula, ensuring rating continuity for returning veterans of the original server.


3. The Original GamesGrid Bot Profiles

The 1996–2008 GamesGrid bot family is documented in detail on the Bots & AI page. The bots were graded by induced error frequency applied to a common GNUbg-based evaluation engine, producing a controlled skill ladder. Their published ELO ratings (low / average / high across many thousands of matches) are quoted verbatim from the archived 2003–2004 GamesGrid FAQ and are authentic historical data:

| Bot | Evaluation depth | ELO low | ELO avg | ELO high | Model-derived PR estimate |
| --- | --- | --- | --- | --- | --- |
| GG Forever | 2-ply, Life Members only | 1850 | 1920 | 2114 | ~4 – 5 |
| GG Raccoon | 0-ply (no lookahead) | 1850 | 1920 | 2114 | ~5 – 6 |
| GG Otter | 0-ply + induced errors | 1543 | 1701 | 1827 | ~9 – 11 |
| GG Weasel | 0-ply + frequent induced errors | 1410 | 1516 | 1652 | ~13 – 16 |
| GG Chipmunk | 0-ply + high-frequency induced errors | 1171 | 1275 | 1487 | ~18 – 22 |
| MrHyperBot | Hypergammon database (exact) | 1797 | 1953 | 2109 | Perfect play in the hypergammon variant |

Caveat on the PR column. The PR-equivalent values are model-derived estimates inferred from the bots' published ELO ranges and from the typical relationship between ELO and PR observed in modern competitive play. The original GamesGrid match logs are not currently accessible for direct XG2 re-evaluation; if those logs become available, the column will be replaced with measured values. Treat the PR estimates as rough banding rather than a precise rating.

One further model-derived quantity follows from the table:

Standard Deviation of Error Rates (model-derived)

The original bots' induced-error patterns were stochastic, not deterministic: a fixed proportion of moves was forced to suboptimal alternatives, chosen at random from candidates whose equity loss fell within a target range. The expected per-match standard deviation of the resulting error rate (in mEMG) is governed by the bot's mean error rate and the typical SD/mean ratio for stochastic induced-error systems; the model-derived estimates are:

| Bot | Mean error rate (estimate, mEMG) | SD per match (estimate, mEMG) |
| --- | --- | --- |
| GG Forever (2-ply) | ~4.5 | ~1.0 |
| GG Raccoon (0-ply) | ~5.0 | ~1.2 |
| GG Otter | ~10.0 | ~2.0 |
| GG Weasel | ~15.0 | ~2.5 |
| GG Chipmunk | ~20.0 | ~3.0 |

Caveat. These are model-derived estimates based on the typical SD/mean ratio for stochastic induced-error systems and the bots' published ELO ranges, not direct measurements. The higher-error bots are expected to show higher match-to-match variance, which is consistent with the substantial "rating swing" behaviour documented in the bots' operational ELO histories.
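The stochastic induced-error mechanism described above can be sketched as follows; the function name, parameter names, and loss-window convention are illustrative assumptions, not the original GamesGrid implementation:

```python
import random

def induced_error_move(candidates, error_rate, loss_window, rng=random):
    """Pick a move from (move, equity) candidates.

    With probability error_rate, play a random suboptimal alternative whose
    equity loss (vs. the best move) falls inside loss_window = (lo, hi);
    otherwise play the best move. Illustrative reconstruction only.
    """
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    best_move, best_eq = ranked[0]
    if rng.random() < error_rate:
        lo, hi = loss_window
        errors = [m for m, eq in ranked[1:] if lo <= best_eq - eq <= hi]
        if errors:
            return rng.choice(errors)
    return best_move

# error_rate=0 always plays best; higher rates err whenever a candidate
# falls inside the target loss window.
moves = [("24/18", 0.050), ("24/20 13/11", 0.012), ("13/7", -0.310)]
print(induced_error_move(moves, 0.0, (0.01, 0.10)))  # 24/18
```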


4. The 2026 Bot Framework

The new GamesGrid retains the design philosophy of the original bot family: a graded skill ladder, named characters, induced-error mechanisms for the weaker bots, and full-strength NN play for the top of the ladder. The original GG Forever, GG Raccoon, GG Otter, GG Weasel, GG Chipmunk, and MrHyperBot are all reconstituted on the new engine, calibrated where possible to match their historical playing fingerprints. They are joined by a new generation of named bot opponents that fill out the full skill ladder, with character profiles spanning beginner through world-championship calibre.

The specific bot roster, the structure of the Career Mode bot leagues that they populate, and the precise PR profile of the strongest 2026 bots will be published closer to launch.

Two commitments hold publicly today:

  1. PR is reported, not hidden. Player profile pages display rolling PR alongside ELO, calculated against the same reference oracle the wider community uses.
  2. No bot is calibrated to lose. The graded weak bots play their published level: they make errors at their published rate, not more, not less. There is no algorithmic "adjustment" of bot strength based on player frustration or session retention metrics. The dice are uniform and the bot's evaluation is its published evaluation, every game, for every player.
