Performance Ratings (PR) & ELO: Measuring Backgammon Skill

There are two skill-measurement systems in serious competitive backgammon. ELO is the rating-based system inherited from chess, refined for backgammon's higher variance. Performance Rating (PR) is the error-rate-based system that emerged in the late 1990s once neural-network engines became strong enough to serve as evaluation oracles. Both are in active use, and a serious competitive profile carries both: an ELO rating for matches played, and a PR for the quality of play across them.

This page covers the formal definition of each, the modern reference benchmarks, the historical PR/ELO profile of the original GamesGrid bot family (GG Forever, GG Raccoon, GG Otter, GG Weasel, GG Chipmunk, MrHyperBot), and the framework the 2026 platform uses to surface these metrics in player profiles and bot opponent selection.

For the underlying engines and lineage, see Bots & AI. For the math that PR depends on (equity, match equity, gammon prices), see the mathematics pillar.


1. Performance Rating (PR)

Performance Rating measures the quality of play across a match (or sequence of matches) by comparing each move and each cube decision against a reference bot's evaluation. The metric is expressed in millipoints lost per move, abbreviated mEMG (milli-Equivalent-Money-Game), and lower is better. A PR of zero is perfect play: the player matched the reference bot's recommendation on every decision.

1.1 The PR Calculation

Formally, for a match of $N$ non-forced moves:

$$\text{PR} = \frac{1000}{N} \sum_{i=1}^{N} \left( E_i^{\text{best}} - E_i^{\text{actual}} \right)$$

where $E_i^{\text{best}}$ is the equity (in money-game points) of the best move available on turn $i$, and $E_i^{\text{actual}}$ is the equity of the move the player actually made. The factor of 1000 converts points to millipoints.

Cube decisions are folded into the same formula by treating them as moves with their own equity loss: an incorrect take is the equity loss compared to a correct drop (or vice versa), divided across the move count of the surrounding game in the standard convention.
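A minimal sketch of the calculation above; the function name and the decision-list representation are illustrative, and a real analyser would pull the equities from the reference bot's evaluation:

```python
def performance_rating(decisions):
    """PR in mEMG over a list of (best_equity, actual_equity) pairs.

    Each pair covers one non-forced checker play or cube decision,
    with equities in money-game points from the reference evaluation.
    """
    if not decisions:
        raise ValueError("need at least one non-forced decision")
    total_loss = sum(best - actual for best, actual in decisions)
    return 1000.0 * total_loss / len(decisions)  # points -> millipoints

# One 60-millipoint blunder over three decisions:
pr = performance_rating([(0.512, 0.452), (0.130, 0.130), (-0.220, -0.220)])
print(round(pr, 3))  # 20.0 mEMG
```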

The reference bot for the modern PR standard is eXtreme Gammon (XG2) at 4-ply evaluation, often with truncated rollouts for high-leverage positions. GNU Backgammon at 2-ply or 3-ply is the open-source equivalent and is used by most amateur analysis tools. The two reference oracles produce PRs within roughly 0.5 mEMG of each other across typical matches.

1.2 Reference Benchmarks

The following PR benchmarks are widely cited in modern competitive backgammon. They describe the typical PR of players in each skill band when their matches are analysed against XG2:

| Skill band | PR (mEMG) | Examples |
| --- | --- | --- |
| World-class | < 3.0 | Top tournament finishers, World Championship contenders. |
| Expert | 3.0 – 5.0 | Strong tournament regulars; PR 4.0 is the loose threshold for "expert". |
| Strong club player | 5.0 – 8.0 | Active competitive player with significant study background. |
| Intermediate | 8.0 – 12.0 | Solid recreational player with structured study. |
| Casual / beginner | 12.0+ | Less-experienced player; substantial scope for improvement. |

These are not strict cutoffs. PR varies match-to-match: a strong player can turn in a "PR 8" match through poor concentration, and an intermediate player can produce a "PR 4" match by getting lucky with simple positions. The signal becomes reliable across 50+ matches.

1.3 PR Variance and Sample Size

A single match has a PR standard deviation of roughly 2.0 mEMG for typical match lengths (5 to 9 points), driven primarily by the random selection of positions encountered. The standard error of an aggregate PR across $n$ matches falls as $1/\sqrt{n}$, so:

| Sample size (matches) | PR standard error |
| --- | --- |
| 1 | ~2.0 mEMG |
| 10 | ~0.63 mEMG |
| 50 | ~0.28 mEMG |
| 100 | ~0.20 mEMG |
| 500 | ~0.09 mEMG |
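
The $1/\sqrt{n}$ scaling is straightforward to reproduce; the 2.0 mEMG single-match constant is the estimate quoted above, and the function name is illustrative:

```python
import math

SINGLE_MATCH_SD = 2.0  # mEMG, typical for 5-9 point matches

def pr_standard_error(n_matches):
    """Standard error of an aggregate PR over n independent matches."""
    return SINGLE_MATCH_SD / math.sqrt(n_matches)

for n in (1, 10, 50, 100, 500):
    print(f"{n:>4} matches: ~{pr_standard_error(n):.2f} mEMG")
```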

A reliable PR comparison between two players requires roughly 100+ matches against comparable opponent pools. This is why serious skill-assessment tournaments report rolling PR over the player's last $N$ matches rather than a single-match PR: a single-match number is statistically noisy enough to be misleading.


2. ELO

The ELO rating system was developed for chess by Arpad Elo in 1960 and adapted for backgammon's higher variance by Kevin Bastian and others in the early FIBS days (1992–1995). FIBS used the modified formula that became the de facto standard across online backgammon servers, including GamesGrid.

2.1 The FIBS ELO Formula

Backgammon ELO adjusts the standard chess formula by weighting rating changes by match length: a longer match is a more reliable skill measurement and therefore moves the ratings more.

$$E = \frac{1}{1 + 10^{(R_o - R_p)\sqrt{M}/2000}}$$

where $E$ is the player's expected match-winning probability, $R_o$ is the opponent's rating, $R_p$ is the player's rating, and $M$ is the match length in points.

The rating change after a match:

$$\Delta R = K \cdot \sqrt{M} \cdot (\text{actual} - E)$$

where actual is 1 for a win and 0 for a loss, and $K$ is a constant: typically 4 for established players and higher for new players, so that new ratings stabilise faster.

Under this formula, a 100-point rating differential corresponds to roughly a 53% expected win probability in a single 1-point match; the match-length weighting means the same differential in a 13-point match implies a substantially larger (~60%) expected win probability for the favourite.
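The two formulas translate directly into code; the function names and the established-player default $K = 4$ are drawn from the description above:

```python
import math

def expected_win_prob(r_player, r_opponent, match_length):
    """FIBS-style expected match-win probability for the player."""
    diff = r_opponent - r_player
    return 1.0 / (1.0 + 10.0 ** (diff * math.sqrt(match_length) / 2000.0))

def rating_change(r_player, r_opponent, match_length, won, k=4.0):
    """Rating delta after one match (won: True/False); K=4 for established players."""
    e = expected_win_prob(r_player, r_opponent, match_length)
    return k * math.sqrt(match_length) * ((1.0 if won else 0.0) - e)

# A 100-point favourite's edge grows with match length:
print(round(expected_win_prob(1600, 1500, 1), 3))   # 0.529
print(round(expected_win_prob(1600, 1500, 13), 3))  # 0.602
```

Note that the loser drops exactly what the winner gains, so the rating pool is zero-sum.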

2.2 ELO Benchmarks

GamesGrid (1996–2008) used a base rating of 1500 for new accounts, matching the FIBS convention. The original published rating bands:

| ELO band | Skill bracket |
| --- | --- |
| Below 1300 | Novice |
| 1300 – 1500 | Beginner |
| 1500 – 1700 | Intermediate |
| 1700 – 1900 | Strong / Expert |
| 1900 – 2100 | Master / Tournament regular |
| 2100+ | Top-tier / international competitor |

These bands have remained relatively stable across the 1996–2026 period. The 2026 platform uses the same base rating of 1500 with the FIBS-style match-length-weighted ELO formula, ensuring rating continuity for returning veterans of the original server.


3. The Original GamesGrid Bot Profiles

The 1996–2008 GamesGrid bot family is documented in detail on the Bots & AI page. The bots were graded by induced error frequency applied to a common GNUbg-based evaluation engine, producing a controlled skill ladder. Their published ELO ratings (low / average / high across many thousands of matches) are quoted verbatim from the archived 2003–2004 GamesGrid FAQ and are authentic historical data:

| Bot | Evaluation depth | ELO low | ELO avg | ELO high | Model-derived PR estimate |
| --- | --- | --- | --- | --- | --- |
| GG Forever | 2-ply, Life Members only | 1850 | 1920 | 2114 | ~4 – 5 |
| GG Raccoon | 0-ply (no lookahead) | 1850 | 1920 | 2114 | ~5 – 6 |
| GG Otter | 0-ply + induced errors | 1543 | 1701 | 1827 | ~9 – 11 |
| GG Weasel | 0-ply + frequent induced errors | 1410 | 1516 | 1652 | ~13 – 16 |
| GG Chipmunk | 0-ply + high-frequency induced errors | 1171 | 1275 | 1487 | ~18 – 22 |
| MrHyperBot | Hypergammon database (exact) | 1797 | 1953 | 2109 | Perfect play in the hypergammon variant |

Caveat on the PR column. The PR-equivalent values are model-derived estimates inferred from the bots' published ELO ranges and from the typical relationship between ELO and PR observed in modern competitive play. The original GamesGrid match logs are not currently accessible for direct XG2 re-evaluation; if those logs become available, the column will be replaced with measured values. Treat the PR estimates as rough banding rather than a precise rating.

One further model-derived quantity follows from the table:

Standard Deviation of Error Rates (model-derived)

The original bots' induced-error patterns were stochastic, not deterministic: a fixed proportion of moves was forced to suboptimal alternatives, chosen at random from candidates whose equity loss fell within a target range. The expected per-match standard deviation of the resulting error rate (in mEMG) is governed by the bot's mean error rate and the typical SD/mean ratio for stochastic induced-error systems; the model-derived estimates are:

| Bot | Mean error rate (estimate, mEMG) | SD per match (estimate, mEMG) |
| --- | --- | --- |
| GG Forever (2-ply) | ~4.5 | ~1.0 |
| GG Raccoon (0-ply) | ~5.0 | ~1.2 |
| GG Otter | ~10.0 | ~2.0 |
| GG Weasel | ~15.0 | ~2.5 |
| GG Chipmunk | ~20.0 | ~3.0 |

Caveat. These are model-derived estimates based on the typical SD/mean ratio for stochastic induced-error systems and the bots' published ELO ranges, not direct measurements. The higher-error bots are expected to show higher match-to-match variance, which is consistent with the substantial "rating swing" behaviour documented in the bots' operational ELO histories.
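The stochastic induced-error mechanism described above can be sketched as follows; the function name, parameter names, and loss-window convention are illustrative assumptions, not the original GamesGrid implementation:

```python
import random

def induced_error_move(candidates, error_rate, loss_window, rng=random):
    """Pick a move from (move, equity) candidates.

    With probability error_rate, play a random suboptimal alternative whose
    equity loss (vs. the best move) falls inside loss_window = (lo, hi);
    otherwise play the best move. Illustrative reconstruction only.
    """
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    best_move, best_eq = ranked[0]
    if rng.random() < error_rate:
        lo, hi = loss_window
        errors = [m for m, eq in ranked[1:] if lo <= best_eq - eq <= hi]
        if errors:
            return rng.choice(errors)
    return best_move

# error_rate=0 always plays best; higher rates err whenever a candidate
# falls inside the target loss window.
moves = [("24/18", 0.050), ("24/20 13/11", 0.012), ("13/7", -0.310)]
print(induced_error_move(moves, 0.0, (0.01, 0.10)))  # 24/18
```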


4. The 2026 Bot Framework

The new GamesGrid retains the design philosophy of the original bot family: a graded skill ladder, named characters, induced-error mechanisms for the weaker bots, and full-strength NN play for the top of the ladder. The original GG Forever, GG Raccoon, GG Otter, GG Weasel, GG Chipmunk, and MrHyperBot are all reconstituted on the new engine, calibrated where possible to match their historical playing fingerprints. They are joined by a new generation of named bot opponents that fill out the full skill ladder, with character profiles spanning beginner through world-championship calibre.

The specific bot roster, the structure of the Career Mode bot leagues that they populate, and the precise PR profile of the strongest 2026 bots will be published closer to launch.

Two commitments hold publicly today:

  1. PR is reported, not hidden. Player profile pages display rolling PR alongside ELO, calculated against the same reference oracle the wider community uses.
  2. No bot is calibrated to lose. The graded weak bots play their published level: they make errors at their published rate, not more, not less. There is no algorithmic "adjustment" of bot strength based on player frustration or session retention metrics. The dice are uniform and the bot's evaluation is its published evaluation, every game, for every player.
