News You Can Use: Stockfish Normalizes Its Evaluation
Hopefully the end of a human +/- translating to +3.5 for Stockfish.
Before Stockfish took a page out of the AlphaZero playbook and adopted a neural-net evaluation, its numbers were in keeping with human understanding. If White had an extra pawn and everything else was roughly normal, one would expect an evaluation around +1, give or take a tenth or two. After it joined the neural net revolution, the evaluations changed dramatically: advantages that seemed clear, or on the cusp between a clear and a decisive plus, shot way up. I've repeatedly found that advantages older engines might have assessed around +1.5 had more than doubled, as if an extra pawn and a bit of extra comfort translated to an extra piece or more.
Every so often I check on the Stockfish development page to try out the latest version of their engine; today, I did so and found this:
Normalize evaluation
Normalizes the internal value as reported by evaluate or search
to the UCI centipawn result used in output. This value is derived from
the win_rate_model() such that Stockfish outputs an advantage of
"100 centipawns" for a position if the engine has a 50% probability to win
from this position in selfplay at fishtest LTC time control.
The reason to introduce this normalization is that our evaluation is, since NNUE,
no longer related to the classical parameter PawnValueEg (=208). This leads to
the current evaluation changing quite a bit from release to release, for example,
the eval needed to have 50% win probability at fishtest LTC (in cp and internal Value):
June 2020 : 113cp (237)
June 2021 : 115cp (240)
April 2022 : 134cp (279)
July 2022 : 167cp (348)
With this patch, a 100cp advantage will have a fixed interpretation,
i.e. a 50% win chance. To keep this value steady, it will be needed to update the win_rate_model()
from time to time, based on fishtest data. This analysis can be performed with
a set of scripts currently available at https://github.com/vondele/WLD_model
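To make the mechanics concrete, here is a minimal Python sketch of the idea, under stated assumptions. It is not Stockfish's actual code (which is C++ and fits its win_rate_model() coefficients to fishtest self-play data, varying them with game phase); the logistic win-rate curve and the rescaling are the real ingredients, but the constant B below is a placeholder, and A is simply the July 2022 internal value quoted above.

```python
import math

# Illustrative constants, not Stockfish's fitted coefficients.
A = 348.0  # internal eval with a 50% self-play win rate (July 2022 figure above)
B = 90.0   # spread of the logistic curve (placeholder value)

def win_rate(v: float) -> float:
    """Modeled probability of winning from a position with internal eval v."""
    return 1.0 / (1.0 + math.exp((A - v) / B))

def normalized_cp(v: float) -> int:
    """Rescale the internal eval to UCI centipawns so that the eval with a
    50% win rate always prints as exactly 100cp."""
    return round(100.0 * v / A)

print(win_rate(A))         # 0.5 by construction
print(normalized_cp(A))    # 100 -- a 100cp advantage means a 50% win chance
print(normalized_cp(174))  # 50  -- half the 50%-win-rate eval prints as 50cp
```

The point of the rescaling is that whatever the net learns internally, the printed number keeps a fixed meaning: 100cp always corresponds to a 50% chance to win in self-play at the fishtest LTC time control, even as the fitted curve is refreshed from time to time.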
This is good news, twice over. First, the evals are more in keeping with common sense (which is what we as humans need); second, they will be consistent from release to release. Regarding "common sense": I just compared the new version with a very recent one, and the difference was dramatic. A position the old version gave as +4.2 was around +2.6 on the new one, and another that was around +2.5 went down to +1.5 or +1.6, which also squared with my sense of the position.
So, kudos to Stockfish and Joost VandeVondele in particular for doing this. The Stockfish folks, along with Mark Crowther of TWIC, have made the chess world an immeasurably better place over the years for their work, and at no cost to us.
Love the shout-out to Mark Crowther. The issue I have with an eval pegged to a fixed win percentage is that it does not hold across all human skill levels (in games against players of the same rating R). This is the subject of my article https://rjlipton.wpcomstaging.com/2018/09/07/sliding-scale-problems/, whose main graphic heads the Time Magazine Online article on me. But this may still work out fine. I certainly welcome the lowering of evals, especially in unbalanced positions, for which I currently have to make an ad-hoc adjustment to balance the sample sizes.