After the first two rounds of the Sudoku GP, I found that some puzzles' point values didn't reflect their real difficulty. Some values were set much higher than the puzzles warranted, such as the Extra Regions Sudoku and the Irregular Sudoku in Sudoku GP Round 2, resulting in 8 competitors with 1000+ points and 42 competitors with 719+ points (the round's nominal total).
In my opinion, the point values should reflect puzzles' real difficulty, but they should not be inflated so much that they scare competitors away. The GP should be a test of brain power, not psychology.
I suggest the authors recruit more testers and estimate the puzzles' difficulty more precisely. The total points can change, but the scale should remain fixed at 10 pts/min.
Efficiency should be considered
When determining the point values of puzzles, testers should consider not only the actual time they spend on each puzzle, but also their efficiency at earning points. This is because not all testers are top solvers, and some testers' efficiency at earning points is lower than 10 pts/min. If efficiency is not considered, some puzzles will inevitably end up with point values higher or lower than their actual difficulty.
For example, suppose a tester spends 8 minutes solving a puzzle. If he is a world-class solver who earns 10 points per minute in the GP or at WSC/WPC, then his calculated point value is 80 points. However, if he earns only 7 points per minute in competitions, then his calculated point value should be only 56 points. The final point value would then be the average of all testers' calculated values.
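The calculation described above can be sketched as follows. This is a minimal illustration, not an official GP formula; the function name and the pts/min rates are just the example figures from this post.

```python
def puzzle_point_value(testers):
    """Estimate a puzzle's point value as the average, over all testers, of
    (solve time in minutes) x (that tester's personal pts/min earning rate).

    `testers` is a list of (solve_minutes, pts_per_minute) pairs.
    """
    return sum(minutes * rate for minutes, rate in testers) / len(testers)

# The worked example: one tester takes 8 minutes.
print(puzzle_point_value([(8, 10.0)]))             # 80.0 for a 10 pts/min solver
print(puzzle_point_value([(8, 7.0)]))              # 56.0 for a 7 pts/min solver
print(puzzle_point_value([(8, 10.0), (8, 7.0)]))   # 68.0 averaged over both
```

Weighting each tester by their own rate means a slower tester's long time no longer inflates the value, because their time is multiplied by a correspondingly lower rate.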
Trial should also be considered
Moreover, because the GP and WSC/WPC don't care how competitors reach the solution, a competitor may sometimes use trial and error to get the solution, which significantly reduces his/her solving time. Therefore, to balance this, point values should be reduced a little if a puzzle is easy via trial and error, even if it is hard via pure logic.
Catch 22: Puzzle Difficulty vs. Tester's Solving Skill
Couldn't agree more with the point that the GP should be a test of brain power, not psychology. I've always interpreted that as meaning you want a round without too much "points inequality" - which you can measure by comparing the number of puzzles to their share of the total point value. If 3-4 puzzles make up 50% of the points, that doesn't feel desirable to me. It's even worse when the points don't seem accurate. I haven't had a go at round 2, so can't really comment more than that.
I'll add the observation that you can never really know how hard a puzzle is unless you have a good idea of the solving ability of your testers. But then you can never really know how good your testers are unless you have a good idea of the difficulty of the puzzles you give them to solve. The same problem applies to everyone taking the round. It's a catch-22.
The GP has a further problem to overcome: if it is to remain a credible competition, then the scores from one round to the next need to be directly comparable, given the best-6-from-8 mechanic. The scoring this round would seem to be inflated compared to the last couple of years! Someone like me who has missed round 2 is presumably now at a big disadvantage for the rest of the series, assuming the scoring returns to a more stable level.
I don't think I've clearly understood how you would use solving efficiency directly in testing (rather than just testers' times) - some testers will clearly have favourite types and other types they struggle with, and determining a definitive value for an individual's solving efficiency isn't an easy thing to do. The current testing approach ought to work if you have a sufficient number of testers, apply a stable benchmark (I often use the median), and apply some expert judgement to account for variance in testing times (which often arises where there is a quick way to guess or use uniqueness).
For what it's worth, I think the puzzle GP has a harder problem with puzzle variance compared to the sudoku GP. I think it also has a bigger problem (at least as far as assigning point values goes) with the top solvers accelerating their points efficiency on the hardest puzzles compared to everyone else. Maybe this is where I need to think about solving efficiency (which has an inverse relationship to solving time) a bit more...
Puzzle points
I find this discussion very interesting, because the problem is probably more structural than it seems.
In theory, assigning points based on testers’ solving times sounds straightforward. In practice, it is extremely hard to calibrate difficulty before the round, because:
Testers have different strengths and preferences.
Some puzzles are highly sensitive to “seeing the key idea”.
Some are much easier with trial / uniqueness.
And efficiency varies significantly between individuals.
There is also another dimension: player behaviour during the round.
We have around 1000 competitors. That is a very strong statistical sample — but only for the round as a whole, not for individual puzzles. We don’t know:
How many people seriously attempted a puzzle.
How many skipped it because it looked scary.
How many ran out of time.
How many got stuck and abandoned it.
Only solved puzzles are recorded. So participation bias makes post-hoc per-puzzle evaluation tricky.
From my personal experience, I often get stuck on one puzzle and I genuinely don’t know whether:
the puzzle is underrated,
or it simply doesn’t match my strengths.
There are types I can solve at close to “top level speed”.
There are others where I am much slower — not necessarily because they are harder, but because there is less training material available and I have weaker pattern recognition in that type.
Because of that, I wonder whether we could partially normalize at the round level rather than trying to perfectly calibrate each puzzle.
One possible approach:
Keep the pre-round testing process as is (based on tester medians etc.).
After the round, compute round-wide statistics:
median score,
standard deviation,
score distribution curve,
proportion of full solvers.
Compare these values to historical GP rounds.
If the round is statistically inflated or deflated, apply a global normalization factor to all scores in that round.
This would:
Preserve relative ordering between competitors.
Reduce the impact of one “inflated” or “deflated” round on the overall series ranking.
Avoid trying to over-engineer per-puzzle efficiency corrections, which are extremely hard to measure objectively.
If we want to go one step further, one could even consider something similar to rating systems (Elo-like thinking), where puzzle difficulty is gradually estimated over multiple rounds rather than fixed in advance. But that would require structural changes.
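In the spirit of the Elo-like idea above, the simplest gradual estimator is just an exponential moving average that nudges a puzzle type's difficulty estimate toward what each new round's data suggests. This is my own hypothetical sketch of "Elo-like thinking", not a worked-out rating system; the update rate k is an assumed tuning parameter.

```python
def elo_like_update(current_points, observed_points, k=0.2):
    """Move the stored difficulty estimate a fraction k of the way toward
    the value implied by the latest round, instead of fixing it in advance."""
    return current_points + k * (observed_points - current_points)

# A type valued at 80 points that the latest round suggests is worth 100:
print(elo_like_update(80, 100))  # 84.0 - the estimate drifts upward gradually
```

Over many rounds the estimate converges toward the observed difficulty while damping the noise of any single round, which is exactly the structural change this would require: difficulty becomes a running state, not a per-round decision.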
In my opinion, the key is not to perfectly predict puzzle difficulty (which may be impossible), but to ensure that:
rounds are statistically comparable,
and no single round becomes disproportionately decisive due to calibration drift.
Curious to hear what others think about post-round normalization rather than only pre-round estimation.
Personally, I treat the GP primarily as fun — I’m not really focused on comparing myself with others, but rather on whether I enjoyed the round and whether the puzzles were satisfying. If I manage to solve many, that’s great. If I find one particularly interesting puzzle that takes me longer but gives real satisfaction, that’s equally fine. I’m joining this discussion mainly because the scoring question itself is intellectually interesting to me — especially from a mathematical perspective.