Inferring Boulder Grades

The problem with grades

Every boulder problem carries a grade — a single number meant to capture how hard the climb is. In practice, grades are noisy. The first ascensionist proposes one, later climbers agree or disagree, and the consensus drifts over the years. Two problems sharing the same grade can feel wildly different, and a “soft” 7A may be easier than a “hard” 6C+.

The grade is a label. What we actually want is the underlying difficulty — a continuous, comparable measure that comes from data, not consensus.

The data

The model is trained on ~1.5 million ascent logs scraped from public climbing databases, covering ~50,500 boulders across thousands of crags, attempted by ~31,000 climbers. Each log records who climbed what, and whether they flashed (sent first try), sent (after multiple attempts), or just tried (logged an attempt without success).

For every observed (climber, boulder) pair, we also sample negative observations: boulders in crags the climber visited but didn’t log. These negatives are ambiguous — they could mean “didn’t try” or “tried and failed but didn’t log it” — and the model explicitly handles this ambiguity. During training, negatives are subsampled at a 10:1 negative-to-positive ratio.
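The negative sampling described above can be sketched roughly as follows. This is a minimal illustration, not the actual pipeline: the function name `sample_negatives`, the `crag_of` mapping, and the toy data are all hypothetical, and the 10:1 ratio is applied per climber here for simplicity.

```python
import random

def sample_negatives(logged, crag_of, ratio=10, seed=0):
    """Sample unlogged boulders from crags the climber has visited.

    logged  : set of boulder ids the climber logged (positives)
    crag_of : dict boulder id -> crag id (hypothetical data layout)
    ratio   : negatives per positive (10 in the text)
    """
    rng = random.Random(seed)
    visited = {crag_of[b] for b in logged}
    # Candidates: boulders in visited crags that the climber did not log.
    candidates = sorted(b for b, c in crag_of.items()
                        if c in visited and b not in logged)
    k = min(len(candidates), ratio * len(logged))
    return rng.sample(candidates, k)

# Toy data: 3 crags with 40 boulders each; the climber logged two in crag 0.
crag_of = {b: b // 40 for b in range(120)}
logged = {3, 17}
negatives = sample_negatives(logged, crag_of)
```

Each sampled pair is only a candidate negative: nothing in the sample says whether the climber skipped the boulder or tried it and failed, which is the ambiguity the model has to account for.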

A Bayesian model of bouldering

I frame every ascent as a sequential decision cascade:

Try? → Send? → Flash?

A climber first decides whether to attempt a problem. If they try, they may succeed. If they succeed, they may have done so on the first go. Each step is a logistic function of latent traits:

Climber ability ($\theta_i$) — how strong the climber is.
Climber prolificity ($\alpha_i$) — how likely a climber is to try anything, regardless of difficulty.
Boulder difficulty ($d_j$) — the latent, inferred hardness of the problem.
Boulder popularity ($\pi_j$) — how appealing a boulder is, all else equal.

The try probability is the heart of the model:

$$ \text{logit}(P_{\text{try}}) = \alpha_i + \pi_j - \gamma_i \cdot (\theta_i - d_j - \mu_i)^2 $$

The quadratic penalty $(\theta_i - d_j - \mu_i)^2$ captures the Goldilocks effect: climbers don’t try problems uniformly; they gravitate toward boulders near their own level. A strong climber skips the V0 warm-ups; a beginner doesn’t project V10. For a given climber (and $\gamma_i > 0$), the try probability peaks at $d_j = \theta_i - \mu_i$ and falls off on either side at a rate set by $\gamma_i$. The per-climber parameters $\gamma_i$ (selectivity) and $\mu_i$ (preferred difficulty offset) let each climber’s window vary: a large $\gamma_i$ describes a specialist who only tries things at their limit, a small one a generalist.
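To make the shape of the try probability concrete, here is a small sketch that evaluates the formula above for two hypothetical climbers who differ only in selectivity. The parameter values are invented for illustration and are not fitted estimates.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def p_try(theta, d, alpha, pi, gamma, mu):
    """P(try) from logit = alpha + pi - gamma * (theta - d - mu)^2."""
    return sigmoid(alpha + pi - gamma * (theta - d - mu) ** 2)

# Illustrative values only (not fitted estimates).
theta, alpha, pi, mu = 2.0, -1.0, 0.5, 0.0
difficulties = np.array([0.0, 1.0, 2.0, 3.0, 4.0])

generalist = p_try(theta, difficulties, alpha, pi, gamma=0.2, mu=mu)
specialist = p_try(theta, difficulties, alpha, pi, gamma=2.0, mu=mu)
```

Both curves peak at the boulder whose difficulty equals `theta - mu`, but the specialist's probability drops off much faster away from that peak.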

The send and flash probabilities are simpler — pure ability vs. difficulty:

$$ \text{logit}(P_{\text{send}} \mid \text{try}) = \theta_i - d_j $$

$$ \text{logit}(P_{\text{flash}} \mid \text{send}) = \theta_i - d_j - \beta $$

A global intercept $\beta$ controls how much harder flashing is than sending. All latent variables have hierarchical priors: climber abilities are drawn from a population distribution, and boulder difficulties are nested within sectors (3,927 of them), so problems in the same area share a difficulty baseline. This regularizes estimates for boulders with few ascents.
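Chaining the three conditional probabilities gives four mutually exclusive outcomes for a (climber, boulder) pair. The sketch below shows that bookkeeping under the equations above; the function name and input values are illustrative. How unlogged pairs are split between "not tried" and "tried without sending" is not spelled out in the text, so it is left out here.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def outcome_probs(p_try, theta, d, beta):
    """Probabilities of the four mutually exclusive cascade outcomes."""
    p_send = sigmoid(theta - d)            # P(send | try)
    p_flash = sigmoid(theta - d - beta)    # P(flash | send)
    return {
        "not_tried": 1.0 - p_try,
        "tried_no_send": p_try * (1.0 - p_send),
        "sent_not_flash": p_try * p_send * (1.0 - p_flash),
        "flash": p_try * p_send * p_flash,
    }

# Illustrative inputs: a climber exactly at the boulder's level.
probs = outcome_probs(p_try=0.3, theta=2.0, d=2.0, beta=1.0)
```

With a positive `beta`, a flash is always less likely than a send for the same climber and boulder, which is the role the text assigns to the global intercept.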

Training at scale

With ~82,000 latent parameters and millions of observations, MCMC would be impractical. The model is fit via Automatic Differentiation Variational Inference (ADVI) with a full-rank Gaussian approximation, using minibatch stochastic optimization. Training ran for 500,000 iterations with the Adam optimizer (lr = 0.003, batch size = 4,096). After convergence, 3,000 posterior draws were sampled to produce credible intervals for every estimate.
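The mechanics of minibatch variational inference can be shown on a deliberately tiny problem. The sketch below is not the model's training code: it infers a single boulder difficulty from synthetic send outcomes, assumes a standard-normal prior, and uses a one-dimensional Gaussian approximation (where mean-field and full-rank coincide). The Adam update is written out by hand in NumPy, with a toy learning rate and batch size.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: N send/no-send outcomes on one boulder with true difficulty 0.5,
# attempted by climbers of known ability theta_n (an illustrative setup).
N, true_d = 2000, 0.5
theta = rng.normal(0.0, 1.0, size=N)
y = (rng.random(N) < 1.0 / (1.0 + np.exp(-(theta - true_d)))).astype(float)

def grad_log_joint(d, idx):
    """Minibatch estimate of d/dd [log p(y | d) + log Normal(d; 0, 1)]."""
    p = 1.0 / (1.0 + np.exp(-(theta[idx] - d)))
    lik = -(y[idx] - p).sum() * (N / len(idx))  # rescale batch to full data
    return lik - d                               # assumed standard-normal prior

# Variational family q(d) = Normal(m, exp(s)^2); params = [m, s].
params = np.array([0.0, 0.0])
m1, m2 = np.zeros(2), np.zeros(2)                # Adam moment estimates
lr, b1, b2, eps_adam, batch = 0.01, 0.9, 0.999, 1e-8, 256

for t in range(1, 3001):
    idx = rng.choice(N, size=batch, replace=False)
    eps = rng.normal()
    sigma = np.exp(params[1])
    d = params[0] + sigma * eps                  # reparameterized draw from q
    g = grad_log_joint(d, idx)
    # ELBO gradient: pathwise term plus entropy gradient (d/ds of s is 1).
    grad = np.array([g, g * eps * sigma + 1.0])
    m1 = b1 * m1 + (1 - b1) * grad
    m2 = b2 * m2 + (1 - b2) * grad ** 2
    step = (m1 / (1 - b1 ** t)) / (np.sqrt(m2 / (1 - b2 ** t)) + eps_adam)
    params += lr * step                          # gradient ascent on the ELBO

post_mean, post_sd = params[0], np.exp(params[1])
draws = post_mean + post_sd * rng.normal(size=3000)
lo, hi = np.percentile(draws, [2.5, 97.5])       # 95% credible interval
```

The last three lines mirror the final step in the text: once the approximation has converged, posterior draws are cheap, and credible intervals are simply percentiles of those draws.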

The key validation: how well does the inferred difficulty $d_j$ predict the community grade? A weighted regression of community grade onto $d_j$ yields R² = 0.75. Including popularity $\pi_j$ lifts this to R² = 0.79, suggesting that popular boulders tend to receive slightly inflated grades.
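A weighted R² comparison of this kind can be sketched as below. The data are synthetic stand-ins, and the choice of ascent counts as weights is an assumption: the text does not say which weights the regression uses.

```python
import numpy as np

def weighted_r2(X, y, w):
    """Weighted least squares fit of y on X (with intercept); returns R^2."""
    A = np.column_stack([np.ones(len(y)), X])
    sw = np.sqrt(w)
    coef, *_ = np.linalg.lstsq(A * sw[:, None], y * sw, rcond=None)
    resid = y - A @ coef
    ybar = np.average(y, weights=w)
    return 1.0 - np.sum(w * resid ** 2) / np.sum(w * (y - ybar) ** 2)

# Synthetic stand-ins for d_j, pi_j, community grade and ascent counts.
rng = np.random.default_rng(0)
n = 500
d = rng.normal(size=n)
pi = rng.normal(size=n)
grade = 2.0 * d + 0.5 * pi + rng.normal(scale=0.8, size=n)
w = rng.integers(1, 200, size=n).astype(float)   # assumed weights: ascents

r2_d = weighted_r2(d[:, None], grade, w)
r2_d_pi = weighted_r2(np.column_stack([d, pi]), grade, w)
```

Because the second model nests the first, its weighted R² can only stay the same or go up; the interesting question is by how much.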

The model agrees with the community — mostly

Plotting inferred difficulty against the community V-grade shows a strong linear relationship:

Predicted difficulty $d_j$ (Elo) vs. community V-grade, R² = 0.75. Each dot is a boulder, sized by its number of logged ascents. The line connects the weighted median predicted difficulty for each grade, weighted by number of ascents. The model recovers a strong linear signal from ascent patterns alone.

But the deviations are the interesting part. Dots that sit above the median line for their grade are sandbags — harder than the consensus suggests. Those below are soft touches.
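The sandbag and soft-touch residuals can be computed as the gap between each boulder's inferred difficulty and the ascent-weighted median of its grade. The sketch below uses made-up numbers; the function names and the particular weighted-median convention (first value at which cumulative weight reaches half the total) are assumptions for illustration.

```python
import numpy as np

def weighted_median(values, weights):
    order = np.argsort(values)
    v, w = values[order], weights[order]
    cum = np.cumsum(w)
    # First value at which cumulative weight reaches half of the total.
    return v[np.searchsorted(cum, 0.5 * cum[-1])]

def grade_residuals(difficulty, grade, ascents):
    """Difficulty minus the ascent-weighted median difficulty of its grade."""
    resid = np.empty_like(difficulty, dtype=float)
    for g in np.unique(grade):
        mask = grade == g
        med = weighted_median(difficulty[mask], ascents[mask])
        resid[mask] = difficulty[mask] - med
    return resid

# Made-up boulders: inferred difficulty, community grade, logged ascents.
difficulty = np.array([1.0, 1.2, 2.0, 3.0, 3.1, 2.4])
grade = np.array([5, 5, 5, 6, 6, 6])
ascents = np.array([10, 30, 5, 8, 8, 40])

resid = grade_residuals(difficulty, grade, ascents)
```

A positive residual marks a sandbag candidate and a negative one a soft touch; sorting by the absolute residual surfaces the boulders where the model disagrees most with the consensus.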

Popular boulders, inflated grades

Predicted popularity $\pi_j$ vs. predicted difficulty $d_j$. Each dot is a boulder: color is the consensus grade, size is the number of logged ascents.

Boulder popularity $\pi_j$ and difficulty $d_j$ are only weakly correlated, which is consistent with the model disentangling the two. But the fact that adding $\pi_j$ to the grade regression improves R² from 0.75 to 0.79 suggests a real effect: problems that attract more traffic tend to carry slightly inflated grades. The large, light-colored dots toward the upper-right (popular, hard problems) show where consensus may drift upward as more climbers log sends.

The strongest climbers — found by the model

The model has never seen a competition result, a podium, or a name. It only knows ascent logs. Yet when we rank climbers by posterior ability $\theta_i$, the top of the list reads like a who’s-who of professional bouldering. The ranking uses the lower bound of the 95% credible interval, not the posterior mean — a climber with few logged ascents needs an exceptional record to rank high, because the model is less certain about them.

Rank	Climber	Ability	95% low	Ascents logged
1	Jules Marchaland	4.29	4.11	53
2	Vadim Timonov	4.04	3.89	166
3	Noah Wheeler	4.05	3.85	93
4	Andrew Nimmer	3.86	3.73	336
5	Matt Fultz	3.72	3.64	371
6	Mejdi Schalck	3.73	3.55	53
7	Adam Ondra	3.70	3.55	156
8	William Bosi	3.72	3.55	41
9	Kali Tolsma	3.81	3.47	23
10	Fabrice Landry	3.72	3.44	334
11	Pietro Vidi	3.52	3.41	180
12	Yannick Flohé	3.64	3.40	54
13	Nimrod Marcus	3.63	3.37	50
14	James Webb	3.41	3.36	916
15	David Firnenburg	3.38	3.32	393
16	Thilo Jeldriksønn	3.45	3.32	159
17	Peter Satt	3.38	3.30	175
18	Solomon Kemball	3.58	3.30	45
19	Keita Mogaki	3.44	3.30	91
20	Piotr Schab	3.44	3.30	154

This is remarkably good signal. Jules Marchaland, a World Cup finalist, tops the list with only 53 logged ascents, and Ondra, Bosi, Schalck, Fultz, and Flohé follow: the model recovers the pro circuit from nothing but tick lists. Notice Kali Tolsma: the fifth-highest posterior mean in the table (3.81), but with only 23 logged ascents the interval is wide and the ranking drops them to ninth.
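Ranking by the lower bound of a credible interval takes only a few lines once posterior draws are available. The sketch below uses two hypothetical climbers with synthetic draws; the function name and the numbers are illustrative.

```python
import numpy as np

def rank_by_lower_bound(draws, level=0.95):
    """draws: array (n_draws, n_climbers) of posterior samples of ability."""
    tail = 100.0 * (1.0 - level) / 2.0
    low = np.percentile(draws, tail, axis=0)   # lower end of the interval
    order = np.argsort(-low)                   # highest lower bound first
    return order, low

rng = np.random.default_rng(0)
# Hypothetical climbers: A has the higher mean but few ascents (wide
# posterior); B has a slightly lower mean and many ascents (narrow posterior).
draws = np.column_stack([
    rng.normal(3.8, 0.20, size=3000),   # A
    rng.normal(3.7, 0.03, size=3000),   # B
])
order, low = rank_by_lower_bound(draws)
means = draws.mean(axis=0)
```

Climber B ranks first here despite the lower posterior mean, which is the same effect that moves a sparsely logged climber down the table.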

The hardest boulders

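The same trick works for boulders. Ranking by the lower bound of the difficulty credible interval, these are the problems the model is most confident are hard, shown with their community grade and the model’s predicted V-grade. Because the ranking uses the lower bound rather than the predicted value, the Predicted column is not strictly decreasing: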

Rank	Boulder	Area	Grade	Predicted	Ascents
1	Power of Now	Magic Wood	8B+ (V14)	V15.8	26
2	Foxy Lady (dyno)	Magic Wood	8A (V11)	V15.2	33
3	Never ending story 1	Magic Wood	8A+ (V12)	V15.0	32
4	Ephyra	Chironico	8C+ (V16)	V15.6	7
5	Never ending story 2	Magic Wood	8A (V11)	V14.6	84
6	White Stripe	Brione	8A+ (V12)	V15.1	10
7	Mystic Stylez	Magic Wood	8B+ (V14)	V14.6	29
8	Pagan Poetry Low	Left Fork	8B (V13)	V14.5	42
9	Direct North	Buttermilks	8B+ (V14)	V14.6	34
10	The Mandala Sit	Buttermilks	8B (V13)	V14.4	42

Note what the list is not: it’s not simply the highest-graded problems in the dataset. Foxy Lady, graded 8A, ranks second — the model predicts it climbs more like V15. Only people who consistently send far harder problems ever manage it.

Explore the results

Every boulder in the dataset gets a predicted difficulty, a credible interval, and a residual — how much harder or softer it is than the median boulder of its grade. You can search by name, filter by area and grade, and sort by how much the model disagrees with the consensus.

Browse the full table →

Caveats & what’s next

This is a first pass. The model currently treats all ascents as independent, ignoring the fact that climbers improve over time. It doesn’t model boulder style (slab vs. roof vs. crimp), which likely explains some of the residual variance. And the negative sampling — treating unseen boulders as ambiguous — is a pragmatic simplification that could be improved with explicit “didn’t try” annotations.

Still, the signal is strong: from anonymous tick lists, the model recovers a difficulty scale that aligns with human judgment, spots sandbags and soft touches, and ranks the world’s best climbers. Not bad for something that has never touched rock.