The AgentFit methodology: how to measure API-documentation AI-readiness deterministically

Automatic Translation

This article has been translated automatically from Russian to English. The original is available in Russian.

From intuition to a metric

In the previous article I described 26 criteria for documentation AI-readiness and a ready-made prompt you can run any site through using a language model with web access. It worked as a checklist and as a talking point. But as a measurement it had two flaws I couldn’t forgive.

First, non-reproducibility. Running the same site through the language-model prompt twice gives two different results: the model sometimes finds llms.txt, sometimes gets lazy, and interprets “example realism” differently each time. The score drifted by ±5–10 points between runs. A metric that depends on the auditor’s mood is not a metric.

Second, no guarantees. A language model can hallucinate evidence exactly the way it hallucinates non-existent endpoints. The prompt explicitly forbade this (“training-data knowledge is NOT evidence”), but a prohibition in the prompt text is not the same thing as architectural impossibility.

So I rewrote the rubric in code. The result is AgentFit — a deterministic auditor in Go that emits byte-for-byte identical JSON for the same input, backs every score with a concrete HTTP response, and makes zero calls to a language model during the audit. This article is about the measurement methodology behind it: the formal setup, the scoring function, the two places where machine learning is still needed, the regression gate on reference sites, and a check of how the scale behaves on a corpus of ~2000 sites.

Let me separate two notions up front that are easy to conflate and that I keep apart throughout the article. Reliability is reproducibility: the same input yields the same output. AgentFit provides this fully and by construction. Validity is the question “are we even measuring what we want to measure?” That is far harder to guarantee, and the honest answer is: only partially — there’s a dedicated section on it at the end, with no sugar-coating. (Both terms and all the rest are in the Terms section below.)

This is not “how to write yet another linter.” It’s an attempt to answer the question: can you objectively and reproducibly measure something as fuzzy as “how ready is this documentation to be read by an agent” — and at what cost.

Formal setup

Call an audit a function

audit: BASE_URL → Report

where BASE_URL is the root URL of the documentation with no trailing slash (e.g. https://docs.stripe.com), and Report is a structured report. Internally, audit unfolds in two stages — BASE_URL → Env → Report (environment prefetch, then the fan-out of criteria); details are in the architecture section. The rubric R is an ordered set of 30 criteria grouped into 6 categories A…F:

R = { c₁, c₂, …, c₃₀ }

Each criterion cᵢ is a triple (IDᵢ, mᵢ, runᵢ):

  • IDᵢ — a stable identifier (A1, C1b, F4, …);
  • mᵢ ∈ ℕ — the criterion’s maximum points;
  • runᵢ: Env → Resultᵢ — a deterministic scoring procedure.

The maxima are normalized to sum to 100:

Σ mᵢ = 100

A criterion’s result is a record Resultᵢ; three of its fields matter for scoring:

  • statusᵢ ∈ { present, partial, absent, error, not_applicable };
  • scoreᵢ ∈ {0, 1, …, mᵢ} — an integer number of points;
  • evidenceᵢ = (url, http_status, snippet₅₀) — the evidence: a URL, a response code, and a snippet of the body.

The status-to-score relationship is fixed and verified on output:

| Status | Score | Meaning |
| --- | --- | --- |
| present | scoreᵢ = mᵢ | all required signals found |
| partial | 0 < scoreᵢ < mᵢ | some signals |
| absent | scoreᵢ = 0 | no signals, but the check ran fine |
| error | scoreᵢ = 0 | the check could not run (fetch, parse, or inference failed) |
| not_applicable | scoreᵢ = 0 | the precondition is absent (e.g. nothing to assess: no OpenAPI spec) |

The split between error and not_applicable arrived in v2.2.0; before that there was a single unknown status. Both new statuses score 0, so merging them recovers the old unknown, and total scores stay comparable with those in the original article. The distinction matters for diagnostics: error means “our check broke or the site blocked us,” not_applicable means “the site genuinely doesn’t have what we’re checking for.” Conflating them means confusing a defect in the measuring instrument with a fact about the thing being measured.
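The status-to-score table can be enforced mechanically. A minimal sketch in Go, assuming illustrative type and field names (`Result`, `Evidence`, `CheckStatusScore` are not taken from the AgentFit source):

```go
package main

import "fmt"

// Status is the outcome of one criterion.
type Status string

const (
	Present       Status = "present"
	Partial       Status = "partial"
	Absent        Status = "absent"
	Error         Status = "error"
	NotApplicable Status = "not_applicable"
)

// Evidence backs a score with a concrete HTTP response.
type Evidence struct {
	URL        string
	HTTPStatus int
	Snippet    string // at most 50 runes
}

// Result is what a criterion returns (field names are hypothetical).
type Result struct {
	ID       string
	Max      int
	Status   Status
	Score    int
	Evidence Evidence
}

// CheckStatusScore enforces the status-to-score table and the snippet limit.
func CheckStatusScore(r Result) error {
	ok := false
	switch r.Status {
	case Present:
		ok = r.Score == r.Max
	case Partial:
		ok = r.Score > 0 && r.Score < r.Max
	case Absent, Error, NotApplicable:
		ok = r.Score == 0
	}
	if !ok {
		return fmt.Errorf("%s: status %q is inconsistent with score %d/%d",
			r.ID, r.Status, r.Score, r.Max)
	}
	if n := len([]rune(r.Evidence.Snippet)); n > 50 {
		return fmt.Errorf("%s: snippet has %d runes, limit is 50", r.ID, n)
	}
	return nil
}

func main() {
	// partial must be strictly below the maximum, so this is rejected.
	r := Result{ID: "E1", Max: 6, Status: Partial, Score: 6}
	fmt.Println(CheckStatusScore(r) != nil) // true
}
```

An unknown status falls through the switch and is rejected, which keeps the set of statuses closed.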

The scoring function

The total score is just a sum:

total = Σᵢ scoreᵢ          (0 ≤ total ≤ 100)

The category score for k ∈ {A,…,F}:

score(k) = Σ_{cᵢ ∈ k} scoreᵢ

No weighting coefficients “on the fly,” no normalization relative to other sites, no eyeballed rounding. A criterion’s weight is already baked into mᵢ — that is, into the structure of the rubric itself, not into the moment of scoring. This is a deliberate choice: the metric must be absolute, so a site’s score doesn’t change depending on which other sites happen to be in today’s sample. (Below, in the validity section, I’ll show that switching to relative, corpus-dependent scoring buys nothing — the ranking doesn’t change.)
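Both sums fit in a few lines. A sketch in Go, assuming (as the IDs A1, C1b, F4 suggest, but the text does not state) that a criterion's category is the first letter of its ID; `Scored`, `Total`, and `ByCategory` are hypothetical names:

```go
package main

import "fmt"

// Scored is the part of a criterion result that matters for scoring.
type Scored struct {
	ID    string // e.g. "A1", "C1b", "F4"
	Score int
}

// Total is the plain sum of criterion scores: no weights applied at scoring time.
func Total(rs []Scored) int {
	sum := 0
	for _, r := range rs {
		sum += r.Score
	}
	return sum
}

// ByCategory sums scores per category. Assumption: the category is the
// first letter of the criterion ID.
func ByCategory(rs []Scored) map[string]int {
	out := make(map[string]int)
	for _, r := range rs {
		if r.ID == "" {
			continue
		}
		out[r.ID[:1]] += r.Score
	}
	return out
}

func main() {
	rs := []Scored{{"A1", 3}, {"A2", 2}, {"E1", 6}}
	fmt.Println(Total(rs), ByCategory(rs)) // 11 map[A:5 E:6]
}
```

Because the weights live in the maxima mᵢ, nothing here depends on any other site in the sample.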

The gating criterion

One criterion stands apart — E1 (content visible in plain HTML without JavaScript, 6 points). It is the gating criterion. If score(E1) = 0 (content renders only via JavaScript, and curl returns an empty shell <div id="root"></div>), the report gets an unreliable marker:

unreliable = (score(E1) = 0)  ∨  vk_trap_detected

The total is still computed — gating doesn’t zero the score, it disqualifies it as reliable. The logic is the same as in Google’s Lighthouse: “there is a value, but you can’t trust it.” A site can earn a decent score off plain-text .md versions layered over a single-page app — that’s what Anthropic does: the main HTML is a single-page-app shell (E1 fails), but a separate .md-suffixed route and a 76 MB llms-full.txt serve the same content as clean text, and real accessibility for the model doesn’t suffer. In this case the unreliable marker honestly says: “the main HTML is unusable, but the fallback channel saves it.”

vk_trap_detected is a separate detector, covered below; it catches sites that return HTTP 200 and the same HTML for any URL.

Requirements for the measurement methodology

For a number to be a metric rather than an opinion, the methodology must satisfy four requirements. These are not “implementation features” — they’re the measurement contract, and violating any of them makes the score meaningless.

  1. Determinism (reproducibility). For the same input, audit must return a byte-for-byte identical Report. Any source of non-determinism (map iteration order, an unsorted page sample, the current time, races between checks) is a measurement defect.
  2. Evidence-based. Every scoreᵢ is accompanied by evidenceᵢ — a real URL, an HTTP response code, and a body snippet ≤ 50 characters. A score without proof is invalid.
  3. No language model at runtime. Zero calls to language models during an audit. Any semantics that can’t be expressed with a regular expression is expressed by two small embedded classifiers (see below). The reason is requirement 1: a language model is non-deterministic by construction.
  4. Bounded cost. An audit is not a crawl. ≤ ~80 HTTP requests per site, a cap on concurrent connections to a single host, a body-size cap (10 MB), and a full audit in under ~30 seconds for a typical site.

What follows is how these requirements are met.

The rubric: categories, weights, and how they evolved

The original rubric (from the previous article) had 26 criteria in five categories. The current one (AgentFit v2.3+) has 30 criteria in six. A new category, F. Agent Surface, was added: the layer on top of the documentation that faces the programmatic agent (llms.txt as a discovery channel, WebMCP, an MCP server, DOM accessibility for the agent). To preserve the Σ mᵢ = 100 normalization, categories A–E were trimmed by exactly 10 points (A −5, B −2, C −1, D −1, E −1), and the freed-up 10 went to F.

| Category | Weight (article) | Weight (now) | What it checks |
| --- | --- | --- | --- |
| A. Discovery | 18 | 13 | llms.txt, llms-full.txt, robots.txt with AI-bot rules, a clean sitemap.xml, discovery tags |
| B. Per-page artifacts | 22 | 20 | a .md version of the page, JSON-LD, an absolute canonical, freshness, semantic <main>/<article> |
| C. API spec | 25 | 24 | OpenAPI/Swagger/AsyncAPI at a predictable URL and its validity, Postman/SDK, endpoint-page structure |
| D. Content | 20 | 19 | curl and SDK examples, payload realism, an error catalogue, auth and rate limits, a glossary, deprecation markers |
| E. Hygiene | 15 | 14 | content without JavaScript (gating), stable URLs, version in the URL, working links, ToS and AI policy |
| F. Agent Surface | - | 10 | discovery-surface breadth, WebMCP, an MCP server (RFC 9728/8414), DOM accessibility for the agent |
| Total | 100 | 100 | |

The weights aren’t derived from first principles — they’re tuned so the code’s score lands close to the manual score from the original article on a handful of reference sites (section below). That’s fine: the rubric is an operationalization of the fuzzy construct “AI-readiness,” and the weights in it play a role analogous to item weights in a composite scale. But there’s a pleasant consequence I checked separately: the final ranking barely depends on the specific weights (Spearman’s rank correlation ρ = 0.960 under reweighting — what exactly that means and doesn’t mean, I unpack in the robustness section). There is arbitrariness in the choice of weights, but it has almost no effect on the order of sites.

Inside category F there’s a deliberate overlap with A: both A1 (the quality of llms.txt) and F1 (discovery-surface breadth) look at llms.txt. This isn’t accidental duplication — it’s the same “depth vs breadth” move as in Lighthouse: one criterion scores “how well it’s done,” the other “how many distinct surfaces even exist.”

How the auditor is built

An audit is two phases.

Phase 1 — buildEnv (prefetch). In parallel (via an errgroup), the resources that many criteria need at once are fetched: the homepage, /robots.txt, /sitemap.xml, /llms.txt, and an OpenAPI probe over a catalogue of 17 standard paths (/openapi.json, /swagger.json, /api-docs, …). The catalogue is tiered: the 6 common paths first, the other 11 only if the first ones miss; and for sub-paths under BASE_URL, each path is probed both at the host root and relative to BASE_URL. The results are collected into an immutable Env structure. If a probe fails, that’s recorded, but the audit isn’t aborted; a dependent criterion simply returns error.
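The tiered probe can be sketched as follows. This is an illustration under assumptions, not the AgentFit code: the `probe` callback stands in for an HTTP request, the split of the three named paths into tiers is made up for the example, the root-before-base order is a guess, and probing here is sequential for brevity where the real prefetch runs in parallel.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// Candidates expands spec paths into absolute URLs: host root first, then
// relative to baseURL. Duplicates (when baseURL is the root) are dropped.
func Candidates(baseURL string, paths []string) ([]string, error) {
	u, err := url.Parse(baseURL)
	if err != nil {
		return nil, err
	}
	root := u.Scheme + "://" + u.Host
	base := strings.TrimRight(baseURL, "/")
	seen := make(map[string]bool)
	var out []string
	for _, p := range paths {
		for _, c := range []string{root + p, base + p} {
			if !seen[c] {
				seen[c] = true
				out = append(out, c)
			}
		}
	}
	return out, nil
}

// ProbeTiered tries each tier in order and stops at the first hit, so later
// tiers cost requests only when earlier ones miss.
func ProbeTiered(baseURL string, tiers [][]string, probe func(string) bool) (string, bool) {
	for _, tier := range tiers {
		urls, err := Candidates(baseURL, tier)
		if err != nil {
			return "", false
		}
		for _, candidate := range urls {
			if probe(candidate) {
				return candidate, true
			}
		}
	}
	return "", false
}

func main() {
	// Hypothetical tier assignment of the paths named in the text.
	tiers := [][]string{{"/openapi.json", "/swagger.json"}, {"/api-docs"}}
	// A stand-in for an HTTP probe: pretend only one URL serves a spec.
	probe := func(u string) bool { return u == "https://docs.github.com/en/rest/api-docs" }
	fmt.Println(ProbeTiered("https://docs.github.com/en/rest", tiers, probe))
}
```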

Phase 2 — the fan-out of criteria. The 30 criteria run in parallel, again via errgroup. The key invariant: run never returns a Go error. Any failure inside it becomes Result{status: error, score: 0}. This keeps the concurrent code clean and guarantees the result slice is always complete — all 30 cells filled.

The criteria don’t talk to each other. Each is a pure function from Env to Result, with no shared mutable state. (One real bug in the project’s history was exactly about this: the orchestrator was reused across bulk-audit workers and they raced on a shared diagnostics field — the fix was to construct the orchestrator per call.)
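A sketch of the fan-out invariant. To stay within the standard library it uses `sync.WaitGroup` rather than errgroup, and it treats a panic as one kind of "failure inside run"; both choices are assumptions of this example, as are all type names.

```go
package main

import (
	"fmt"
	"sync"
)

// Env is the immutable prefetch result shared by all criteria.
type Env struct{ Homepage string }

type Result struct {
	ID     string
	Status string
	Score  int
}

type Criterion struct {
	ID  string
	Max int
	Run func(Env) Result // never returns a Go error
}

// RunAll executes every criterion concurrently. Each goroutine writes only
// to its own slot, so output order equals registry order regardless of timing.
func RunAll(env Env, registry []Criterion) []Result {
	results := make([]Result, len(registry))
	var wg sync.WaitGroup
	for i, c := range registry {
		wg.Add(1)
		go func(i int, c Criterion) {
			defer wg.Done()
			defer func() {
				// Any failure becomes an error result; the slot is never left empty.
				if rec := recover(); rec != nil {
					results[i] = Result{ID: c.ID, Status: "error", Score: 0}
				}
			}()
			results[i] = c.Run(env)
		}(i, c)
	}
	wg.Wait()
	return results
}

func main() {
	registry := []Criterion{
		{ID: "A1", Max: 5, Run: func(Env) Result { return Result{ID: "A1", Status: "present", Score: 5} }},
		{ID: "A2", Max: 3, Run: func(Env) Result { panic("parser blew up") }},
	}
	for _, r := range RunAll(Env{}, registry) {
		fmt.Println(r.ID, r.Status, r.Score)
	}
}
```

Since no goroutine shares mutable state with another, there is nothing to lock, which is the property the per-call orchestrator fix restored.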

All outbound HTTP goes through a single fetch.Client, which provides: User-Agent modes (curl, Googlebot, browser, empty), an in-memory cache for the duration of one audit, a “no more than N concurrent connections per host” limiter, and a body-size cap. For llms-full.txt (Anthropic publishes 76 MB) the body isn’t downloaded in full — a Range: bytes=0-… is taken just for the evidence snippet.

Determinism mechanisms

Requirement 1 (reproducibility) is met not “by agreement” but with concrete techniques:

  • Fixed criterion order. The slice in the registry is hard-ordered A1 → F4; this is part of the output contract.
  • Deterministic page sampling. When a criterion needs N pages from sitemap.xml, it doesn’t take them “as they come” — it sorts the URLs lexicographically and picks at uniform indices [0, n/N, 2n/N, …]. The same site → the same pages → the same score.
  • Injected time. The audit_date comes from an injectable Now() function; in tests it’s fixed. Time is the only “live” input, and it’s isolated.
  • Ordered map output. Go’s map iteration order is undefined, so categories are serialized through an intermediate ordered structure.
  • Output validation. Before returning, the Report is checked against invariants: exactly 30 criteria in canonical order; Σ scoreᵢ = total; for each category, the sum of its criteria equals score(k); MaxScore = 100; unreliable ⟺ ¬E1GatingPassed; every snippet ≤ 50 runes. In tests a violation is a panic; in prod it’s an HTTP 500. A metric that doesn’t add up with itself doesn’t go out the door.
  • A determinism test. The same fixture-site is audited twice; the JSON is compared byte by byte.
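The deterministic page sampling from the list above can be sketched like this. The function name is hypothetical, and the integer rounding of the indices (i·n/N) and the "return everything when the sitemap is small" branch are assumptions of the example:

```go
package main

import (
	"fmt"
	"sort"
)

// SamplePages sorts the URLs lexicographically and picks `want` of them at
// uniform indices 0, n/want, 2n/want, ... The input order does not matter.
func SamplePages(urls []string, want int) []string {
	if want <= 0 {
		return nil
	}
	sorted := append([]string(nil), urls...) // copy: do not reorder the caller's slice
	sort.Strings(sorted)
	if len(sorted) <= want {
		return sorted
	}
	out := make([]string, 0, want)
	for i := 0; i < want; i++ {
		out = append(out, sorted[i*len(sorted)/want])
	}
	return out
}

func main() {
	a := []string{"/d", "/a", "/c", "/b", "/f", "/e"}
	b := []string{"/f", "/e", "/d", "/c", "/b", "/a"} // same set, different order
	fmt.Println(SamplePages(a, 3)) // [/a /c /e]
	fmt.Println(SamplePages(b, 3)) // [/a /c /e]
}
```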

The VK-trap detector

One trap deserves its own mention; I named it after the site where I first saw it. Some sites return HTTP 200 and the same ~5 KB HTML shell for any requested URL — be it /llms.txt, /sitemap.xml, or a random path. The status code lies: it says “all good,” while the body carries no information.

A naive auditor would take such a response for a valid llms.txt or a valid Markdown page. The defense: E1 computes a SHA-256 of the body for every fetched URL. If ≥ 3 distinct URLs return the same hash, vk_trap_detected fires, E1 is force-zeroed, and the whole report gets the unreliable marker. The same principle — “HTTP 200 is the server’s opinion, not a guarantee about the content” — recurs in several criteria: A1, A2, and B1 all discard a body that starts with <!doctype/<html>, even at status 200.
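A minimal sketch of the detector, assuming the fetched bodies are available as a URL-to-body map; `DetectTrap` and `Unreliable` are hypothetical names, and the threshold of 3 comes from the text:

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// DetectTrap reports whether at least minURLs distinct URLs returned a
// byte-identical body. Keys of the map are URLs, so each URL counts once.
func DetectTrap(bodies map[string][]byte, minURLs int) bool {
	counts := make(map[[32]byte]int)
	for _, body := range bodies {
		h := sha256.Sum256(body)
		counts[h]++
		if counts[h] >= minURLs {
			return true
		}
	}
	return false
}

// Unreliable mirrors the gating rule: E1 scored zero, or the trap fired.
func Unreliable(e1Score int, trap bool) bool {
	return e1Score == 0 || trap
}

func main() {
	shell := []byte(`<!doctype html><div id="root"></div>`)
	bodies := map[string][]byte{
		"/llms.txt":    shell,
		"/sitemap.xml": shell,
		"/no-such-url": shell,
		"/":            []byte("<html>real homepage</html>"),
	}
	trap := DetectTrap(bodies, 3)
	fmt.Println(trap, Unreliable(6, trap)) // true true
}
```

Map iteration order is random in Go, but the answer is a yes/no over the whole set, so the result stays deterministic.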

Two places where ML is genuinely needed

Most criteria are HTTP probes, parsing (HTML via goquery, XML, JSON, OpenAPI via libopenapi), and regex/keyword heuristics. But exactly two criteria require a semantic judgment where regexes and keywords hit a ceiling and stop improving:

  • D2 — “are the examples realistic?” (POST /users with name: "Jane Doe" vs foo/bar/<your_api_key>);
  • C3 — “is the endpoint page complete?” (method, URL, types, the required flag, request and response examples).

For these — and only these — two small ONNX models are embedded into the binary (via //go:embed). This doesn’t contradict the “no LLM at runtime” requirement: the models are tiny, deterministic, and their inference is negligibly cheap next to parsing the page itself.

The key point: why ML only here. The other “hard” criteria — C2 (presence of Postman/SDK) and D5 (a glossary) — are link-finding and structural-parsing tasks, not semantics. For D5, ML was considered and rejected by an architecture review: the hand-engineered features (the number of spellings of a term, the dominant spelling’s share, the share of inconsistent groups) are the decision rule, with a threshold; a trained model would just reproduce them on a 50-site sample with heavy “feature = label” circularity. A threshold is simpler, more transparent, and reads right off the evidence. A good reminder: don’t reach for ML when the features already are the rule.

D2 — the placeholder detector

  • Input: the text of a code block
  • Features: 20 numbers: 7 “placeholder” signals (p_bootstrap, p_angle_type, p_templating, p_your_token, p_xxxx, p_repeat_digits, p_bracket), 8 “realistic” signals (r_stripe_key, r_uuid, r_bearer, r_api_url, r_iso_ts, r_typed_id, r_long_numeric_id, r_jwt), 5 surface features (length_log, digit_ratio, upper_ratio, punct_ratio, line_count_log)
  • Model: StandardScaler + LogisticRegression → P(placeholder) ∈ [0,1]
  • Labels: realistic, placeholder
  • Quality: accuracy 0.970, AUC 0.998 (trained on 3837 code blocks, tested on 960; the metric is optimistic, see the caveat at the end of the section)

D2 collects all code blocks from 3 sample pages, computes P(placeholder) for each, flags a block as a placeholder when P > 0.6, and computes the placeholder_ratio. The score is a step function of the ratio: < 0.20 → 4, < 0.40 → 3, < 0.60 → 2, < 0.80 → 1, ≥ 0.80 → 0.
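The D2 pipeline (standardize, apply logistic regression, flag at 0.6, map the ratio to points) can be sketched as below. The model numbers are toy values with two features, not the trained 20-feature model, and the type and function names are hypothetical. What to do with an empty sample is not stated in the text, so the sketch leaves it to the caller.

```go
package main

import (
	"fmt"
	"math"
)

// LogisticModel holds StandardScaler parameters and logistic-regression
// weights. The values used in main are hypothetical.
type LogisticModel struct {
	Mean, Scale, Weight []float64
	Bias                float64
}

// Predict standardizes x and applies the sigmoid to the linear score.
func (m LogisticModel) Predict(x []float64) float64 {
	z := m.Bias
	for i, v := range x {
		z += m.Weight[i] * (v - m.Mean[i]) / m.Scale[i]
	}
	return 1 / (1 + math.Exp(-z))
}

// PlaceholderRatio returns the share of blocks with P(placeholder) > 0.6.
// ok is false when there are no blocks.
func PlaceholderRatio(probs []float64) (ratio float64, ok bool) {
	if len(probs) == 0 {
		return 0, false
	}
	flagged := 0
	for _, p := range probs {
		if p > 0.6 {
			flagged++
		}
	}
	return float64(flagged) / float64(len(probs)), true
}

// D2Score is the step function from the text.
func D2Score(ratio float64) int {
	switch {
	case ratio < 0.20:
		return 4
	case ratio < 0.40:
		return 3
	case ratio < 0.60:
		return 2
	case ratio < 0.80:
		return 1
	default:
		return 0
	}
}

func main() {
	m := LogisticModel{
		Mean: []float64{0, 0}, Scale: []float64{1, 1},
		Weight: []float64{2, -2}, Bias: 0,
	}
	blocks := [][]float64{{1, 0}, {0, 1}, {0, 0}}
	probs := make([]float64, len(blocks))
	for i, x := range blocks {
		probs[i] = m.Predict(x)
	}
	if ratio, ok := PlaceholderRatio(probs); ok {
		fmt.Println(D2Score(ratio)) // 3: one of three blocks is flagged
	}
}
```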

C3 — endpoint-page completeness

  • Input: a feature vector of the HTML page
  • Features: 15 numbers: the count of h2/h3 headings, code-block count, method-keyword density, presence of a parameter table, the number of required, presence of curl, presence of a response example, presence of a path pattern, the number of per-SDK-language blocks, the number of status codes, the number of type annotations, presence of a <dl> definition list, the number of inline code spans, the share of “navigation” headings (filters out shell pages), the number of headings carrying a method name
  • Model: StandardScaler + GradientBoostingClassifier (max_depth=4, n_estimators=100) → softmax over {complete, partial, absent}
  • Quality: macro-F1 0.870 (holdout; the metric is optimistic, see the caveat at the end of the section)

C3 takes up to 3 endpoint pages, classifies each, and scores by the majority class: complete → 5, partial → 3, absent → 1, no pages → 0.
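The majority-class scoring can be sketched as follows. The text does not say how ties are broken (with three pages, 1-1-1 is possible), so this example assumes the conservative choice: a tie resolves to the less complete class. The function name is hypothetical.

```go
package main

import "fmt"

// C3Score maps the per-page classes to points by majority vote.
// Assumption: ties resolve to the less complete class.
func C3Score(classes []string) int {
	if len(classes) == 0 {
		return 0 // no endpoint pages
	}
	counts := make(map[string]int)
	for _, c := range classes {
		counts[c]++
	}
	// Walk from least to most complete and switch only on a strictly
	// larger count, so the earlier (lower) class wins a tie.
	best := "absent"
	for _, c := range []string{"absent", "partial", "complete"} {
		if counts[c] > counts[best] {
			best = c
		}
	}
	points := map[string]int{"complete": 5, "partial": 3, "absent": 1}
	return points[best]
}

func main() {
	fmt.Println(C3Score([]string{"complete", "complete", "partial"})) // 5
	fmt.Println(C3Score([]string{"complete", "absent"}))              // 1 (tie, conservative)
	fmt.Println(C3Score(nil))                                         // 0
}
```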

Discipline around the models

Two things, without which ML in a deterministic auditor turns into a source of silent bugs.

Feature-extraction parity, Go ↔ Python. Features are extracted in two languages: in Python during training, in Go at runtime. The slightest divergence in preprocessing = different predictions. So the feature order is pinned in four places at once (a Go constant, the model’s JSON sidecar, the Python schema, and the literal in the Go inference), and a parity test runs canonical inputs through both implementations and compares.

Transposition guard (for C3). If someone accidentally swaps two features in the Go literal, the model keeps working — just silently wrong. To catch this, C3 does a full pairwise feature sweep at training time; pairs whose swap shifts a class probability by ≥ 0.5 are recorded in the sidecar along with a reference vector. A Go test reads those pairs and verifies the swap really does shift the probability — i.e. that the feature order in the code hasn’t “drifted.” (D2 is protected only by the parity test; it has no separate transposition check.)
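The Go side of the transposition guard might look like the sketch below. It assumes the sidecar has already been parsed into a reference vector and a list of index pairs, and it replaces the real model with a toy `predict` function that only looks at feature 0; every name here is hypothetical.

```go
package main

import (
	"fmt"
	"math"
)

// SwapPair names two feature indices whose swap must move the prediction.
type SwapPair struct{ I, J int }

// SwapShift returns how far the predicted probability moves when features
// I and J of the reference vector are exchanged.
func SwapShift(predict func([]float64) float64, ref []float64, p SwapPair) float64 {
	swapped := append([]float64(nil), ref...) // copy: keep ref intact
	swapped[p.I], swapped[p.J] = swapped[p.J], swapped[p.I]
	return math.Abs(predict(ref) - predict(swapped))
}

// FailingPairs lists recorded pairs whose swap no longer shifts the
// probability by at least minShift. A non-empty result suggests the
// feature order in the code has drifted from the one used in training.
func FailingPairs(predict func([]float64) float64, ref []float64, pairs []SwapPair, minShift float64) []SwapPair {
	var failed []SwapPair
	for _, p := range pairs {
		if SwapShift(predict, ref, p) < minShift {
			failed = append(failed, p)
		}
	}
	return failed
}

// toyPredict stands in for the real classifier.
func toyPredict(x []float64) float64 {
	if x[0] > 1 {
		return 0.9
	}
	return 0.1
}

func main() {
	ref := []float64{5, 0, 0}
	pairs := []SwapPair{{0, 1}, {0, 2}}
	fmt.Println(len(FailingPairs(toyPredict, ref, pairs, 0.5))) // 0
}
```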

Two honest caveats, written right into the criteria’s Note field. First: these are heuristics that approximate semantics, not understanding; the models are trained on English-language docs, and non-English sites may be underscored. Second, less pleasant: most of the training labels for D2/C3 were obtained by automatically expanding rules over the same signals that later became the features. This is the same “feature = label” circularity for which I rejected ML in D5 — just partial here rather than total. The practical takeaway: the reported holdout metrics (0.970 / 0.870) are optimistic — they measure the model’s agreement with the rule-labeler, not with an independent human. I cite them as an indicator that “the model learned the rule and generalizes past its edges” (manual labeling of edge cases confirms this), not as an estimate of “true” accuracy.

Reliability, calibration, and (a little) validity

What follows are three different checks that I deliberately don’t lump into one pile labeled “validation.” The first is about reliability (a regression gate). The second is about robustness to the choice of weights (ρ). The third is about how the scale behaves on an unseen corpus. None of them, strictly speaking, proves validity; what they do and don’t prove is stated explicitly.

Regression control on reference sites (test-retest)

There are 7 reference sites with fixed expected scores. cmd/calibrate audits them and checks against ExpectedTotal ± Tolerance. The base rule (from the project plan): a deviation > 15 points is a heuristic bug, not a property of the site. The tolerance is tightened for “well-behaved” sites (Stripe, Anthropic), where a large jump signals a regression, and tighter still for VK, whose score band is narrow (0–10).

| Site | Expected score | Tolerance | E1 gating |
| --- | --- | --- | --- |
| docs.emergingtravel.com | 69 | ±12 | PASS |
| developers.booking.com | 48 | ±12 | PASS |
| docs.stripe.com | 42 | ±10 | PASS |
| docs.anthropic.com | 30 | ±10 | FAIL (a single-page app, score held up by .md) |
| docs.github.com/en/rest | 25 | ±10 | PASS (but anti-scraping hits the auditor) |
| developers.expediagroup.com | 23 | ±12 | partial |
| dev.vk.com | 10 | ±4 | FAIL (the VK-trap) |

Beyond the total tolerance there’s CategoryTolerance — a gate on drift within a category (±4, ±3 for VK). It catches the case where the total stayed in range but one category sagged while another rose — a mutual compensation the overall tolerance would miss.
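A sketch of the two-level gate. The struct fields echo the names in the text (ExpectedTotal, Tolerance, CategoryTolerance), but the layout, the `Check` function, and the per-category reference numbers in `main` are invented for the example.

```go
package main

import "fmt"

// Reference is one calibration site with its regression snapshot.
type Reference struct {
	Site              string
	ExpectedTotal     int
	Tolerance         int
	ExpectedCategory  map[string]int // hypothetical per-category snapshot
	CategoryTolerance int
}

func abs(x int) int {
	if x < 0 {
		return -x
	}
	return x
}

// Check returns human-readable violations; an empty slice means the gate passes.
// Categories are walked in a fixed order so the report itself is deterministic.
func Check(ref Reference, total int, cats map[string]int) []string {
	var out []string
	if d := abs(total - ref.ExpectedTotal); d > ref.Tolerance {
		out = append(out, fmt.Sprintf("%s: total %d is %d off expected %d (tolerance %d)",
			ref.Site, total, d, ref.ExpectedTotal, ref.Tolerance))
	}
	for _, k := range []string{"A", "B", "C", "D", "E", "F"} {
		if d := abs(cats[k] - ref.ExpectedCategory[k]); d > ref.CategoryTolerance {
			out = append(out, fmt.Sprintf("%s: category %s drifted by %d (tolerance %d)",
				ref.Site, k, d, ref.CategoryTolerance))
		}
	}
	return out
}

func main() {
	ref := Reference{
		Site: "docs.stripe.com", ExpectedTotal: 42, Tolerance: 10,
		ExpectedCategory:  map[string]int{"A": 5, "B": 8, "C": 12, "D": 10, "E": 6, "F": 1},
		CategoryTolerance: 4,
	}
	// Total is unchanged (42), but A rose by 5 while C fell by 5.
	observed := map[string]int{"A": 10, "B": 8, "C": 7, "D": 10, "E": 6, "F": 1}
	for _, v := range Check(ref, 42, observed) {
		fmt.Println(v)
	}
}
```

The example run passes the total gate and still reports two violations, which is exactly the mutual-compensation case the category gate exists for.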

And here is the main caveat, the one it’s easy to fool yourself with (and the one I fooled myself with at first). This is NOT construct validity. The expected scores in the table are not independent expert assessments. They are re-snapshotted actual values of a single live run, re-fixed as a regression “fuse.” Checking whether the auditor reproduces its own recent output within tolerance is test-retest reliability, within a single version — not validity (when the heuristic is changed deliberately, the baseline is consciously re-snapshotted — more on this below). The only external anchor here is the manual scores from the original article, and only on these same 7 sites (so the sample is n = 7, and calibration and verification run over the same sites — that’s in-sample). So the table says exactly one thing: “the instrument is stable and doesn’t drift away from its own prior version.” That’s valuable as a regression signal — but it’s not proof that Stripe’s 42 is “correct.”

The reference numbers themselves are the actual values of a single run plus its noise (±0–3 points, from the random link sample in E4 and from anti-scraping blocks). Before the tolerances are tightened, the references need to be re-snapshotted across several runs. A separate trap: Cloudflare blocks docs.github.com by our client’s TLS fingerprint and escalates with audit frequency, so GitHub’s total oscillates between 25 and 1. A FAIL on GitHub is therefore almost always a transient block, not score drift, and needs reconfirming with a single “cooled-down” run. (This, by the way, is a limitation of any active-measurement methodology: the object being measured can resist measurement.)

Robustness to the choice of weights (ρ = 0.960)

The natural objection: the fixed weights mᵢ are hand-tuned and surely suboptimal — maybe switch to corpus-relative scoring (reweight criteria by their empirical distribution function, the way Lighthouse’s log-normal scoring does)?

I checked this explicitly (the lognorm_gate.py script). I took a corpus of 112 sites re-audited under v2.3.2 (the first version with all 30 criteria, including F4; 106 reached a report, with sites blocked by anti-scraping and unreadable responses excluded), computed two rankings — by the current fixed scale and by corpus-relative reweighting of the same criteria — and compared them with Spearman’s rank correlation:

ρ(fixed scale, relative scale) = 0.960     (acceptance gate: 0.85)

The 0.85 gate was set in advance, before the run, as a conservative bound for “strong” rank correlation: if switching to the relative scale rearranged sites more substantially, we’d have seen it and taken it seriously. (Before F4 existed, the same test on 29 criteria gave ρ = 0.965 — so adding the agent criterion didn’t break the ranking either.)
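For reference, Spearman's ρ is the Pearson correlation of the two rank vectors, with tied values sharing their average rank. The sketch below is a generic Go implementation, not a port of lognorm_gate.py, and the scores in `main` are made-up numbers. It returns NaN when either input has no variance.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// ranks assigns 1-based ranks; tied values share the average of their ranks.
func ranks(xs []float64) []float64 {
	idx := make([]int, len(xs))
	for i := range idx {
		idx[i] = i
	}
	sort.SliceStable(idx, func(a, b int) bool { return xs[idx[a]] < xs[idx[b]] })
	r := make([]float64, len(xs))
	for i := 0; i < len(idx); {
		j := i
		for j+1 < len(idx) && xs[idx[j+1]] == xs[idx[i]] {
			j++
		}
		avg := float64(i+j)/2 + 1 // average rank of the tied run i..j
		for k := i; k <= j; k++ {
			r[idx[k]] = avg
		}
		i = j + 1
	}
	return r
}

// Spearman computes the Pearson correlation of the rank vectors.
// Both slices must have the same length.
func Spearman(a, b []float64) float64 {
	ra, rb := ranks(a), ranks(b)
	n := float64(len(ra))
	var ma, mb float64
	for i := range ra {
		ma += ra[i]
		mb += rb[i]
	}
	ma /= n
	mb /= n
	var cov, va, vb float64
	for i := range ra {
		da, db := ra[i]-ma, rb[i]-mb
		cov += da * db
		va += da * da
		vb += db * db
	}
	return cov / math.Sqrt(va*vb)
}

func main() {
	fixed := []float64{69, 48, 42, 30, 25}          // hypothetical fixed-scale totals
	relative := []float64{0.9, 0.7, 0.75, 0.3, 0.2} // hypothetical relative scores
	rho := Spearman(fixed, relative)
	fmt.Printf("rho = %.3f, gate passed: %v\n", rho, rho >= 0.85)
}
```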

What this means and, more importantly, what it does NOT. Both rankings are produced by the instrument itself — there’s no expert, no external “truth” here. So ρ = 0.960 says: the order of sites barely depends on how the weights are set. This removes one source of arbitrariness (arguing about “8 vs 7 points for C1a” is pointless) — but it’s a self-comparison, and it doesn’t prove validity: a bad metric would survive this test just as well. A known result (Wainer, 1976): an additive sum of many correlated items is generally weakly sensitive to weights. So the conclusion is modest and honest: there’s no point complicating the scale (a negative result that saves work), but “robust to weights” ≠ “measures the right thing.”

Separately, the log-normal model is simply inapplicable here: at the criterion level the distribution is zero-inflated — 14 of the 30 criteria score 0 for ≥ 50% of the corpus, i.e. they have a point mass at zero, which a log-normal can’t represent in principle.

Distribution on an unseen corpus (face validity)

This isn’t validity in the strict sense — for those sites there’s no “truth” (human scores, or, even better, measurements of an agent’s real success on a task). It’s a sanity check: does the scale behave sensibly where it wasn’t calibrated. The corpus:

  • from the public-apis catalogue, ~1182 unique sites were extracted; after a liveness check ~776 are auditable (alive_html), the rest being dead, off-domain redirects, single-page-app shells, or non-HTML endpoints;
  • a separate broad sample: developer.* / docs.* subdomains from the top-25,000 domains of the Majestic Million ranking → 2099 net-new sites, run through a bulk audit (under v2.7.0, already 30 criteria). The mean score is 14.6, and the distribution of totals is heavily right-skewed. A caveat about the instrument: the bulk run used heuristic D2/C3 stubs (no ONNX), so I later re-audited the top (≥ 40) on prod with the real models — the divergence turned out to be small.

The low mean is an expected and sensible result, not a defect. Combing the top of a domain list catches mostly product docs, not API references; the rubric honestly scores them low. The signal lives in the tail: 108 sites scored ≥ 40, of which 14 scored ≥ 60 (real API docs: docs.z.ai 76, docs.kalshi.com 75, docs.hedera.com 70). In other words, the scale separates “a product landing page” from “a real API reference” in the right direction — that’s face validity, no more and no less.

Limitations

An honest methodology lists where it’s blind.

  • Heuristics approximate semantics; they don’t understand it. D2/C3 are trained classifiers, D5 is a threshold; none of them “reads” the documentation. This is a deliberate trade-off for determinism.
  • Single-page apps are underscored by design. E1 penalizes content hidden behind JavaScript — that’s its whole point. An optional headless render (via Chrome) exists, but is off by default: the render is non-deterministic (DOM hydration timing), which would violate requirement 1. So it’s either determinism or full measurement of single-page apps — there’s no third option on a single run.
  • The models are trained on an English-language corpus. Non-English documentation is systematically underscored in D2/C3.
  • Active measurement perturbs the object. Anti-scraping (Cloudflare on github) turns the score into noise; the methodology can tell a block (error) from an honest absence (absent/not_applicable), but it can’t get around the block — and shouldn’t.
  • Calibration on a single run carries noise. The reference numbers are the actual values of a single run ±0–3; they’re good as a regression fuse, but not as analytical truth.
  • I don’t have real validity. I’ve shown reliability (determinism), robustness to weights (ρ), and a sane distribution (face validity). What’s missing is checking the score against an independent external criterion: for instance, the real success of an agent writing code against this documentation. Until such a measurement exists, the score is a score on the rubric, not a proven “AI-readiness.”
  • The weights are calibrated, not derived. Their arbitrariness is bounded (the ranking is nearly insensitive to them) but not eliminated. It’s a scale, and it should be treated as one: useful, checkable, but not “objective” in the naive sense.

What this buys us

The main result isn’t Stripe’s or VK’s particular scores. The main result is that the fuzzy construct “documentation AI-readiness” was operationalized in a way that made the measurement reproducible, evidence-backed, and with honestly drawn boundaries of applicability — and cheap at that (< 30 seconds, zero language-model calls at runtime). Reliability is fully achieved; validity only partially (the distribution is sane, the ranking is robust to weights, on 7 sites it agrees with the manual score), and that boundary is stated explicitly, not swept aside.

Three takeaways I’d carry beyond this particular project:

  1. Reproducibility is a requirement, not a feature. The moment a metric starts being used for comparison and decisions, non-determinism (including “merely” a language model as the auditor) makes it unusable. Determinism had to be built into the architecture: sorted sampling, fixed order, injected time, invariant validation.
  2. ML is a last resort, not a first. Of 30 criteria, ML was needed for exactly two; for one similar one (D5) it was considered and rejected, because the features already were the decision rule. The discipline of “heuristics first, ML only where the semantics is irreducible” saved both complexity and sources of non-determinism.
  3. Negative results save work. That reweighting test is a “don’t complicate it,” confirmed by a number: the relative scale would have bought nothing, so I didn’t build it. The important thing is not to misread what it proves: robustness to weights is one source of arbitrariness removed, not proof of correctness. The metric should keep being checked — but now against an external criterion, not against itself.

The original article, with its 26 criteria, a ready-made prompt, and a Claude skill, is still the best way to quickly eyeball a site’s readiness by hand. This article is about what happens when you want to replace “eyeball” with “measure.”

Terms

Where possible I stick to plain words; but some terms are names of formats, protocols, and libraries that don’t translate, or established notions from psychometrics and statistics. Collected here.

Measurement methodology

  • AI-readiness — the degree to which documentation is fit to be read and used by a programmatic agent built on a language model. The fuzzy construct the whole article tries to operationalize.
  • Operationalization — turning a fuzzy notion into a concrete, measurable procedure (here: into a 30-criterion rubric with a scoring function).
  • Determinism — the property “the same input → byte-for-byte identical output.”
  • Reliability — reproducibility of the measurement.
  • Validity — whether the instrument measures what it claims to.
  • Construct validity — the correspondence of a measurement to the theoretical notion it supposedly measures.
  • Face validity — the weakest form: the measurement “at first glance” behaves sensibly.
  • Test-retest reliability — agreement of repeated measurements of the same object.
  • Rubric — an ordered set of 30 criteria with assigned weights; the scoring scheme.
  • Criterion — a triple “identifier, maximum points, scoring procedure.”
  • Heuristic — an approximate decision rule (here, on regexes and keywords).
  • Evidence — a real URL, an HTTP response code, and a body snippet backing a criterion’s score.
  • Gating — a technique where a metric’s value isn’t zeroed but is flagged as untrustworthy (unreliable).

Formats, protocols, technologies

  • HTTP — the web’s exchange protocol; response codes (200, 500, etc.) and headers (Range, User-Agent).
  • HTML / DOM — a page’s markup and its object model in the browser.
  • JavaScript / single-page app (SPA) — a site that renders content in the browser after loading; its “raw” HTML is empty.
  • Headless rendering — rendering a page with a browser engine without a graphical interface, to see the result of JavaScript.
  • OpenAPI / Swagger / AsyncAPI — formats for machine-readable API descriptions.
  • llms.txt / llms-full.txt — pointer files for language models.
  • robots.txt / sitemap.xml — standard files for search and AI robots: crawl rules and a site map.
  • Canonical URL / JSON-LD — per-page metadata: the page’s primary URL and structured markup.
  • MCP (Model Context Protocol) / WebMCP — a protocol and its web variant for an agent’s interactive access to tools on top of an API.
  • Endpoint — a concrete URL and method performing one operation.
  • SDK — a software development kit (client libraries for a specific language).
  • curl — a command-line HTTP tool; here, a synonym for a “raw” request without a browser.
  • ONNX — a portable machine-learning model format; the models are embedded in the binary and run locally.
  • errgroup, goquery, libopenapi — Go libraries: a group of goroutines (lightweight threads of execution in Go) with shared cancellation on the first error, HTML parsing, and OpenAPI parsing respectively.

Machine learning and statistics

  • Feature — a single input number for a model (e.g. the share of digits in the text).
  • StandardScaler — feature standardization (subtract the mean, divide by the spread).
  • LogisticRegression — logistic regression, a linear classifier.
  • GradientBoostingClassifier — a classifier based on gradient boosting of trees.
  • softmax — a function that turns a vector of numbers into a probability distribution over classes.
  • Holdout — a portion of the data held out from training for an honest check.
  • accuracy / AUC / macro-F1 — classifier quality metrics: the share of correct answers; the area under the ROC curve; the class-averaged F-measure.
  • Spearman’s rank correlation (ρ) — a measure of agreement between two rankings; 1 is a perfect match of order.
  • Zero-inflated distribution — a distribution with a large point mass at zero; a log-normal model doesn’t describe it.
  • Log-normal scoring — a way to map a value to a score through a log-normal distribution function (used in Lighthouse).

References

Measurement, reliability, and validity

  • Cronbach, L. J., & Meehl, P. E. (1955). Construct Validity in Psychological Tests. Psychological Bulletin, 52(4), 281–302. — the classic statement of construct validity.
  • Wainer, H. (1976). Estimating Coefficients in Linear Models: It Don’t Make No Nevermind. Psychological Bulletin, 83(2), 213–217. — on the weak sensitivity of additive scales to weights.
  • Spearman, C. (1904). The Proof and Measurement of Association between Two Things. The American Journal of Psychology, 15(1), 72–101. — the original work on rank correlation.
  • Spearman’s rank correlation coefficient — a modern account of ρ.
  • Zero-inflated model — on zero-inflated distributions.
