Benchmark Methodology & Evidence

This section exists to keep the comparison from being shallow. We publish the exact test shape, what was measured directly, what was not, and where third-party evidence is used instead.

Last reviewed: February 2026. Metrics can shift as vendors update infrastructure, pricing, and anti-bot behavior.

Methodology (Directly Measured)

We ran 2 attempts per URL across 8 public URLs (16 total runs per tool) from the same environment, measuring success rate, median latency, p95 latency, deterministic output rate (same content hash across runs), and output size. Markdown for Agents and Jina Reader were measured live because both are publicly callable from this environment. Firecrawl and Crawl4AI could not be run under the same conditions in this session, so their metrics are labeled as third-party evidence.
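
To make the run shape concrete, here is a minimal sketch of such a harness. It is illustrative only: the `build_request(url)` callable, attempt count constant, and error handling are hypothetical placeholders, not our production harness.

```python
import hashlib
import time
import urllib.request

ATTEMPTS_PER_URL = 2  # 2 attempts x 8 URLs = 16 runs per tool

def run_benchmark(tool_name, build_request, urls, timeout=60):
    """Collect one record per attempt: success flag, latency (ms), size, content hash.

    `build_request(url)` is a hypothetical callable returning a ready
    urllib.request.Request for the tool under test.
    """
    records = []
    for url in urls:
        for attempt in range(ATTEMPTS_PER_URL):
            start = time.perf_counter()
            try:
                with urllib.request.urlopen(build_request(url), timeout=timeout) as resp:
                    body = resp.read()
                ok = True
            except Exception:
                body, ok = b"", False
            latency_ms = (time.perf_counter() - start) * 1000.0
            records.append({
                "tool": tool_name,
                "url": url,
                "attempt": attempt,
                "ok": ok,
                "latency_ms": latency_ms,
                "bytes": len(body),
                # Same hash across both attempts for a URL => deterministic output
                "sha256": hashlib.sha256(body).hexdigest() if ok else None,
            })
    return records
```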

Test URLs included:
  • example.com
  • wikipedia.org/wiki/Markdown
  • docs.python.org/3/tutorial
  • rfc-editor.org/rfc/rfc9110
  • developer.mozilla.org/.../Accept
Tool                  Success Rate    Median Latency   P95 Latency   Determinism
Markdown for Agents   100% (16/16)    145 ms           1,213 ms      87.5% (7/8)
Jina Reader           87.5% (14/16)   2,446 ms         30,715 ms     87.5% (7/8)
Evidence status: insufficient_sample
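
The table metrics reduce to a few aggregations over the per-run records. A minimal sketch, assuming the record format from the harness sketch above; the percentile index and determinism denominator are illustrative assumptions, not the exact formulas used here.

```python
import statistics
from collections import defaultdict

def summarize(records):
    """Reduce per-run records to the metrics reported in the table above."""
    total = len(records)
    ok_runs = [r for r in records if r["ok"]]

    # Success rate over all attempts (e.g. 16/16 or 14/16).
    success_rate = len(ok_runs) / total

    # Latency over all attempts; counting failed attempts is an assumption,
    # since timeouts still cost wall-clock time.
    latencies = sorted(r["latency_ms"] for r in records)
    median_ms = statistics.median(latencies)
    p95_ms = latencies[min(total - 1, round(0.95 * (total - 1)))]

    # Determinism per URL: every successful attempt for a URL must yield
    # the same content hash (e.g. 7 of 8 URLs).
    hashes_by_url = defaultdict(set)
    for r in ok_runs:
        hashes_by_url[r["url"]].add(r["sha256"])
    deterministic = sum(1 for hashes in hashes_by_url.values() if len(hashes) == 1)
    determinism_rate = deterministic / len(hashes_by_url) if hashes_by_url else 0.0

    return {
        "success_rate": success_rate,
        "median_ms": median_ms,
        "p95_ms": p95_ms,
        "determinism_rate": determinism_rate,
    }
```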

Third-Party Signals (Clearly Labeled)

  • Firecrawl: 95.3% success, 16 pages/s, 89.0% RAG Recall@5 (Spider benchmark, Feb 2026)
  • Crawl4AI: 89.7% success, 12 pages/s, 84.5% RAG Recall@5 (Spider benchmark, Feb 2026)
  • Cloudflare Markdown for Agents: No neutral public benchmark found at review time (Cloudflare changelog/docs reviewed Feb 2026)

Third-party numbers are not treated as first-party truth. We include them as directional evidence only.

Sources: Spider benchmark, Jina ReaderLM v2 notes, Cloudflare changelog.