Benchmark Methodology & Evidence

This section exists to keep the comparison from being shallow. We publish the exact test shape, what was measured directly, what was not, and where third-party evidence is used instead.

Last reviewed: February 2026. Metrics can shift as vendors update infrastructure, pricing, and anti-bot behavior.

Methodology (Directly Measured)

We ran 2 attempts per URL across 8 public URLs (16 total runs per tool) from the same environment, measuring success rate, median latency, p95 latency, deterministic output rate (same content hash across runs), and output size. Markdown for Agents and Jina Reader were measured live because both are publicly callable from this environment. Firecrawl and Crawl4AI could not be run under the same conditions in this session, so their metrics are labeled as third-party evidence.
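
To make the run shape concrete, here is a minimal sketch of such a harness. It is illustrative only: the `build_request(url)` callable, attempt count constant, and error handling are hypothetical placeholders, not our production harness.

```python
import hashlib
import time
import urllib.request

ATTEMPTS_PER_URL = 2  # 2 attempts x 8 URLs = 16 runs per tool

def run_benchmark(tool_name, build_request, urls, timeout=60):
    """Collect one record per attempt: success flag, latency (ms), size, content hash.

    `build_request(url)` is a hypothetical callable returning a ready
    urllib.request.Request for the tool under test.
    """
    records = []
    for url in urls:
        for attempt in range(ATTEMPTS_PER_URL):
            start = time.perf_counter()
            try:
                with urllib.request.urlopen(build_request(url), timeout=timeout) as resp:
                    body = resp.read()
                ok = True
            except Exception:
                body, ok = b"", False
            latency_ms = (time.perf_counter() - start) * 1000.0
            records.append({
                "tool": tool_name,
                "url": url,
                "attempt": attempt,
                "ok": ok,
                "latency_ms": latency_ms,
                "bytes": len(body),
                # Same hash across both attempts for a URL => deterministic output
                "sha256": hashlib.sha256(body).hexdigest() if ok else None,
            })
    return records
```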

Test URLs included:
  • example.com
  • wikipedia.org/wiki/Markdown
  • docs.python.org/3/tutorial
  • rfc-editor.org/rfc/rfc9110
  • developer.mozilla.org/.../Accept
Tool                  Success Rate    Median Latency   P95 Latency   Determinism
Markdown for Agents   100% (16/16)    145 ms           1,213 ms      87.5% (7/8)
Jina Reader           87.5% (14/16)   2,446 ms         30,715 ms     87.5% (7/8)
Evidence status: insufficient_sample
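
The table metrics reduce to a few aggregations over the per-run records. A minimal sketch, assuming the record format from the harness sketch above; the percentile index and determinism denominator are illustrative assumptions, not the exact formulas used here.

```python
import statistics
from collections import defaultdict

def summarize(records):
    """Reduce per-run records to the metrics reported in the table above."""
    total = len(records)
    ok_runs = [r for r in records if r["ok"]]

    # Success rate over all attempts (e.g. 16/16 or 14/16).
    success_rate = len(ok_runs) / total

    # Latency over all attempts; counting failed attempts is an assumption,
    # since timeouts still cost wall-clock time.
    latencies = sorted(r["latency_ms"] for r in records)
    median_ms = statistics.median(latencies)
    p95_ms = latencies[min(total - 1, round(0.95 * (total - 1)))]

    # Determinism per URL: every successful attempt for a URL must yield
    # the same content hash (e.g. 7 of 8 URLs).
    hashes_by_url = defaultdict(set)
    for r in ok_runs:
        hashes_by_url[r["url"]].add(r["sha256"])
    deterministic = sum(1 for hashes in hashes_by_url.values() if len(hashes) == 1)
    determinism_rate = deterministic / len(hashes_by_url) if hashes_by_url else 0.0

    return {
        "success_rate": success_rate,
        "median_ms": median_ms,
        "p95_ms": p95_ms,
        "determinism_rate": determinism_rate,
    }
```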

Third-Party Signals (Clearly Labeled)

  • Firecrawl: 95.3% success, 16 pages/s, 89.0% RAG Recall@5 (Spider benchmark, Feb 2026)
  • Crawl4AI: 89.7% success, 12 pages/s, 84.5% RAG Recall@5 (Spider benchmark, Feb 2026)
  • Cloudflare Markdown for Agents: No neutral public benchmark found at review time (Cloudflare changelog/docs reviewed Feb 2026)

Third-party numbers are not treated as first-party truth. We include them as directional evidence only.

Sources: Spider benchmark, Jina ReaderLM v2 notes, Cloudflare changelog.