Benchmark Methodology & Evidence
This section exists to avoid shallow comparison content. We publish the exact test setup, what was measured directly, what was not, and where third-party evidence is used.
Last reviewed: February 2026. Metrics can shift as vendors update infrastructure, pricing, and anti-bot behavior.
Methodology (Directly Measured)
We ran 2 attempts per URL across 8 public URLs (16 total runs per tool) from the same environment, measuring success rate, median latency, p95 latency, deterministic output rate (same content hash across runs), and output size. We measured Markdown for Agents and Jina Reader live because both are publicly callable from this environment. Firecrawl and Crawl4AI metrics are labeled as third-party evidence because direct parity testing was not possible in this session.
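As a concrete illustration, the sketch below shows how these metrics can be derived. The URL list and reader endpoint are placeholders, not the actual test inputs: success is a completed fetch, determinism means an identical SHA-256 hash of the returned body across both attempts for a URL, and latency percentiles are taken over all 16 runs per tool.

```python
"""Minimal sketch of the measurement loop, under the assumptions above."""
import hashlib
import statistics
import time
import urllib.request

TEST_URLS = [f"https://example.com/page-{i}" for i in range(1, 9)]  # placeholders for the 8 public URLs
READER_ENDPOINT = "https://reader.example.invalid/"                 # placeholder prefix-style reader endpoint
ATTEMPTS_PER_URL = 2


def fetch(url: str, timeout: float = 60.0):
    """One attempt: returns (latency_ms, sha256 of body or None on failure, body size)."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(READER_ENDPOINT + url, timeout=timeout) as resp:
            body = resp.read()
        return (time.monotonic() - start) * 1000, hashlib.sha256(body).hexdigest(), len(body)
    except Exception:
        return (time.monotonic() - start) * 1000, None, 0


def run_benchmark():
    latencies = []
    per_url = {}  # url -> list of content hashes; None marks a failed attempt
    for url in TEST_URLS:
        for _ in range(ATTEMPTS_PER_URL):
            latency_ms, digest, _size = fetch(url)
            latencies.append(latency_ms)
            per_url.setdefault(url, []).append(digest)

    attempts = [d for digests in per_url.values() for d in digests]
    # A URL counts as deterministic only if every attempt succeeded with the same hash.
    deterministic_urls = sum(
        1 for digests in per_url.values()
        if None not in digests and len(set(digests)) == 1
    )
    return {
        "success_rate": sum(d is not None for d in attempts) / len(attempts),
        "median_ms": statistics.median(latencies),
        "p95_ms": statistics.quantiles(latencies, n=20)[-1],  # 95th-percentile cut point
        "determinism": deterministic_urls / len(TEST_URLS),
    }
```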
| Tool | Success Rate (runs) | Median Latency | P95 Latency | Determinism (URLs) |
|---|---|---|---|---|
| Markdown for Agents | 100% (16/16) | 145 ms | 1,213 ms | 87.5% (7/8) |
| Jina Reader | 87.5% (14/16) | 2,446 ms | 30,715 ms | 87.5% (7/8) |
Third-Party Signals (Clearly Labeled)
- Firecrawl: 95.3% success, 16 pages/s, 89.0% RAG Recall@5 (Spider benchmark, Feb 2026)
- Crawl4AI: 89.7% success, 12 pages/s, 84.5% RAG Recall@5 (Spider benchmark, Feb 2026)
- Cloudflare Markdown for Agents: No neutral public benchmark found at review time (Cloudflare changelog/docs reviewed Feb 2026)
Third-party numbers are not treated as first-party truth. We include them as directional evidence only.
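The RAG Recall@5 figures above report retrieval recall at a cutoff of five results. As a rough illustration only (not the Spider benchmark's exact scoring or chunking), recall@k is commonly computed as the fraction of relevant items that appear among the top k retrieved results:

```python
def recall_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int = 5) -> float:
    """Fraction of relevant items that appear among the top-k retrieved results.

    Illustrative only: per-query averaging and chunk granularity used by the
    Spider benchmark are not reproduced here.
    """
    if not relevant_ids:
        return 0.0
    hits = set(retrieved_ids[:k]) & relevant_ids
    return len(hits) / len(relevant_ids)


# Example: 2 of the 3 relevant chunks surface in the top 5 -> ~0.667
print(recall_at_k(["c4", "c9", "c1", "c7", "c2", "c3"], {"c1", "c2", "c3"}))
```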
Sources: Spider benchmark, Jina ReaderLM v2 notes, Cloudflare changelog.