Benchmark
We benchmarked Dinobase SQL against per-source MCP tools across 11 LLMs and 75 business questions. Same models, same data, same questions — only the data access method differs.
Results
Section titled “Results”| Metric | Dinobase (SQL) | Per-Source MCP |
|---|---|---|
| Accuracy | 91% | 35% |
| Avg latency | 34s | 106s |
| Cost per correct answer | $0.027 | $0.445 |
56 percentage points more accurate, 3x faster, 16x cheaper per correct answer — consistent across all 11 models tested.
Why the gap
Section titled “Why the gap”No cross-source joins. Per-source MCP tools expose one tool per table. Questions that span two SaaS tools — “which customers have open HubSpot deals and failed Stripe charges?” — require an agent to manually correlate two JSON responses. This is error-prone and frequently exceeds context limits. With Dinobase, the agent writes one SQL JOIN.
No semantic metadata. MCP tools return raw data with no column descriptions. Without knowing what fields mean, agents misinterpret units, formulas, and aggregations — producing systematically wrong answers. Dinobase attaches column descriptions extracted from source API schemas.
Pagination overhead. Counting 1,000 records over MCP requires 10 round trips. SQL returns a single aggregate. The latency difference is primarily from models that make many MCP calls before running out of turns or context.
Per-model breakdown
Section titled “Per-model breakdown”| Model | SQL Accuracy | MCP Accuracy | Gap | SQL $/Correct | MCP $/Correct |
|---|---|---|---|---|---|
| Claude Opus 4.6 | 100% | 33% | +67pp | $0.081 | $1.646 |
| Claude Sonnet 4.6 | 100% | 53% | +47pp | $0.046 | $0.661 |
| Claude Haiku 4.5 | 93% | 33% | +60pp | $0.020 | $0.149 |
| DeepSeek V3.2 | 93% | 20% | +73pp | $0.012 | $0.141 |
| GLM-5 Turbo | 93% | 20% | +73pp | $0.012 | $0.824 |
| MiniMax M2.7 | 93% | 27% | +67pp | $0.005 | $0.131 |
| Gemini 3 Flash | 87% | 33% | +53pp | $0.005 | $0.039 |
| Gemini 3.1 Pro | 87% | 40% | +47pp | $0.059 | $0.822 |
| GPT 5.4 | 87% | 40% | +47pp | $0.036 | $0.278 |
| Qwen 3.5 27B | 87% | 33% | +53pp | $0.004 | $0.056 |
| Kimi K2.5 | 80% | 47% | +33pp | $0.013 | $0.084 |
Methodology
Section titled “Methodology”- 75 questions across 3 tiers: simple (single-source counts/filters), semantic (MRR, win rate, CSAT — requires domain knowledge), and cross-source (e.g. require joining HubSpot and Stripe data)
- Scoring: deterministic for ~60% of questions (regex number extraction with tolerance), LLM-as-judge (Claude Haiku 4.5) for the rest
- Cost per correct answer: total API cost divided by number of correct answers — this penalizes approaches that spend tokens on wrong answers
- Total benchmark cost: $29.44 across all 11 models via OpenRouter
Full dataset, questions, ground truth SQL, and scoring details: benchmarks/README.md.