Skip to content

Querying Data

Dinobase uses DuckDB as its query engine. All queries use DuckDB SQL syntax, which is PostgreSQL-compatible with extra analytical functions.

Tables are referenced as schema.table:

SELECT * FROM stripe.customers LIMIT 10
Terminal window
# JSON output (default, agent-friendly)
dinobase query "SELECT * FROM stripe.customers LIMIT 5"
# Human-readable table
dinobase query "SELECT * FROM stripe.customers LIMIT 5" --pretty
# Limit result size
dinobase query "SELECT * FROM hubspot.contacts" --max-rows 1000

Agents use the query tool:

{
"tool": "query",
"arguments": {
"sql": "SELECT * FROM stripe.customers LIMIT 5",
"max_rows": 200
}
}

The core power of Dinobase. Join tables from different sources on shared columns:

SELECT s.email, s.name, h.company
FROM stripe.customers s
JOIN hubspot.contacts h ON s.email = h.email
SELECT
s.email,
h.company,
d.amount,
d.dealstage
FROM stripe.customers s
JOIN hubspot.contacts h ON s.email = h.email
JOIN hubspot.deals d ON h.id = d.contact_id
WHERE d.dealstage = 'closedwon'
ORDER BY d.amount DESC

Use dinobase describe to find columns that work as join keys. Dinobase annotates likely join columns:

Terminal window
dinobase describe stripe.customers --pretty
# Look for: email (marked as "Potential join key across sources")

Common join patterns:

ColumnSourcesNotes
emailMost sourcesBest cross-source join key
*_idWithin a sourceForeign keys (e.g., contact_id)
domainCRM sourcesCompany matching
-- Count customers per company
SELECT h.company, COUNT(DISTINCT s.email) as customers
FROM stripe.customers s
JOIN hubspot.contacts h ON s.email = h.email
GROUP BY h.company
ORDER BY customers DESC
-- Revenue by deal stage
SELECT d.dealstage, SUM(d.amount) as total, COUNT(*) as deals
FROM hubspot.deals d
GROUP BY d.dealstage
ORDER BY total DESC

Use LEFT JOIN to find data in one source but not another:

-- Stripe customers without a HubSpot contact
SELECT s.email, s.name
FROM stripe.customers s
LEFT JOIN hubspot.contacts h ON s.email = h.email
WHERE h.email IS NULL

Stripe stores timestamps as Unix integers. Use DuckDB’s to_timestamp():

SELECT email, to_timestamp(created) as created_date
FROM stripe.customers
WHERE to_timestamp(created) > '2024-01-01'
ORDER BY created DESC

HubSpot uses ISO 8601 strings, which DuckDB handles natively:

SELECT email, createdate
FROM hubspot.contacts
WHERE createdate > '2024-01-01'

DuckDB has powerful analytical functions beyond standard SQL:

SELECT email, amount,
SUM(amount) OVER (ORDER BY created) as running_total
FROM stripe.charges
SELECT company,
LIST(email) as all_emails,
COUNT(*) as contact_count
FROM hubspot.contacts
GROUP BY company
SELECT email FROM stripe.customers
WHERE regexp_matches(email, '.*@gmail\.com$')

DuckDB can also query parquet files inline:

SELECT * FROM read_parquet('path/to/file.parquet') LIMIT 10

By default, queries return up to 200 rows. If there are more, the result includes a truncated flag and total_rows count:

{
"rows": [...],
"row_count": 200,
"total_rows": 1547,
"truncated": true,
"message": "Showing 200 of 1547 rows. Add LIMIT or a WHERE clause to narrow results."
}

Use --max-rows (CLI) or the max_rows parameter (MCP) to adjust, up to 10,000.

UPDATE and INSERT statements sent through query are intercepted and routed to the mutation engine. Instead of executing immediately, they return a preview with a mutation_id. See the Mutations guide for details.

Terminal window
dinobase query "UPDATE stripe.customers SET name = 'Acme' WHERE id = 'cus_123'"
# Returns preview, not execution
dinobase confirm mut_abc123def456
# Now it executes

SQL errors are returned in the result rather than thrown:

{
"error": "Catalog Error: Table with name 'nonexistent' does not exist!"
}

Check for the error key before processing results.