Skip to content

Architecture

Agent (Claude, GPT, etc.)
|
+---------+---------+
| |
MCP Server CLI
(FastMCP) (Click commands)
| |
+---------+---------+
|
QueryEngine
(DuckDB SQL)
|
+------------+------------+
| | |
SyncEngine File views Metadata
(dlt + parquet) (DuckDB) (_dinobase.*)

DuckDB is an embedded analytical database. It:

  • Reads parquet files natively from disk and S3
  • Executes SQL with PostgreSQL-compatible syntax plus analytical extensions
  • Stores metadata in internal tables (_dinobase.*)
  • Creates views over file connectors for zero-copy reads
  • Runs in-process (no separate server)

The .duckdb file stores metadata and view definitions. Synced data lives in parquet files.

dlt (data load tool) handles data ingestion:

  • Verified sources for major SaaS APIs
  • REST API connector via YAML configs (50+ connectors)
  • GraphQL connector with Relay-style pagination
  • sql_database connector for any SQLAlchemy-compatible database
  • filesystem connector for cloud storage
  • Handles pagination, rate limiting, incremental loading
  • Writes to parquet via DuckDB destination

MCP (Model Context Protocol) provides the agent tool-calling interface:

  • FastMCP server with stdio transport
  • Dynamic instructions computed from database state
  • Eight tools: query, list_connectors, describe, confirm, confirm_batch, cancel, refresh, exec_code

The CLI uses Click for command parsing. All data commands output JSON by default for agent consumption.

dinobase/
__init__.py # Package init
__version__.py # Version
cli.py # CLI commands (Click)
config.py # YAML config management
db.py # DinobaseDB (DuckDB wrapper)
query/
engine.py # QueryEngine (execute, list, describe, live fetch)
mutations.py # MutationEngine (UPDATE/INSERT with preview/confirm)
fetch/
client.py # LiveFetchClient (single-record API calls)
sync/
engine.py # SyncEngine (dlt pipeline runner)
scheduler.py # SyncScheduler (concurrent, scheduled)
registry.py # Connector registry (99 entries + parquet/csv)
metadata.py # API metadata extraction (Stripe, HubSpot, Postgres)
yaml_source.py # YAML-to-dlt translation for REST/GraphQL configs
write_client.py # Write-back to upstream APIs
source_config.py # YAML config loader for write endpoints
sources/
parquet.py # File connector handler (views)
graphql.py # GraphQL connector with Relay pagination
apis/ # 54 YAML connector definitions
mcp/
server.py # FastMCP server + tools
__main__.py # MCP entry point
  1. SyncEngine.sync() called with connector name and config
  2. Connector function loaded from registry via get_source()
  3. dlt pipeline created with DuckDB destination
  4. Data loaded to parquet, tables created in connector schema
  5. Metadata extracted from upstream API (OpenAPI, Properties API, pg_catalog)
  6. Annotations stored in _dinobase.columns
  7. Sync logged in _dinobase.sync_log
  1. SyncEngine.sync() called with connector name and config
  2. Connector function loaded from registry via get_source()
  3. dlt pipeline created with filesystem destination (writes parquet to S3/GCS/Azure)
  4. DuckDB views created over cloud parquet: read_parquet('s3://.../*.parquet')
  5. _live_* staging tables created in-memory for mutation overlay
  6. Metadata persisted to cloud as _meta/*.parquet files
  7. On server restart, metadata is loaded from cloud and views are recreated
  1. Schema created in DuckDB
  2. Files resolved (directory scan, glob, S3 URL)
  3. DuckDB view created per file: CREATE VIEW schema.table AS SELECT * FROM read_parquet('...')
  4. Basic metadata inferred from column names
  5. Row counts computed
  1. SQL received via CLI or MCP
  2. QueryEngine.execute() checks if it’s a mutation (UPDATE/INSERT)
  3. Mutations are routed to MutationEngine for preview/confirm flow
  4. For SELECTs, check if it’s a single-record lookup on a stale connector
  5. If stale + ID lookup + YAML config exists: call upstream API directly (live fetch)
  6. Otherwise: run on DuckDB (parquet data)
  7. Results serialized to JSON-safe format with _freshness tag (live or synced)
  8. Truncation applied if over max_rows
  1. Agent sends UPDATE/INSERT via query tool
  2. MutationEngine parses SQL, validates guardrails (no DELETE/DROP)
  3. Engine generates preview (affected rows, per-row diffs)
  4. Preview returned with mutation_id — nothing executed yet
  5. Agent calls confirm(mutation_id) to execute
  6. Engine calls upstream API (write-back) AND updates local data
  7. Everything logged in _dinobase.mutations

The sync scheduler uses a thread pool (ThreadPoolExecutor):

  • Each connector gets its own thread
  • Each thread creates its own DinobaseDB connection (DuckDB supports concurrent readers)
  • A lock prevents scheduling a connector that’s already syncing
  • Default: 10 concurrent workers, configurable via --max-workers

Parquet over direct DB writes: Data stays portable. Switch to cloud storage with --storage s3://... and the same queries work.

DuckDB views for files: Zero-copy reads mean file connectors are instant. No sync step, no data duplication.

Cloud-native storage: When configured with cloud storage, DuckDB runs in-memory and reads parquet files directly from S3/GCS/Azure via the httpfs extension. No local disk needed. Metadata is persisted as small parquet files in a _meta/ prefix.

Registry-based connectors: Adding a connector is a data entry, not a code change. The registry maps connector names to dlt import paths and credential schemas.

Dynamic MCP instructions: The server computes instructions from actual database state. Agents know exactly what data is available without hardcoded docs.