Architecture

Overview

```
        Agent (Claude, GPT, etc.)
                  |
        +---------+---------+
        |                   |
   MCP Server              CLI
   (FastMCP)        (Click commands)
        |                   |
        +---------+---------+
                  |
             QueryEngine
            (DuckDB SQL)
                  |
     +------------+------------+
     |            |            |
 SyncEngine    File views    Metadata
(dlt+parquet)   (DuckDB)   (_dinobase.*)
```

Core components
DuckDB (query engine + storage)
DuckDB is an embedded analytical database. It:
- Reads parquet files natively from disk and S3
- Executes SQL with PostgreSQL-compatible syntax plus analytical extensions
- Stores metadata in internal tables (`_dinobase.*`)
- Creates views over file sources for zero-copy reads
- Runs in-process (no separate server)
The `.duckdb` file stores metadata and view definitions. Synced data lives in parquet files.
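The view-over-parquet pattern can be sketched in SQL. The schema name, file paths, and metadata columns below are illustrative, not the exact definitions dinobase generates; only `read_parquet` and the `_dinobase.columns` table name come from this document:

```sql
-- Synced data stays in parquet on disk; DuckDB stores only the
-- view definition, so reads are zero-copy over the files.
CREATE SCHEMA IF NOT EXISTS stripe;
CREATE OR REPLACE VIEW stripe.charges AS
    SELECT * FROM read_parquet('data/stripe/charges/*.parquet');

-- Column annotations live in internal metadata tables
-- (column names here are hypothetical).
SELECT column_name, description
FROM _dinobase.columns
WHERE table_name = 'charges';
```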
dlt (sync engine)
dlt (data load tool) handles data ingestion:
- Verified sources for major SaaS APIs
- REST API connector via YAML configs (50+ sources)
- GraphQL connector with Relay-style pagination
- `sql_database` connector for any SQLAlchemy-compatible database
- `filesystem` connector for cloud storage
- Handles pagination, rate limiting, incremental loading
- Writes to parquet via DuckDB destination
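As a rough idea of what a YAML-defined REST source looks like, here is a hypothetical fragment. The field names and layout are invented for illustration; the actual schema of dinobase's connector configs may differ:

```yaml
# Hypothetical REST source definition (illustrative only)
name: exampleapi
base_url: https://api.example.com/v1
auth:
  type: bearer
  token_env: EXAMPLEAPI_TOKEN
resources:
  - name: customers
    path: /customers
    pagination:
      type: cursor
      cursor_param: starting_after
    incremental:
      column: updated_at
```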
MCP (agent interface)
MCP (Model Context Protocol) provides the agent tool-calling interface:
- FastMCP server with stdio transport
- Dynamic instructions computed from database state
- Seven tools: `query`, `list_sources`, `describe`, `confirm`, `confirm_batch`, `cancel`, `refresh`
Click (CLI)
The CLI uses Click for command parsing. All data commands output JSON by default for agent consumption.
Module structure
```
dinobase/
  __init__.py          # Package init
  __version__.py       # Version
  cli.py               # CLI commands (Click)
  config.py            # YAML config management
  db.py                # DinobaseDB (DuckDB wrapper)
  query/
    engine.py          # QueryEngine (execute, list, describe, live fetch)
    mutations.py       # MutationEngine (UPDATE/INSERT with preview/confirm)
  fetch/
    client.py          # LiveFetchClient (single-record API calls)
  sync/
    engine.py          # SyncEngine (dlt pipeline runner)
    scheduler.py       # SyncScheduler (concurrent, scheduled)
    registry.py        # Source registry (99 entries + parquet/csv)
    metadata.py        # API metadata extraction (Stripe, HubSpot, Postgres)
    yaml_source.py     # YAML-to-dlt translation for REST/GraphQL configs
    write_client.py    # Write-back to source APIs
    source_config.py   # YAML config loader for write endpoints
  sources/
    parquet.py         # File source handler (views)
    graphql.py         # GraphQL source with Relay pagination
    apis/              # 54 YAML source definitions
  mcp/
    server.py          # FastMCP server + tools
    __main__.py        # MCP entry point
```

Data flow
API source sync (local mode)
- `SyncEngine.sync()` called with source name and config
- Source function loaded from registry via `get_source()`
- dlt pipeline created with DuckDB destination
- Data loaded to parquet, tables created in source schema
- Metadata extracted from source API (OpenAPI, Properties API, pg_catalog)
- Annotations stored in `_dinobase.columns`
- Sync logged in `_dinobase.sync_log`
API source sync (cloud mode)
- `SyncEngine.sync()` called with source name and config
- Source function loaded from registry via `get_source()`
- dlt pipeline created with filesystem destination (writes parquet to S3/GCS/Azure)
- DuckDB views created over cloud parquet: `read_parquet('s3://.../*.parquet')`
- `_live_*` staging tables created in-memory for mutation overlay
- Metadata persisted to cloud as `_meta/*.parquet` files
- On server restart, metadata is loaded from cloud and views are recreated
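In cloud mode the view definitions simply point at object storage instead of local disk. A minimal sketch, assuming a bucket named `my-bucket` (the `INSTALL`/`LOAD httpfs` commands are standard DuckDB; the schema and path are illustrative):

```sql
-- DuckDB runs in-memory and reads parquet straight from S3/GCS/Azure
-- via the httpfs extension; no local disk is needed.
INSTALL httpfs;
LOAD httpfs;

CREATE SCHEMA IF NOT EXISTS stripe;
CREATE OR REPLACE VIEW stripe.charges AS
    SELECT * FROM read_parquet('s3://my-bucket/stripe/charges/*.parquet');
```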
File source add
- Schema created in DuckDB
- Files resolved (directory scan, glob, S3 URL)
- DuckDB view created per file: `CREATE VIEW schema.table AS SELECT * FROM read_parquet('...')`
- Basic metadata inferred from column names
- Row counts computed
Query execution
- SQL received via CLI or MCP
- `QueryEngine.execute()` checks if it's a mutation (UPDATE/INSERT)
- Mutations are routed to `MutationEngine` for preview/confirm flow
- For SELECTs, check if it's a single-record lookup on a stale source
- If stale + ID lookup + YAML config exists: call source API directly (live fetch)
- Otherwise: run on DuckDB (parquet data)
- Results serialized to JSON-safe format with a `_freshness` tag (`live` or `synced`)
- Truncation applied if over `max_rows`
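The routing decision can be sketched as a small function. The regex-based single-record check is a crude stand-in for the real SQL parsing, and `route` is an invented name:

```python
import re

def route(sql: str, *, stale: bool, has_yaml_config: bool) -> str:
    """Decide how a statement executes: "mutation", "live_fetch", or "duckdb"."""
    first = sql.lstrip().split(None, 1)[0].upper()
    if first in ("UPDATE", "INSERT"):
        return "mutation"  # handed to MutationEngine for preview/confirm
    # crude single-record detection: WHERE id = <value>
    single = re.search(r"\bWHERE\s+id\s*=", sql, re.IGNORECASE) is not None
    if stale and single and has_yaml_config:
        return "live_fetch"  # call the source API directly for fresh data
    return "duckdb"  # run against local/cloud parquet
```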
Mutation execution
- Agent sends UPDATE/INSERT via the `query` tool
- `MutationEngine` parses SQL, validates guardrails (no DELETE/DROP)
- Engine generates preview (affected rows, per-row diffs)
- Preview returned with a `mutation_id`; nothing executed yet
- Agent calls `confirm(mutation_id)` to execute
- Engine calls source API (write-back) AND updates local data
- Everything logged in `_dinobase.mutations`
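The preview/confirm flow amounts to a two-phase protocol over a pending-mutation map. A minimal sketch with invented names (`MutationQueue` is not the real class, and the preview inputs are simplified):

```python
import uuid

class MutationQueue:
    """Two-phase mutation flow: preview first, execute only on confirm."""

    def __init__(self):
        self._pending = {}

    def preview(self, sql: str, affected_rows: int) -> dict:
        if sql.lstrip().split(None, 1)[0].upper() in ("DELETE", "DROP"):
            raise ValueError("guardrail: DELETE/DROP are not allowed")
        mutation_id = uuid.uuid4().hex
        self._pending[mutation_id] = sql
        # nothing is executed yet; the agent must confirm explicitly
        return {"mutation_id": mutation_id, "affected_rows": affected_rows}

    def confirm(self, mutation_id: str) -> str:
        sql = self._pending.pop(mutation_id)  # KeyError if unknown/cancelled
        # real engine: write back to the source API, update local data, log
        return sql

    def cancel(self, mutation_id: str) -> None:
        self._pending.pop(mutation_id, None)
```

Keeping execution behind an explicit `confirm(mutation_id)` call is what lets an agent show the per-row diff to a human (or re-check it itself) before anything is written.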
Concurrency model
The sync scheduler uses a thread pool (`ThreadPoolExecutor`):
- Each source gets its own thread
- Each thread creates its own `DinobaseDB` connection (DuckDB supports concurrent readers)
- A lock prevents scheduling a source that's already syncing
- Default: 10 concurrent workers, configurable via `--max-workers`
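The scheduling rules above can be sketched with the standard library. `Scheduler` is a simplified stand-in for `SyncScheduler`, with the per-thread database connection stubbed out behind `sync_fn`:

```python
from concurrent.futures import ThreadPoolExecutor
import threading

class Scheduler:
    """Simplified sync scheduler: thread pool + per-source dedup lock."""

    def __init__(self, sync_fn, max_workers=10):
        self._sync_fn = sync_fn
        self._pool = ThreadPoolExecutor(max_workers=max_workers)
        self._running = set()
        self._lock = threading.Lock()

    def schedule(self, source: str):
        with self._lock:
            if source in self._running:
                return None  # already syncing; skip this tick
            self._running.add(source)
        return self._pool.submit(self._run, source)

    def _run(self, source: str):
        try:
            # the real worker opens its own DinobaseDB connection here
            return self._sync_fn(source)
        finally:
            with self._lock:
                self._running.discard(source)
```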
Design decisions
Parquet over direct DB writes: Data stays portable. Switch to cloud storage with `--storage s3://...` and the same queries work.
DuckDB views for files: Zero-copy reads mean file sources are instant. No sync step, no data duplication.
Cloud-native storage: When configured with cloud storage, DuckDB runs in-memory and reads parquet files directly from S3/GCS/Azure via the `httpfs` extension. No local disk needed. Metadata is persisted as small parquet files in a `_meta/` prefix.
Registry-based sources: Adding a source is a data entry, not a code change. The registry maps source names to dlt import paths and credential schemas.
Dynamic MCP instructions: The server computes instructions from actual database state. Agents know exactly what data is available without hardcoded docs.