Architecture

Overview

                    Agent (Claude, GPT, etc.)
                              |
                    +---------+---------+
                    |                   |
               MCP Server             CLI
               (FastMCP)         (Click commands)
                    |                   |
                    +---------+---------+
                              |
                        QueryEngine
                        (DuckDB SQL)
                              |
                 +------------+------------+
                 |            |            |
           SyncEngine    File views    Metadata
         (dlt + parquet)   (DuckDB)  (_dinobase.*)

Core components

DuckDB (query engine + storage)

DuckDB is an embedded analytical database. It:

Reads parquet files natively from disk and S3
Executes SQL with PostgreSQL-compatible syntax plus analytical extensions
Stores metadata in internal tables (_dinobase.*)
Creates views over file sources for zero-copy reads
Runs in-process (no separate server)

The .duckdb file stores metadata and view definitions. Synced data lives in parquet files.

dlt (sync engine)

dlt (data load tool) handles data ingestion:

Verified sources for major SaaS APIs
REST API connector via YAML configs (50+ sources)
GraphQL connector with Relay-style pagination
sql_database connector for any SQLAlchemy-compatible database
filesystem connector for cloud storage
Handles pagination, rate limiting, and incremental loading
Writes to parquet via DuckDB destination

MCP (agent interface)

MCP (Model Context Protocol) provides the agent tool-calling interface:

FastMCP server with stdio transport
Dynamic instructions computed from database state
Seven tools: query, list_sources, describe, confirm, confirm_batch, cancel, refresh
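As a rough illustration of "dynamic instructions computed from database state", the sketch below renders an instruction string from a mapping of schemas to tables. The function name build_instructions, the input shape, and the wording of the instructions are all assumptions made for this example; the real server reads its state from the database and may phrase things differently.

```python
from typing import Mapping, Sequence


def build_instructions(sources: Mapping[str, Sequence[str]]) -> str:
    """Render agent-facing instructions from the current set of sources.

    `sources` maps a schema name to its table names. Passing it in as a
    plain mapping is an assumption that keeps the sketch self-contained.
    """
    if not sources:
        return "No sources are connected yet. Add a source before querying."

    lines = ["Connected sources (query them as schema.table):"]
    for schema in sorted(sources):
        tables = ", ".join(sorted(sources[schema]))
        lines.append(f"- {schema}: {tables}")
    lines.append("Use describe to inspect a table before writing SQL against it.")
    return "\n".join(lines)


instructions = build_instructions(
    {"stripe": ["customers", "charges"], "hubspot": ["contacts"]}
)
print(instructions)
```

Because the text is rebuilt from live state, an agent connecting after a new source is added sees that source without any documentation change.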

Click (CLI)

The CLI uses Click for command parsing. All data commands output JSON by default for agent consumption.
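Query results often contain values that the standard JSON encoder rejects, such as decimals and timestamps. The following is a minimal sketch of one way to make rows JSON-safe before printing them; the helper name to_json_safe and the chosen string formats are assumptions, not the project's actual serializer.

```python
import json
from datetime import date, datetime
from decimal import Decimal


def to_json_safe(value):
    """Fallback used by json.dumps for types it cannot encode natively."""
    if isinstance(value, (datetime, date)):
        return value.isoformat()
    if isinstance(value, Decimal):
        return str(value)  # keep full precision rather than rounding to float
    if isinstance(value, (bytes, bytearray)):
        return value.hex()
    raise TypeError(f"Cannot serialize {type(value).__name__}")


rows = [{"id": 1, "amount": Decimal("19.99"), "created": datetime(2024, 1, 5, 12, 30)}]
payload = json.dumps({"rows": rows}, default=to_json_safe)
print(payload)
```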

Module structure

dinobase/
  __init__.py              # Package init
  __version__.py           # Version
  cli.py                   # CLI commands (Click)
  config.py                # YAML config management
  db.py                    # DinobaseDB (DuckDB wrapper)
  query/
    engine.py              # QueryEngine (execute, list, describe, live fetch)
    mutations.py           # MutationEngine (UPDATE/INSERT with preview/confirm)
  fetch/
    client.py              # LiveFetchClient (single-record API calls)
  sync/
    engine.py              # SyncEngine (dlt pipeline runner)
    scheduler.py           # SyncScheduler (concurrent, scheduled)
    registry.py            # Source registry (99 entries + parquet/csv)
    metadata.py            # API metadata extraction (Stripe, HubSpot, Postgres)
    yaml_source.py         # YAML-to-dlt translation for REST/GraphQL configs
    write_client.py        # Write-back to source APIs
    source_config.py       # YAML config loader for write endpoints
    sources/
      parquet.py           # File source handler (views)
      graphql.py           # GraphQL source with Relay pagination
      apis/                # 54 YAML source definitions
  mcp/
    server.py              # FastMCP server + tools
    __main__.py            # MCP entry point

Data flow

API source sync (local mode)

SyncEngine.sync() called with source name and config
Source function loaded from registry via get_source()
dlt pipeline created with DuckDB destination
Data loaded to parquet, tables created in source schema
Metadata extracted from source API (OpenAPI, Properties API, pg_catalog)
Annotations stored in _dinobase.columns
Sync logged in _dinobase.sync_log

API source sync (cloud mode)

SyncEngine.sync() called with source name and config
Source function loaded from registry via get_source()
dlt pipeline created with filesystem destination (writes parquet to S3/GCS/Azure)
DuckDB views created over cloud parquet: read_parquet('s3://.../*.parquet')
_live_* staging tables created in-memory for mutation overlay
Metadata persisted to cloud as _meta/*.parquet files
On server restart, metadata is loaded from cloud and views are recreated

File source add

Schema created in DuckDB
Files resolved (directory scan, glob, S3 URL)
DuckDB view created per file: CREATE VIEW schema.table AS SELECT * FROM read_parquet('...')
Basic metadata inferred from column names
Row counts computed
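The resolve-then-create-view steps above can be sketched with the standard library alone. This example only builds the SQL strings; it does not execute them. The function name view_statements, the quoting style, and the rule that the file stem becomes the view name are assumptions for illustration.

```python
import tempfile
from pathlib import Path


def view_statements(schema: str, directory: Path) -> list[str]:
    """Build one CREATE VIEW statement per parquet file found in `directory`."""
    statements = [f'CREATE SCHEMA IF NOT EXISTS "{schema}"']
    for path in sorted(directory.glob("*.parquet")):
        table = path.stem  # file name without extension becomes the view name
        escaped = path.as_posix().replace("'", "''")  # escape quotes in the path
        statements.append(
            f'CREATE VIEW "{schema}"."{table}" AS '
            f"SELECT * FROM read_parquet('{escaped}')"
        )
    return statements


# Demo: empty placeholder files are enough, since nothing is read here.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for name in ("orders.parquet", "customers.parquet", "notes.txt"):
        (root / name).touch()
    statements = view_statements("files", root)

for statement in statements:
    print(statement)
```

Each statement would then be run on the DuckDB connection. Since a view only stores the query, no data is copied when the source is added.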

Query execution

SQL received via CLI or MCP
QueryEngine.execute() checks if it’s a mutation (UPDATE/INSERT)
Mutations are routed to MutationEngine for preview/confirm flow
For SELECTs, the engine checks whether the query is a single-record lookup on a stale source
If the source is stale, the query is an ID lookup, and a YAML config exists: call the source API directly (live fetch)
Otherwise: run the query on DuckDB (parquet data)
Results serialized to JSON-safe format with _freshness tag (live or synced)
Truncation applied if over max_rows
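The routing decision above can be expressed as a small pure function. The sketch below is a simplification under stated assumptions: the regular expressions, the SourceState fields, and the result keys rows, row_count, and truncated are hypothetical. Only the _freshness tag, its live and synced values, and max_rows come from the description above.

```python
import re
from dataclasses import dataclass

MUTATION = re.compile(r"^\s*(UPDATE|INSERT)\b", re.IGNORECASE)
# Hypothetical notion of a single-record lookup: SELECT ... WHERE id = <literal>
ID_LOOKUP = re.compile(
    r"^\s*SELECT\b.*\bWHERE\s+id\s*=\s*('[^']*'|\d+)\s*;?\s*$",
    re.IGNORECASE | re.DOTALL,
)


@dataclass
class SourceState:
    is_stale: bool         # last sync is older than the freshness threshold
    has_yaml_config: bool  # a YAML API definition exists for this source


def route(sql: str, source: SourceState) -> str:
    """Return which path handles the statement: 'mutation', 'live', or 'duckdb'."""
    if MUTATION.match(sql):
        return "mutation"
    if source.is_stale and source.has_yaml_config and ID_LOOKUP.match(sql):
        return "live"
    return "duckdb"


def package(rows: list, freshness: str, max_rows: int) -> dict:
    """Attach the freshness tag and truncate the result to max_rows."""
    return {
        "rows": rows[:max_rows],
        "row_count": len(rows),
        "truncated": len(rows) > max_rows,
        "_freshness": freshness,
    }


stale = SourceState(is_stale=True, has_yaml_config=True)
fresh = SourceState(is_stale=False, has_yaml_config=True)
lookup = "SELECT * FROM stripe.customers WHERE id = 'cus_123'"
print(route(lookup, stale), route(lookup, fresh))
result = package([{"id": 1}, {"id": 2}, {"id": 3}], "synced", max_rows=2)
```

A production engine would parse the SQL rather than pattern-match it; the regexes here only keep the example short.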

Mutation execution

Agent sends UPDATE/INSERT via query tool
MutationEngine parses SQL, validates guardrails (no DELETE/DROP)
Engine generates preview (affected rows, per-row diffs)
Preview returned with mutation_id — nothing executed yet
Agent calls confirm(mutation_id) to execute
Engine calls the source API (write-back) and updates local data
Everything logged in _dinobase.mutations
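The preview/confirm handshake can be illustrated with an in-memory stand-in. In this sketch the class name PendingMutations, the dict-based table, and the diff format are assumptions. SQL parsing and the write-back call to the source API are omitted.

```python
import uuid


class PendingMutations:
    """Hold previews until they are confirmed or cancelled."""

    def __init__(self, table: dict):
        self.table = table  # stand-in for local data, keyed by row id
        self.pending = {}   # mutation_id -> planned change
        self.log = []       # stand-in for _dinobase.mutations

    def preview(self, row_ids: list, changes: dict) -> dict:
        """Compute per-row diffs without touching any data."""
        diffs = []
        for row_id in row_ids:
            before = self.table[row_id]
            diff = {
                key: {"old": before.get(key), "new": new}
                for key, new in changes.items()
                if before.get(key) != new
            }
            diffs.append({"id": row_id, "diff": diff})
        mutation_id = uuid.uuid4().hex
        self.pending[mutation_id] = {"row_ids": row_ids, "changes": changes}
        return {"mutation_id": mutation_id, "affected_rows": len(row_ids), "diffs": diffs}

    def confirm(self, mutation_id: str) -> int:
        """Apply a previewed mutation exactly once."""
        plan = self.pending.pop(mutation_id)  # KeyError if unknown or already used
        for row_id in plan["row_ids"]:
            # A real engine would call the source API here as well.
            self.table[row_id].update(plan["changes"])
        self.log.append({"mutation_id": mutation_id, "status": "confirmed"})
        return len(plan["row_ids"])

    def cancel(self, mutation_id: str) -> None:
        """Discard a preview without applying it."""
        self.pending.pop(mutation_id, None)
        self.log.append({"mutation_id": mutation_id, "status": "cancelled"})


store = PendingMutations({1: {"name": "Ada", "plan": "free"}, 2: {"name": "Bo", "plan": "pro"}})
preview = store.preview([1, 2], {"plan": "pro"})
plan_before_confirm = store.table[1]["plan"]  # preview alone changes nothing
applied = store.confirm(preview["mutation_id"])
```

Popping the plan on confirm means a mutation_id cannot be replayed, which is one simple way to keep a confirmation from being applied twice.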

Concurrency model

The sync scheduler uses a thread pool (ThreadPoolExecutor):

Each source sync runs on its own worker thread
Each thread creates its own DinobaseDB connection (DuckDB supports concurrent readers)
A lock prevents scheduling a source that’s already syncing
Default: 10 concurrent workers, configurable via --max-workers
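A minimal sketch of this model, assuming a hypothetical Scheduler class and a sync_fn callback in place of the real SyncScheduler and SyncEngine. A lock guards the set of in-flight source names, so a second request for a source that is still syncing is rejected.

```python
import threading
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable, Optional


class Scheduler:
    def __init__(self, sync_fn: Callable[[str], str], max_workers: int = 10):
        self._sync_fn = sync_fn
        self._pool = ThreadPoolExecutor(max_workers=max_workers)
        self._lock = threading.Lock()
        self._running = set()  # names of sources currently syncing

    def submit(self, source: str) -> Optional[Future]:
        """Schedule a sync unless one is already running for this source."""
        with self._lock:
            if source in self._running:
                return None
            self._running.add(source)
        return self._pool.submit(self._run, source)

    def _run(self, source: str) -> str:
        try:
            # Each worker would open its own database connection here.
            return self._sync_fn(source)
        finally:
            with self._lock:
                self._running.discard(source)

    def shutdown(self) -> None:
        self._pool.shutdown(wait=True)


started = threading.Event()
release = threading.Event()


def fake_sync(source: str) -> str:
    started.set()
    release.wait(timeout=5)  # hold the worker so the duplicate check is observable
    return f"{source}: ok"


scheduler = Scheduler(fake_sync, max_workers=2)
first = scheduler.submit("stripe")
started.wait(timeout=5)
duplicate = scheduler.submit("stripe")  # rejected: already syncing
other = scheduler.submit("hubspot")     # a different source runs in parallel
release.set()
results = sorted(f.result(timeout=5) for f in (first, other))
resubmit = scheduler.submit("stripe")   # accepted again once the first run ended
scheduler.shutdown()
```

Releasing the source name in a finally block means a failed sync does not leave the source permanently marked as running.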

Design decisions

Parquet over direct DB writes: Data stays portable. Switch to cloud storage with --storage s3://... and the same queries work.

DuckDB views for files: A view reads the files in place, so a file source is queryable as soon as it is added. No sync step, no data duplication.

Cloud-native storage: When configured with cloud storage, DuckDB runs in-memory and reads parquet files directly from S3/GCS/Azure via the httpfs extension. No local disk needed. Metadata is persisted as small parquet files in a _meta/ prefix.

Registry-based sources: Adding a source is a data entry, not a code change. The registry maps source names to dlt import paths and credential schemas.
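To make the "data entry, not a code change" idea concrete, here is a sketch of a registry that maps names to import paths and credential schemas. The entry format, the module:attribute path syntax, and the example entries are assumptions, and stdlib callables stand in for dlt source functions. get_source is named in the data flow above, but its signature here is a guess.

```python
import importlib
from dataclasses import dataclass, field


@dataclass(frozen=True)
class RegistryEntry:
    import_path: str  # "package.module:callable"
    credentials: dict = field(default_factory=dict)  # credential name -> description


# Hypothetical entries: adding a source means adding one more line here.
REGISTRY = {
    "example_json": RegistryEntry("json:loads", {"api_key": "API key"}),
    "example_csv": RegistryEntry("csv:reader", {}),
}


def get_source(name: str):
    """Resolve a registry entry to the callable it names."""
    try:
        entry = REGISTRY[name]
    except KeyError:
        raise ValueError(f"Unknown source: {name}") from None
    module_name, _, attr = entry.import_path.partition(":")
    return getattr(importlib.import_module(module_name), attr)


def missing_credentials(name: str, provided: dict) -> list:
    """List credential names the entry requires that were not provided."""
    return sorted(set(REGISTRY[name].credentials) - set(provided))


loader = get_source("example_json")
parsed = loader('{"ok": true}')
missing = missing_credentials("example_json", {})
print(parsed, missing)
```

Importing lazily at lookup time means a source's dependencies only need to be installed when that source is actually used.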

Dynamic MCP instructions: The server computes instructions from actual database state. Agents know exactly what data is available without hardcoded docs.