Semantic Layer & Annotations
Agents need to understand data, not just query it. Dinobase builds a semantic layer on top of synced data — table descriptions, column documentation, PII flags, and relationship graphs — so agents can write correct SQL without hallucinating field meanings or missing join paths.
How it works
Section titled “How it works”When you sync a source, Dinobase automatically:
- Loads data via dlt
- Fetches column-level metadata from the source API (descriptions, enum values, format hints)
- Detects foreign-key relationships from column name patterns (
*_id→ matching table) - If
ANTHROPIC_API_KEYis set, runs a background Claude agent to fill remaining gaps with table descriptions, column docs, and PII flags - Stores everything in
_dinobase.{tables,columns,relationships,metadata}and surfaces it throughdescribe
The agent runs in the background — sync finishes immediately.
Automatic relationship detection
Section titled “Automatic relationship detection”After every sync, Dinobase scans column names and wires up relationships without any configuration:
customer_idinsubscriptions→customers.idplan_idinsubscriptions→plans.id_dlt_parent_idin nested tables → parent table’s_dlt_id
These relationships appear in describe as related_tables, so agents know exactly how to join across tables.
Auto-annotation with Claude
Section titled “Auto-annotation with Claude”Set ANTHROPIC_API_KEY and Dinobase will run a Claude agent after each sync to annotate everything the heuristics can’t:
export ANTHROPIC_API_KEY=sk-ant-...dinobase sync stripe# [semantic-agent] stripe: heuristic detected 3 relationship(s)# [semantic-agent] stripe: running Claude annotation...# [semantic-agent] stripe: annotation completeClaude audits what’s already annotated, then fills only the gaps:
- Table descriptions (“All Stripe customer accounts”)
- Column descriptions (“Customer’s billing email address”)
- PII flags on email, name, phone, IP, and user-identifying fields
- Any relationships the heuristic missed
Already-annotated sources are skipped automatically — re-syncing the same source doesn’t re-annotate.
The daemon uses a built-in agent loop that calls the Anthropic API directly — no claude CLI required. This makes it work equally in local dev, Docker containers, and cloud workers.
For manual annotation in Claude Code, use the skill: /indexing-semantic-layer stripe. The skill and the daemon implement the same logic; the daemon just runs it in-process without spawning a subagent subprocess.
Disable auto-annotation
Section titled “Disable auto-annotation”export DINOBASE_AUTO_ANNOTATE=falsedinobase sync stripe # fast — no annotation stepManual annotation
Section titled “Manual annotation”Annotate any table or column yourself:
# Table descriptiondinobase annotate stripe.customers description "All Stripe customer accounts"
# Column descriptiondinobase annotate stripe.customers.email description "Customer's billing email"
# PII flagdinobase annotate stripe.customers.email pii true
# Custom owner tagdinobase annotate stripe.customers owner "billing-team"
# Relationshipdinobase annotate stripe.subscriptions customer_id stripe.customers id \ --cardinality one_to_many \ --description "Each subscription belongs to one customer"Or pass a JSON array to annotate many things at once:
dinobase annotate '[ {"target": "stripe.customers", "key": "description", "value": "All Stripe customer accounts"}, {"target": "stripe.customers.email", "key": "pii", "value": "true"}, {"from_table": "stripe.subscriptions", "from_column": "customer_id", "to_table": "stripe.customers", "to_column": "id", "cardinality": "one_to_many", "description": "Each subscription belongs to one customer"}]'Run dinobase annotate --input-schema to see the full JSON schema.
Source-specific metadata (from API)
Section titled “Source-specific metadata (from API)”Dinobase also fetches structured metadata from source APIs at sync time:
Stripe
Section titled “Stripe”Source: Stripe OpenAPI spec
Extracts:
- Field descriptions (e.g., “Unique identifier for the object”)
- Type annotations (e.g.,
unix-timeformat) - Enum values (e.g., subscription status:
active,past_due,canceled)
Example from dinobase describe stripe.customers --pretty:
stripe.customers (180 rows)Description: All Stripe customer accounts
id VARCHAR -- Unique identifier for the object. email VARCHAR -- Customer's billing email address. [PII] Can be null created INTEGER -- Time at which the object was created. Unix timestamp. Use to_timestamp() to convert. currency VARCHAR -- Three-letter ISO code for the currency delinquent BOOLEAN -- Whether the customer has an overdue invoice.
Related tables: stripe.subscriptions (customer_id → id, one_to_many) stripe.charges (customer_id → id, one_to_many)HubSpot
Section titled “HubSpot”Source: HubSpot Properties API
Fetches live property metadata from your portal:
- Property labels and descriptions
- Enum options with human-readable labels
- Custom properties and calculated fields
PostgreSQL
Section titled “PostgreSQL”Source: pg_catalog system tables
- Column comments (
COMMENT ON COLUMN ...) - Foreign key constraints → automatically converted to relationships
Querying the semantic layer
Section titled “Querying the semantic layer”Inspect annotations directly with SQL:
# Tables with and without descriptionsdinobase query " SELECT table_name, row_count, CASE WHEN description IS NOT NULL THEN '✓' ELSE '✗' END as described FROM _dinobase.tables WHERE schema_name = 'stripe' ORDER BY row_count DESC" --pretty
# Relationship graphdinobase query " SELECT from_table, from_column, to_table, to_column, cardinality FROM _dinobase.relationships WHERE from_schema = 'stripe'" --pretty
# PII columnsdinobase query " SELECT schema_name, table_name, column_name FROM _dinobase.metadata WHERE key = 'pii' AND value = 'true'" --prettyHow agents use the semantic layer
Section titled “How agents use the semantic layer”When an agent calls describe, it gets the full semantic context:
{ "schema": "stripe", "table": "subscriptions", "description": "Active and historical customer subscriptions", "columns": [ { "name": "customer_id", "type": "VARCHAR", "description": "References customers.id", "nullable": true }, { "name": "status", "type": "VARCHAR", "description": "Subscription lifecycle state", "note": "Values: active, past_due, canceled, trialing, unpaid" } ], "related_tables": [ { "table": "stripe.customers", "join": "ON subscriptions.customer_id = customers.id", "cardinality": "many_to_one" } ]}The agent knows the join path, the column semantics, and the enum values — no hallucinating, no guessing.