
File Sources

File sources are the simplest way to get data into Dinobase. No sync needed — DuckDB reads files directly at query time through views.

# Directory of parquet files
dinobase add parquet --path ./data/events/ --name analytics
# Single file
dinobase add parquet --path ./data/export.parquet --name export
# Glob pattern
dinobase add parquet --path "./data/*.parquet" --name data
# S3
dinobase add parquet --path s3://your-bucket/exports/ --name warehouse
# GCS
dinobase add parquet --path gs://your-bucket/data/ --name warehouse
# Directory of CSV files
dinobase add csv --path ./exports/ --name reports
# Single CSV
dinobase add csv --path ./data/customers.csv --name customers

When you add a file source, Dinobase:

  1. Creates a schema with the name you provide
  2. Scans the path for matching files
  3. Creates a DuckDB view for each file
  4. Points each view at its file directly, so no data is copied
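The generated view is plain DuckDB DDL over `read_parquet`. A minimal sketch of what that DDL looks like for a parquet source named analytics — the schema and view names here are illustrative, not Dinobase's exact output:

```shell
# Build the kind of DDL described above; read_parquet accepts globs,
# so a single view can cover a whole directory of files.
path='./data/events/*.parquet'
ddl="CREATE SCHEMA IF NOT EXISTS analytics;
CREATE VIEW analytics.events AS SELECT * FROM read_parquet('$path');"
# Pipe this into the duckdb CLI to reproduce the setup by hand.
echo "$ddl"
```

Because the view only stores the path, dropping or recreating it never touches the underlying files.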

Each file becomes a table named after its filename:

File                 Table name
customers.parquet    customers
event-log.parquet    event_log
2024_orders.csv      2024_orders
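The mapping above can be reproduced with plain shell string operations. A sketch, assuming the only normalization is dropping the directory and extension and replacing hyphens with underscores (inferred from the table, not from Dinobase's source):

```shell
# Derive a table name from a file path, mirroring the table above.
to_table_name() {
  base="${1##*/}"               # drop directory: event-log.parquet
  name="${base%.*}"             # drop extension: event-log
  printf '%s\n' "${name//-/_}"  # hyphens to underscores: event_log
}
to_table_name ./data/event-log.parquet   # prints: event_log
```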

How the path is interpreted depends on its form:

Input                                 Behavior
Directory (./data/)                   Finds all matching files, including subdirectories
Single file (./data/events.parquet)   Creates one table
Glob pattern (./data/*.parquet)       Matches files by pattern
S3 URL (s3://bucket/prefix/)          DuckDB reads from S3 at query time
GCS URL (gs://bucket/prefix/)         DuckDB reads from GCS at query time

For directories, Dinobase first looks for files in the top level, then searches recursively if none are found.
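That lookup order can be sketched with `find`; this is an assumption about the behavior, not Dinobase's actual implementation:

```shell
# Match parquet files in the top level of a directory first;
# only descend into subdirectories if nothing matched there.
find_parquet() {
  files=$(find "$1" -maxdepth 1 -type f -name '*.parquet')
  if [ -z "$files" ]; then
    files=$(find "$1" -type f -name '*.parquet')
  fi
  printf '%s\n' "$files"
}
```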

DuckDB can read parquet files directly from S3 and GCS. The data is fetched at query time — nothing is stored locally.

dinobase add parquet --path s3://my-data-lake/exports/ --name lake
dinobase query "SELECT COUNT(*) FROM lake.events" --pretty

For S3, DuckDB uses your AWS credentials from the environment (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY) or from ~/.aws/credentials.
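For example, exporting credentials in the shell before adding the source (the values below are placeholders; whether a region variable is needed depends on your DuckDB setup):

```shell
# DuckDB picks these up from the environment at query time.
export AWS_ACCESS_KEY_ID="your-key-id"
export AWS_SECRET_ACCESS_KEY="your-secret-key"
# Optional: set the region if the bucket is not in your default one.
export AWS_DEFAULT_REGION="us-east-1"
```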

File sources get basic metadata inferred from column names:

  • id columns are marked as primary keys
  • *_id columns are annotated as foreign keys
  • email columns are marked as join keys
  • Timestamp columns (created_at, etc.) are labeled
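A hypothetical sketch of those name-based heuristics as a shell case statement; the patterns are inferred from the list above, not taken from Dinobase's source:

```shell
# Classify a column by name, mirroring the inference rules above.
classify_column() {
  case "$1" in
    id)                     echo "primary key" ;;
    *_id)                   echo "foreign key" ;;
    email)                  echo "join key" ;;
    created_at|updated_at)  echo "timestamp" ;;
    *)                      echo "plain column" ;;
  esac
}
classify_column user_id   # prints: foreign key
```

Note that order matters: the bare `id` pattern must come before `*_id`, or primary keys would be tagged as foreign keys.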
dinobase describe analytics.events --pretty

File sources never appear in dinobase sync output. Because the files are read at query time, queries always reflect their current contents:

# Update the file
cp new_data.parquet ./data/events.parquet
# Query reflects the new data immediately
dinobase query "SELECT COUNT(*) FROM analytics.events"

When to use file sources vs cloud storage sync

                 File sources                                  Cloud storage sync
Command          dinobase add parquet --path ...               dinobase add s3 --bucket-url ...
Sync             None; reads at query time                     dlt syncs incrementally
Best for         Static exports, local files, S3/GCS parquet   Incrementally growing file sets
Data location    Stays in place                                Copied to ~/.dinobase/