# File Sources
File sources are the simplest way to get data into Dinobase. No sync needed — DuckDB reads files directly at query time through views.
## Adding file sources

### Parquet files

```sh
# Directory of parquet files
dinobase add parquet --path ./data/events/ --name analytics

# Single file
dinobase add parquet --path ./data/export.parquet --name export

# Glob pattern
dinobase add parquet --path "./data/*.parquet" --name data

# S3
dinobase add parquet --path s3://your-bucket/exports/ --name warehouse

# GCS
dinobase add parquet --path gs://your-bucket/data/ --name warehouse
```

### CSV files

```sh
# Directory of CSV files
dinobase add csv --path ./exports/ --name reports

# Single CSV
dinobase add csv --path ./data/customers.csv --name customers
```

## How it works

When you add a file source, Dinobase:
- Creates a schema with the name you provide
- Scans the path for matching files
- Creates a DuckDB view for each file that references the file directly; no data is copied
Each file becomes a table named after its filename:
| File | Table name |
|---|---|
| `customers.parquet` | `customers` |
| `event-log.parquet` | `event_log` |
| `2024_orders.csv` | `2024_orders` |
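A minimal sketch of this behavior, assuming the naming rule shown in the table (file stem with hyphens mapped to underscores) and DuckDB's `read_parquet`/`read_csv_auto` table functions; the helper name and exact SQL are illustrative, not Dinobase internals:

```python
from pathlib import Path

def view_sql(schema: str, file_path: str) -> tuple[str, str]:
    """Derive a table name and a CREATE VIEW statement for one file.

    Illustrative only: mirrors the description above, not Dinobase's code.
    """
    p = Path(file_path)
    table = p.stem.replace("-", "_")  # event-log.parquet -> event_log
    reader = "read_parquet" if p.suffix == ".parquet" else "read_csv_auto"
    sql = f"CREATE VIEW {schema}.{table} AS SELECT * FROM {reader}('{file_path}')"
    return table, sql
```

Because the view wraps a table function rather than copied rows, dropping the source re-reads the file on every query.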
## Path resolution

| Input | Behavior |
|---|---|
| Directory (`./data/`) | Finds all matching files, including subdirectories |
| Single file (`./data/events.parquet`) | Creates one table |
| Glob pattern (`./data/*.parquet`) | Matches files by pattern |
| S3 URL (`s3://bucket/prefix/`) | DuckDB reads from S3 at query time |
| GCS URL (`gs://bucket/prefix/`) | DuckDB reads from GCS at query time |
For directories, Dinobase first looks for files in the top level, then searches recursively if none are found.
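The directory rule above (top level first, recursive fallback only if the top level is empty) could look roughly like this; the function name and the `.parquet` default are assumptions for illustration:

```python
from pathlib import Path

def find_files(path: str, suffix: str = ".parquet") -> list[Path]:
    """Resolve a source path: file as-is, else top-level glob, else recursive.

    Illustrative sketch, not Dinobase's actual implementation.
    """
    p = Path(path)
    if p.is_file():
        return [p]
    files = sorted(p.glob(f"*{suffix}"))       # top-level files first
    if not files:
        files = sorted(p.rglob(f"*{suffix}"))  # fall back to recursive search
    return files
```

Note that once any top-level file matches, subdirectories are not searched at all under this rule.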
## Remote files (S3, GCS)

DuckDB can read parquet files directly from S3 and GCS. The data is fetched at query time; nothing is stored locally.

```sh
dinobase add parquet --path s3://my-data-lake/exports/ --name lake
dinobase query "SELECT COUNT(*) FROM lake.events" --pretty
```

For S3, DuckDB uses your AWS credentials from the environment (`AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`) or from `~/.aws/credentials`.
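For example, one way to supply those credentials for a single shell session (the values below are placeholders, not real keys):

```sh
# Placeholder credentials; substitute your own
export AWS_ACCESS_KEY_ID=AKIA...
export AWS_SECRET_ACCESS_KEY=...
dinobase query "SELECT COUNT(*) FROM lake.events" --pretty
```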
## Schema annotations

File sources get basic metadata inferred from column names:

- `id` columns are marked as primary keys
- `*_id` columns are annotated as foreign keys
- `email` columns are marked as join keys
- Timestamp columns (`created_at`, etc.) are labeled
```sh
dinobase describe analytics.events --pretty
```

## No sync needed

File sources never appear in `dinobase sync` output. They are always read at query time, so the data is always current:

```sh
# Update the file
cp new_data.parquet ./data/events.parquet

# Query reflects the new data immediately
dinobase query "SELECT COUNT(*) FROM analytics.events"
```

## When to use file sources vs cloud storage sync
| | File sources | Cloud storage sync |
|---|---|---|
| Command | `dinobase add parquet --path ...` | `dinobase add s3 --bucket-url ...` |
| Sync | None; reads at query time | dlt syncs incrementally |
| Best for | Static exports, local files, S3/GCS parquet | Incrementally growing file sets |
| Data location | Stays in place | Copied to `~/.dinobase/` |