
Lotad, an Open-Source DuckDB Differ for Data Exploration
lotad
A Python library for tracking data drift between DuckDB databases. Helps identify schema changes, differences in data, and structural modifications between versions. Built as an exploratory tool with minimal setup required. Particularly useful for assessing downstream pipeline impacts.
Features
- Compare schemas and data between DuckDB databases
- Write changes to dedicated tables matching original schemas for easy visualization
- No primary key requirement
- Support for sorting string-encoded and URL-encoded JSON
- Detect missing tables, columns and type mismatches
- Analyze row differences with consistent hashing
- Generate detailed comparison reports
- Configure excluded/included tables with regex support
- Specify excluded columns for each table
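To illustrate why JSON sorting matters for a differ: two values can hold the same JSON document yet differ as strings because of key order or URL encoding. The sketch below is not lotad's implementation. `normalize_json_text` is a hypothetical helper, and it assumes that "sorting" means sorting object keys only (array order is left untouched).

```python
import json
from urllib.parse import unquote


def normalize_json_text(value):
    """Return a canonical string if `value` holds JSON, else `value` unchanged.

    Hypothetical helper: tries the raw string first, then its URL-decoded form.
    """
    if not isinstance(value, str):
        return value
    for candidate in (value, unquote(value)):
        try:
            parsed = json.loads(candidate)
        except ValueError:  # json.JSONDecodeError is a ValueError subclass
            continue
        # Only objects and arrays are rewritten; bare scalars stay as they are.
        if isinstance(parsed, (dict, list)):
            return json.dumps(parsed, sort_keys=True, separators=(",", ":"))
    return value


plain = normalize_json_text('{"b": 1, "a": [2, 1]}')
encoded = normalize_json_text("%7B%22a%22%3A%20%5B2%2C%201%5D%2C%20%22b%22%3A%201%7D")
print(plain == encoded)
```

After normalization, both inputs compare equal, so a key-order or encoding difference alone would not be reported as drift.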
Quick Start
Install
Requires Python 3.12+
pip install lotad
How to use
# Create a config file to quickly re-run the same diff check on 2 databases
lotad setup --config lotad_config.yaml
# To perform the diff check
lotad run --config lotad_config.yaml
# Or you can pass in a subset of the config params directly to the run command.
lotad run --help
Checking results
A DuckDB file is created at the path set in the config. If no path is set, it defaults to drift_analysis.db in the current directory.
For each table with data drift, a corresponding table is created in that DuckDB file. Each generated table contains the combined schema of the two databases plus the following metadata columns generated by lotad:
- observed_in: the db the row was in
- hashed_row: a hash-based representation of the row, excluding ignored columns
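As a rough illustration of how hash-based row comparison can work without a primary key, the sketch below hashes each row (minus ignored columns) and reports rows whose hash appears on only one side. This is an assumption about the general technique, not lotad's actual code. `hash_row`, `diff_rows`, and the `db1`/`db2` labels are hypothetical, and SHA-256 over sorted JSON is just one possible choice of consistent hash.

```python
import hashlib
import json


def hash_row(row, ignored=()):
    """Hash a row (dict of column -> value), skipping ignored columns."""
    kept = {k: v for k, v in row.items() if k not in ignored}
    # sort_keys makes the hash independent of column order.
    payload = json.dumps(kept, sort_keys=True, default=str)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def diff_rows(rows_a, rows_b, ignored=(), names=("db1", "db2")):
    """Return rows present on only one side, tagged with metadata columns.

    Note: this sketch collapses exact duplicate rows within one side.
    """
    hashed_a = {hash_row(r, ignored): r for r in rows_a}
    hashed_b = {hash_row(r, ignored): r for r in rows_b}
    drift = []
    for name, mine, other in (
        (names[0], hashed_a, hashed_b),
        (names[1], hashed_b, hashed_a),
    ):
        for digest, row in mine.items():
            if digest not in other:
                drift.append({**row, "observed_in": name, "hashed_row": digest})
    return drift


old = [{"id": 1, "v": "x", "ts": 10}, {"id": 2, "v": "y", "ts": 11}]
new = [{"id": 1, "v": "x", "ts": 99}, {"id": 2, "v": "z", "ts": 11}]
for entry in diff_rows(old, new, ignored=("ts",)):
    print(entry["observed_in"], entry["id"], entry["v"])
```

Because `ts` is ignored, the first row matches on both sides and only the second row is reported, once per database.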
The following tables are also created and contain summary-level information:
- lotad_db_data_drift_summary
- lotad_missing_table_drift
- lotad_table_schema_drift
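One way to inspect the summary tables from Python is sketched below. It assumes all three tables exist in the output file after a run, and it selects every column because their schemas are not documented here. `read_summary_tables` is a hypothetical helper that works with any connection whose `execute()` result exposes `description` and `fetchall()`, which a `duckdb` connection does.

```python
# Summary table names as listed in this README.
SUMMARY_TABLES = (
    "lotad_db_data_drift_summary",
    "lotad_missing_table_drift",
    "lotad_table_schema_drift",
)


def read_summary_tables(con, tables=SUMMARY_TABLES):
    """Return {table_name: (column_names, rows)} for each summary table."""
    results = {}
    for name in tables:
        cur = con.execute(f'SELECT * FROM "{name}"')
        columns = [col[0] for col in cur.description]
        results[name] = (columns, cur.fetchall())
    return results


# Usage (assumes the duckdb package is installed and a lotad run has finished):
#   import duckdb
#   con = duckdb.connect("drift_analysis.db", read_only=True)
#   for name, (columns, rows) in read_summary_tables(con).items():
#       print(name, columns, len(rows))
```

Opening the file read-only avoids accidentally modifying the results while exploring them.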
License
This project is licensed under the MIT License.