Skip to main content
Version: Next

SQL Queries

Incubating

Important Capabilities

CapabilityStatusNotes
Column-level LineageParsed from SQL queries
Table-Level LineageParsed from SQL queries

This source reads a specifically-formatted JSON file containing SQL queries and parses them to generate lineage.

This file should contain one JSON object per line, with the following fields:

  • query: string - The SQL query to parse.
  • timestamp (optional): number - The timestamp of the query, in seconds since the epoch.
  • user (optional): string - The user who ran the query. This user value will be directly converted into a DataHub user urn.
  • operation_type (optional): string - Platform-specific operation type, used if the operation type can't be parsed.
  • downstream_tables (optional): string[] - Fallback list of tables that the query writes to, used if the query can't be parsed.
  • upstream_tables (optional): string[] - Fallback list of tables the query reads from, used if the query can't be parsed.

CLI based Ingestion

Install the Plugin

pip install 'acryl-datahub[sql-queries]'

Starter Recipe

Check out the following recipe to get started with ingestion! See below for full configuration options.

For general pointers on writing and running a recipe, see our main recipe guide.

datahub_api:  # Only necessary if using a non-DataHub sink, e.g. the file sink
server: http://localhost:8080
timeout_sec: 60
source:
type: sql-queries
config:
platform: "snowflake"
default_db: "SNOWFLAKE"
query_file: "./queries.json"

Config Details

Note that a . is used to denote nested fields in the YAML recipe.

FieldDescription
platform 
string
The platform for which to generate data, e.g. snowflake
query_file 
string
Path to file to ingest
default_db
string
The default database to use for unqualified table names
default_schema
string
The default schema to use for unqualified table names
platform_instance
string
The instance of the platform that all assets produced by this recipe belong to
env
string
The environment that all assets produced by this connector belong to
Default: PROD
usage
BaseUsageConfig
The usage config to use when generating usage statistics
Default: {'bucket_duration': 'DAY', 'end_time': '2023-12-08...
usage.bucket_duration
Enum
Size of the time window to aggregate usage stats.
Default: DAY
usage.end_time
string(date-time)
Latest date of lineage/usage to consider. Default: Current time in UTC
usage.format_sql_queries
boolean
Whether to format sql queries
Default: False
usage.include_operational_stats
boolean
Whether to display operational stats.
Default: True
usage.include_read_operational_stats
boolean
Whether to report read operational stats. Experimental.
Default: False
usage.include_top_n_queries
boolean
Whether to ingest the top_n_queries.
Default: True
usage.start_time
string(date-time)
Earliest date of lineage/usage to consider. Default: Last full day in UTC (or hour, depending on bucket_duration). You can also specify relative time with respect to end_time such as '-7 days' Or '-7d'.
usage.top_n_queries
integer
Number of top queries to save to each table.
Default: 10
usage.user_email_pattern
AllowDenyPattern
regex patterns for user emails to filter in usage.
Default: {'allow': ['.*'], 'deny': [], 'ignoreCase': True}
usage.user_email_pattern.allow
array(string)
usage.user_email_pattern.deny
array(string)
usage.user_email_pattern.ignoreCase
boolean
Whether to ignore case sensitivity during pattern matching.
Default: True

Example Queries File

{"query": "SELECT x FROM my_table", "timestamp": 1689232738.051, "user": "user_a", "downstream_tables": [], "upstream_tables": ["my_database.my_schema.my_table"]}
{"query": "INSERT INTO my_table VALUES (1, 'a')", "timestamp": 1689232737.669, "user": "user_b", "downstream_tables": ["my_database.my_schema.my_table"], "upstream_tables": []}

Note that this is not a valid standard JSON file, but rather a file containing one JSON object per line.

Code Coordinates

  • Class Name: datahub.ingestion.source.sql_queries.SqlQueriesSource
  • Browse on GitHub

Questions

If you've got any questions on configuring ingestion for SQL Queries, feel free to ping us on our Slack.