Handling double query execution and compute costs with Databricks SQL
Hi everyone,
Our team relies heavily on Dataiku integrated with Databricks SQL, and we’ve recently been investigating our compute costs. We noticed a specific behavior with standard SQL Query Recipes that we'd love to get the community's thoughts on.
The Observation
When building a new dataset using a SQL Query Recipe (with "drop and recreate" enabled), we observe via the query logs that our statement is executed twice on the Databricks side:
- First, as a standalone `SELECT` statement, which we assume Dataiku uses for schema detection and validation.
- Immediately after, wrapped within an `INSERT INTO` statement to actually populate the target table.
We noticed that Dataiku does not append a `LIMIT 0` to that initial schema-detection query. As a result, Databricks executes the full query plan, and scans the underlying data, twice.
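A minimal sketch of the pattern we see in the query history (table and column names here are hypothetical, not our actual workload):

```sql
-- 1) Schema-detection pass: the recipe query runs standalone,
--    with no LIMIT 0, so the full plan executes once here.
SELECT customer_id, SUM(amount) AS total_amount
FROM sales
GROUP BY customer_id;

-- 2) Materialization pass: the identical query runs again,
--    wrapped in the INSERT that populates the target table.
INSERT INTO target_table
SELECT customer_id, SUM(amount) AS total_amount
FROM sales
GROUP BY customer_id;
```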
The Cost Impact
Because Databricks (and Spark SQL generally) executes each statement independently, with no result reuse between the two passes, running heavy ELT transformations twice per recipe effectively doubles the warehouse compute time and the associated DBU costs. In a native Databricks environment, this double execution is usually avoided by using a single CREATE TABLE AS SELECT (CTAS) statement.
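For comparison, the native single-pass equivalent would look like this (same hypothetical names as above):

```sql
-- CTAS: the target schema is inferred from the SELECT and the data
-- is written in the same execution, so the plan runs only once.
CREATE OR REPLACE TABLE target_table AS
SELECT customer_id, SUM(amount) AS total_amount
FROM sales
GROUP BY customer_id;
```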
Questions for the Community:
- Workarounds: Has anyone else encountered this and found a seamless workaround? We are currently exploring moving these workloads to SQL Script recipes (using explicit CTAS) or PySpark, but we lose some of the automated schema-sync conveniences of the standard SQL Query Recipe.
- Settings: Is there a specific configuration toggle or advanced JDBC property we might have missed that forces Dataiku to only read metadata or skip the pre-execution check?
- Future Enhancements: Are there any known roadmap items to support native CTAS execution for Databricks SQL connections to avoid this overhead?
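For context on the second question, the cheap metadata-only probe we were expecting to see would wrap the recipe query rather than run it in full. This is a hypothetical sketch of that pattern; our understanding is that Spark's optimizer can collapse a `LIMIT 0` plan to an empty relation without scanning data, though we have not confirmed this holds for every query shape:

```sql
-- Hypothetical schema probe: returns the result-set columns and types
-- while (we believe) letting the optimizer skip the underlying scan.
SELECT * FROM (
  SELECT customer_id, SUM(amount) AS total_amount
  FROM sales
  GROUP BY customer_id
) q
LIMIT 0;
```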
Looking forward to hearing how others are managing Databricks execution costs within Dataiku!
Comments
- Turribeach (Neuron, 2,696 posts): What version of Dataiku are you using? In my tests Dataiku always issued LIMIT 10 on the metadata queries. What is the source of your SQL Query? Is it another Databricks dataset? Is that dataset a single managed or unmanaged Databricks table?