Handling double query execution and compute costs with Databricks SQL
Hi everyone,
Our team relies heavily on Dataiku integrated with Databricks SQL, and we’ve recently been investigating our compute costs. We noticed a specific behavior with standard SQL Query Recipes that we'd love to get the community's thoughts on.
The Observation
When building a new dataset using a SQL Query Recipe (with "drop and recreate" enabled), we observe via the query logs that our statement is executed twice on the Databricks side:
- First, as a standalone `SELECT` statement, which we assume Dataiku uses for schema detection and validation.
- Immediately after, wrapped within an `INSERT INTO` statement to actually populate the target table.
We noticed that Dataiku does not append a `LIMIT 0` to that initial schema-detection query. As a result, Databricks executes the full query plan, and scans the underlying data, twice.
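A minimal sketch of the pattern we see in the query history (table and column names here are hypothetical, not our actual workload):

```sql
-- 1) Schema-detection pass: the recipe query runs standalone,
--    with no LIMIT 0, so the full plan executes once here.
SELECT customer_id, SUM(amount) AS total_amount
FROM sales
GROUP BY customer_id;

-- 2) Materialization pass: the identical query runs again,
--    wrapped in the INSERT that populates the target table.
INSERT INTO target_table
SELECT customer_id, SUM(amount) AS total_amount
FROM sales
GROUP BY customer_id;
```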
The Cost Impact
Because Databricks (and Spark SQL generally) executes each statement independently, with no result reuse between the two passes, running heavy ELT transformations twice per recipe effectively doubles the warehouse compute time and the associated DBU costs. In a native Databricks environment, this double execution is usually avoided by using a single CREATE TABLE AS SELECT (CTAS) statement.
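For comparison, the native single-pass equivalent would look like this (same hypothetical names as above):

```sql
-- CTAS: the target schema is inferred from the SELECT and the data
-- is written in the same execution, so the plan runs only once.
CREATE OR REPLACE TABLE target_table AS
SELECT customer_id, SUM(amount) AS total_amount
FROM sales
GROUP BY customer_id;
```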
Questions for the Community:
- Workarounds: Has anyone else encountered this and found a seamless workaround? We are currently exploring moving these workloads to SQL Script recipes (using explicit CTAS) or PySpark, but we lose some of the automated schema-sync conveniences of the standard SQL Query Recipe.
- Settings: Is there a specific configuration toggle or advanced JDBC property we might have missed that forces Dataiku to only read metadata or skip the pre-execution check?
- Future Enhancements: Are there any known roadmap items to support native CTAS execution for Databricks SQL connections to avoid this overhead?
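For context on the second question, the cheap metadata-only probe we were expecting to see would wrap the recipe query rather than run it in full. This is a hypothetical sketch of that pattern; our understanding is that Spark's optimizer can collapse a `LIMIT 0` plan to an empty relation without scanning data, though we have not confirmed this holds for every query shape:

```sql
-- Hypothetical schema probe: returns the result-set columns and types
-- while (we believe) letting the optimizer skip the underlying scan.
SELECT * FROM (
  SELECT customer_id, SUM(amount) AS total_amount
  FROM sales
  GROUP BY customer_id
) q
LIMIT 0;
```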
Looking forward to hearing how others are managing Databricks execution costs within Dataiku!
Comments
- Turribeach (Neuron, 2,696 posts): What version of Dataiku are you using? In my tests Dataiku always issued LIMIT 10 on the metadata queries. What is the source of your SQL Query? Is it another Databricks dataset? Is that dataset a single managed or unmanaged Databricks table?