Improve Smart Reconstruction of SQL datasets flows

Marlan · July 2021

In situations where the dataset to be built and all dependencies of that dataset are SQL datasets, Smart Reconstruction does not actually rebuild any of the datasets (after the initial build). Per Clément_Stenac's post here, this is because there is no general way to determine if the data in a SQL table has changed.

While this behavior is certainly understandable, it is not typically what the user desires. At least that is our experience. In our use of DSS, flows of all SQL datasets are very common.

We most commonly run into this when creating build steps in Scenarios. Users don't realize (or forget) that they need change the build option from the default "Build required datasets" (equivalent to Smart Reconstruction) to "Force re-build dataset and dependencies".

As a result we have have a number of instances of Scenarios happily running for some time before the developer realizes that none of the tables they thought were getting refreshed actually were NOT getting refreshed. Often it is a customer telling us which is particularly embarrassing.

So the request is to improve the handling of rebuilding of flows with SQL dataset inputs. The goal is to avoid the experience of Scenarios running successfully but none of the datasets are getting updated.

It's not obvious though how to do this. Here are some ideas:

Make "force re-build" rather than "build required" the default in Scenario build steps (Scenarios are where the impacts are most problematic). This is perhaps overkill but it would solve the problem we experience.
As a variation on #1, enable companies to set the default build type for build steps in Scenarios. That way, companies like ours who mostly use SQL datasets can change the default to "force re-build".
Enhance the smart reconstruction / build required functionality to use force rebuild when all dependencies are SQL datasets.
For external SQL datasets, enable user to provide a query that will change when the underlying table is update. This would be equivalent to the SQL query trigger functionality in Scenarios. This could default to a row count (SELECT COUNT(*) FROM TABLE).

Of these options, #3 is my preferred option. This will work for flows that start with a SQL Script recipe (as ours often do since we may reference dozens of tables from our database and creating datasets for all of those would be big pain) rather than datasets pointing to external SQL tables.

Thanks to @sridarvs
for posting about this issue. I had been thinking about it already but seeing another company running into the same issue prompted me to write up the idea.

Thanks for considering this.

Marlan

Nate · July 2021

Many databases provide logs that record writes to tables. Simply scanning the logs for the most recent write transaction timestamp for a given table can provide reasonable default change detection. Then with a custom SQL option for change detection (I believe one already exists in Dataiku for differential ETLs!), more advanced configurations can be provided.

I have always made a habit of force-rebuilding in my automations, but this change detection would provide a much faster alternative to that. If my underlying data hasn't changed, I could skip reloading data. This would allow me to set up a much faster effective ETL frequency and make data for my users far more current.

I have some flows that take more than 80 hours to rebuild. 6-12 hours is common. If only the datasets that had new data were rebuilt, and those were rebuilt when changes are detected automatically, it could prevent the massive tables from moving when they haven't changed, allowing the small ones to be up to date.

Marlan · July 2021

Great point, @natejgardner
. Actually doing a smart reconstruction of SQL dataset flows would be awesome! I was thinking smaller and just focused on avoiding the "seems like refreshes are happening but they actually aren't" issue in my user community. But yeah if a way could be worked out to detect changes in SQL datasets and then do smart reconstruction, that would be great.

Marlan

Marlan · March 2022

Good to hear @ElisaS
- thanks!

Tanguy · October 2022

When one wants to update a flow after changing a few things in a pipeline, the smart build feature gets very handy:

you start from the item you need to update
you launch the smart build
dataiku then takes care of refreshing all the data where changes have occured to propagate them to the desired item

However, one can often be surprised by how far dataiku can go rebuilding upstream (and sometimes to heavy computation areas that can run for a very long term). Moreover, there can be changes that are without effect on the data (e.g. a simple save / commit to a recipe that does not change the output, like adding a newline or a comment, is detected as a change).

In the end, quite often the analyst did not not necessarily need to go that far upstream in the flow or did not want to update the heavy computation areas. This side-effect can considerably slow down the iteration process necessary for quick developement.

It would be great if dataiku offered the possibility for the user to stop a smart build to some items in the flow (without resorting to the "explicit rebuild" option in the advanced settings as one does not know in advance which item will be updated or not). For example, we could imagine that dataiku would highlight the items of the flow that would be updated using the smart build and the analyst could choose where to "block" further upstream builds.

CoreyS · October 2022

Hi @tanguy
while not quite the same idea I wanted to make you aware and connect you with @Marlan
who offered this previously

Tanguy · October 2022

Thank you Corey. Indeed, the post's general idea does follow Marlan's.

The suggested feature ("highlight the items of the flow that would be updated using the smart build and [to be able to] choose where to "block" further upstream builds") is not mentioned in that thread though.

Don't hesitate if you prefer to merge both posts, but I thought it would be interesting to seperate this specific idea to see if it gains traction.

CoreyS · October 2022

Please note a recent Product Idea was merged into this idea with vote totals and comments added to this idea thread.

Improve Smart Reconstruction of SQL datasets flows

In the Backlog · Last Updated July 2021

Comments

Categories

Setup Info

Tags