We're excited to announce that we're launching the second installment of Dataiku Product Days Register Now

Improve Smart Reconstruction of SQL datasets flows

In situations where the dataset to be built and all dependencies of that dataset are SQL datasets, Smart Reconstruction does not actually rebuild any of the datasets (after the initial build). Per Clément_Stenac's post here, this is because there is no general way to determine if the data in a SQL table has changed. 

While this behavior is certainly understandable, it is not typically what the user desires. At least that is our experience. In our use of DSS, flows of all SQL datasets are very common. 

We most commonly run into this when creating build steps in Scenarios. Users don't realize (or forget) that they need change the build option from the default "Build required datasets" (equivalent to Smart Reconstruction) to "Force re-build dataset and dependencies". 

As a result we have have a number of instances of Scenarios happily running for some time before the developer realizes that none of the tables they thought were getting refreshed actually were NOT getting refreshed.  Often it is a customer telling us which is particularly embarrassing.

So the request is to improve the handling of rebuilding of flows with SQL dataset inputs. The goal is to avoid the experience of Scenarios running successfully but none of the datasets are getting updated. 

It's not obvious though how to do this. Here are some ideas:

  1. Make "force re-build" rather than "build required" the default in Scenario build steps (Scenarios are where the impacts are most problematic).  This is perhaps overkill but it would solve the problem we experience.
  2. As a variation on #1, enable companies to set the default build type for build steps in Scenarios. That way, companies like ours who mostly use SQL datasets can change the default to "force re-build".
  3. Enhance the smart reconstruction / build required functionality to use force rebuild when all dependencies are SQL datasets. 
  4. For external SQL datasets, enable user to provide a query that will change when the underlying table is update. This would be equivalent to the SQL query trigger functionality in Scenarios. This could default to a row count (SELECT COUNT(*) FROM TABLE). 

Of these options, #3 is my preferred option. This will work for flows that start with a SQL Script recipe (as ours often do since we may reference dozens of tables from our database and creating datasets for all of those would be big pain) rather than datasets pointing to external SQL tables. 

Thanks to @sridarvs for posting about this issue. I had been thinking about it already but seeing another company running into the same issue prompted me to write up the idea.

Thanks for considering this.


Level 6

Many databases provide logs that record writes to tables. Simply scanning the logs for the most recent write transaction timestamp for a given table can provide reasonable default change detection. Then with a custom SQL option for change detection (I believe one already exists in Dataiku for differential ETLs!), more advanced configurations can be provided.

I have always made a habit of force-rebuilding in my automations, but this change detection would provide a much faster alternative to that. If my underlying data hasn't changed, I could skip reloading data. This would allow me to set up a much faster effective ETL frequency and make data for my users far more current.

I have some flows that take more than 80 hours to rebuild. 6-12 hours is common. If only the datasets that had new data were rebuilt, and those were rebuilt when changes are detected automatically, it could prevent the massive tables from moving when they haven't changed, allowing the small ones to be up to date.




Great point, @natejgardner.  Actually doing a smart reconstruction of SQL dataset flows would be awesome! I was thinking smaller and just focused on avoiding the "seems like refreshes are happening but they actually aren't" issue in my user community. But yeah if a way could be worked out to detect changes in SQL datasets and then do smart reconstruction, that would be great.