Sign up to take part
Registered users can ask their own questions, contribute to discussions, and be part of the Community!
Added on July 19, 2021 5:36PM
Likes: 13
Replies: 7
In situations where the dataset to be built and all dependencies of that dataset are SQL datasets, Smart Reconstruction does not actually rebuild any of the datasets (after the initial build). Per Clément_Stenac's post here, this is because there is no general way to determine if the data in a SQL table has changed.
While this behavior is certainly understandable, it is not typically what the user desires. At least that is our experience. In our use of DSS, flows of all SQL datasets are very common.
We most commonly run into this when creating build steps in Scenarios. Users don't realize (or forget) that they need change the build option from the default "Build required datasets" (equivalent to Smart Reconstruction) to "Force re-build dataset and dependencies".
As a result we have have a number of instances of Scenarios happily running for some time before the developer realizes that none of the tables they thought were getting refreshed actually were NOT getting refreshed. Often it is a customer telling us which is particularly embarrassing.
So the request is to improve the handling of rebuilding of flows with SQL dataset inputs. The goal is to avoid the experience of Scenarios running successfully but none of the datasets are getting updated.
It's not obvious though how to do this. Here are some ideas:
Of these options, #3 is my preferred option. This will work for flows that start with a SQL Script recipe (as ours often do since we may reference dozens of tables from our database and creating datasets for all of those would be big pain) rather than datasets pointing to external SQL tables.
Thanks to @sridarvs
for posting about this issue. I had been thinking about it already but seeing another company running into the same issue prompted me to write up the idea.
Thanks for considering this.
Marlan
Many databases provide logs that record writes to tables. Simply scanning the logs for the most recent write transaction timestamp for a given table can provide reasonable default change detection. Then with a custom SQL option for change detection (I believe one already exists in Dataiku for differential ETLs!), more advanced configurations can be provided.
I have always made a habit of force-rebuilding in my automations, but this change detection would provide a much faster alternative to that. If my underlying data hasn't changed, I could skip reloading data. This would allow me to set up a much faster effective ETL frequency and make data for my users far more current.
I have some flows that take more than 80 hours to rebuild. 6-12 hours is common. If only the datasets that had new data were rebuilt, and those were rebuilt when changes are detected automatically, it could prevent the massive tables from moving when they haven't changed, allowing the small ones to be up to date.
Great point, @natejgardner
. Actually doing a smart reconstruction of SQL dataset flows would be awesome! I was thinking smaller and just focused on avoiding the "seems like refreshes are happening but they actually aren't" issue in my user community. But yeah if a way could be worked out to detect changes in SQL datasets and then do smart reconstruction, that would be great.
Marlan
When one wants to update a flow after changing a few things in a pipeline, the smart build feature gets very handy:
However, one can often be surprised by how far dataiku can go rebuilding upstream (and sometimes to heavy computation areas that can run for a very long term). Moreover, there can be changes that are without effect on the data (e.g. a simple save / commit to a recipe that does not change the output, like adding a newline or a comment, is detected as a change).
In the end, quite often the analyst did not not necessarily need to go that far upstream in the flow or did not want to update the heavy computation areas. This side-effect can considerably slow down the iteration process necessary for quick developement.
It would be great if dataiku offered the possibility for the user to stop a smart build to some items in the flow (without resorting to the "explicit rebuild" option in the advanced settings as one does not know in advance which item will be updated or not). For example, we could imagine that dataiku would highlight the items of the flow that would be updated using the smart build and the analyst could choose where to "block" further upstream builds.
Thank you Corey. Indeed, the post's general idea does follow Marlan's.
The suggested feature ("highlight the items of the flow that would be updated using the smart build and [to be able to] choose where to "block" further upstream builds") is not mentioned in that thread though.
Don't hesitate if you prefer to merge both posts, but I thought it would be interesting to seperate this specific idea to see if it gains traction.
Please note a recent Product Idea was merged into this idea with vote totals and comments added to this idea thread.