
Smart Reconstruction Query

Solved!
sridarvs
Level 1

Hello,

We have a flow that loads data into Snowflake through a sequence of mostly SQL recipes. Each recipe loads data into a temp table, and the final recipe loads data into the target table.

We run this flow in a scenario with the build mode configured as "Build required datasets".

The problem I am seeing is that:

a) As a quick test, I dropped the intermediate table, hoping that DSS would recreate it and rebuild all the datasets. Instead, the job failed, reporting an error on the intermediate table.

b) As another quick test, I deleted the records in the target table, hoping that DSS would refresh the data. The job completed successfully, but no data was loaded into the target.

We then had to change the build mode to "Force rebuild datasets".

I was hoping that DSS Smart Reconstruction would handle points (a) and (b) above. Could someone confirm whether my understanding is wrong?

Regards,

1 Solution
Clément_Stenac
Dataiker

Hi,

There are a number of points here:

 

Managed datasets

Datasets created when creating recipes (i.e. "intermediate" and "output" datasets) are so-called "managed" datasets (https://doc.dataiku.com/dss/latest/concepts/index.html#managed-and-external-datasets). On managed datasets, DSS considers that it "controls" the dataset, and that all operations happen through it. This allows DSS to reason directly about the state of the dataset without having to perform costly queries on data sources.

By dropping the table outside of DSS, you break this assumption: DSS remembers that it built this table and never dropped it, so it considers the table available and ready.

DSS only performs actual data access on the "source" (i.e. "external") datasets, which are deemed outside of its control.

 

Change detection on SQL datasets

Actual change detection is only performed on file-based datasets (i.e. Upload, Filesystem, HDFS, S3, Azure Blob, GCS, FTP, SFTP, ...). DSS performs this change detection by grabbing the complete file listing, including sizes and last-modified dates. If a file has changed, DSS considers that the dataset has changed and propagates the change appropriately.
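As a rough illustration (plain Python, not DSS internals), listing-based change detection boils down to comparing two snapshots of (path, size, last-modified) entries:

```python
# Illustrative sketch of listing-based change detection: a dataset is
# considered changed if any file was added, removed, or modified.
# Each listing is a dict: path -> (size, last_modified).

def listing_changed(previous, current):
    """Return True if the two file listings differ in any way."""
    if previous.keys() != current.keys():
        return True  # a file was added or removed
    # Same paths: compare size and last-modified date per file
    return any(previous[p] != current[p] for p in previous)

old = {"data/part-0.csv": (1024, 1700000000)}
new = {"data/part-0.csv": (2048, 1700003600)}  # same path, new size/mtime
print(listing_changed(old, new))  # True
```

For a SQL table there is no equivalent of this cheap metadata snapshot, which is why the same trick does not carry over.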

On the other hand, for SQL datasets there is no universal way to know whether a table has "changed", so DSS cannot automatically detect changes on them. For incremental reconstruction to ever be useful, DSS assumes that the data has not changed and does not rebuild.

What you may want to do, when you know that the source data has changed, is run only the first recipe from the source dataset in "forced" mode, and then run an incremental ("build required") build from the end: the first "forced" build has modified the first managed dataset, so DSS knows that everything downstream needs to be rebuilt.
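To see why this works, here is a simplified model (plain Python, not DSS internals) of a linear flow where each managed dataset carries a build timestamp and an incremental build rebuilds any dataset older than the one feeding it:

```python
# Simplified model of incremental ("build required") reconstruction on a
# linear chain of managed datasets. DSS reasons from its own recorded build
# state, not from the database, which is why dropping a table or deleting
# rows externally does not trigger a rebuild: the timestamps are unchanged.

def incremental_plan(chain, built_at):
    """Return the datasets an incremental build would rebuild.

    chain: dataset names in flow order; built_at: name -> build time.
    A dataset is rebuilt iff it is older than its upstream dataset.
    """
    plan, upstream = [], 0
    for name in chain:
        if built_at[name] < upstream:
            plan.append(name)
            built_at[name] = upstream  # rebuilt now, inherits upstream time
        upstream = built_at[name]
    return plan

built_at = {"staging": 10, "cleaned": 11, "target": 12}
built_at["staging"] = 20  # forced rebuild of the first recipe's output
print(incremental_plan(["staging", "cleaned", "target"], built_at))
# ['cleaned', 'target']
```

Without the forced first build, every dataset is newer than its upstream, so the incremental plan is empty, which matches the behavior observed in (a) and (b). The dataset names here are hypothetical, for illustration only.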

Note that DSS does not perform "update/upsert" kind of builds. If a dataset has changed, it gets rebuilt. The only exception is partitioned datasets.

Hope this helps,


2 Replies

sridarvs
Level 1
Author

Thanks Clement, 

Very clear.

Regards,
