Dataset virtualization

Hi All,

I am trying to understand how virtualization in DSS works. In the following example, SQL pipelines are enabled and virtualization is allowed for 'split_1' and 'split_2'. When building 'stacked' with smart reconstruction, 'split_1' and 'split_2' remain unbuilt (virtualized) as expected.

Screenshot 2023-01-25 200600.png

However, in the next example, 'split_2' is created explicitly when building 'split_1_prepared' with smart reconstruction. Is this a bug or expected behavior? And if this is expected behavior, why?

Screenshot 2023-01-25 201513.png

Operating system used: CentOS

yashpuranik
SarinaS
Dataiker

Hi @yashpuranik,

Thank you for your query and examples! Indeed, this is expected behavior.

When using SQL pipelines, the initial "input" datasets and the final "output" datasets cannot be virtualized. A dataset can only be virtualized if it is intermediate, i.e. it sits in the flow between two other datasets. Virtualization usually only makes sense when the data will eventually be used in a final dataset that is actually written, so the assumption is that all output datasets do need to be created. Rebuilding the flow reruns the split recipe, so every output dataset of the split recipe is created; since "split_2" is then a final output rather than an intermediate dataset, it gets built. If you truly aren't going to use the data in "split_2", then selecting the split option "drop data" instead of creating the dataset "split_2" would probably be the best option.
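
To make the rule concrete, here is a minimal toy model in plain Python. It is not the DSS API and not how DSS decides internally; it only encodes the statement above that a dataset is a candidate for virtualization when it is both written by one recipe and read by another. The input dataset name `source`, the function `virtualizable`, and the recipe shapes (a split feeding a stack in the first example, a split feeding a prepare step in the second) are assumptions made for illustration.

```python
from typing import Dict, List, Set

# A recipe is modelled as {"inputs": [...], "outputs": [...]} (hypothetical shape).
Recipe = Dict[str, List[str]]


def virtualizable(recipes: List[Recipe], allowed: Set[str]) -> Set[str]:
    """Return the datasets that this toy model would leave unbuilt."""
    consumed = {name for recipe in recipes for name in recipe["inputs"]}
    produced = {name for recipe in recipes for name in recipe["outputs"]}
    # Intermediate = written by one recipe and read by another.
    intermediate = consumed & produced
    # Only datasets explicitly flagged as "virtualization allowed" qualify.
    return intermediate & allowed


split = {"inputs": ["source"], "outputs": ["split_1", "split_2"]}
stack = {"inputs": ["split_1", "split_2"], "outputs": ["stacked"]}
prepare = {"inputs": ["split_1"], "outputs": ["split_1_prepared"]}

allowed = {"split_1", "split_2"}
first_example = virtualizable([split, stack], allowed)
second_example = virtualizable([split, prepare], allowed)

print(sorted(first_example))   # ['split_1', 'split_2']
print(sorted(second_example))  # ['split_1']
```

In the second flow nothing reads "split_2", so in this model it counts as a final output and has to be written, which matches what you observed.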

Let me know if that doesn't make sense or if you have any additional questions. 

Thanks,
Sarina

yashpuranik
Author

Thanks @SarinaS! This makes sense to me.
