Dataset virtualization

yashpuranik · January 2023

Hi All,

I am trying to understand how virtualization in DSS works. In the following example, SQL pipelines are enabled and virtualization is allowed for 'split_1' and 'split_2'. When building 'stacked' with smart reconstruction, 'split_1' and 'split_2' remain unbuilt (virtualized) as expected.

Screenshot 2023-01-25 200600.png

However, in the next example, 'split_2' is created explicitly when building 'split_1_prepared' with smart reconstruction. Is this a bug or expected behavior? And if this is expected behavior, why?

Screenshot 2023-01-25 201513.png

Operating system used: CentOS

Sarina · January 2023

Hi @yashpuranik,

Thank you for your query and examples! Indeed, this is expected behavior.

When using SQL pipelines, the initial "input" datasets and the final "output" datasets cannot be virtualized. Only an intermediate dataset (i.e. one that sits in the flow between two other datasets) can be virtualized. Virtualization usually only makes sense when the data will eventually be written to a final dataset further downstream, so the assumption is that all output datasets do need to be created. Rebuilding the flow reruns the split recipe, so every output dataset of the split recipe gets created. In your second example, "split_2" has nothing downstream of it, so it is treated as an output dataset and is built. If you truly weren't going to use the data in "split_2", then selecting the split option "drop data" instead of creating the dataset "split_2" would probably be the best option.
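
As a rough illustration of that rule, here is a toy model in Python. It is only a sketch and not DSS's actual planner or API: the function `virtualized_datasets`, the edge lists, and the input dataset name `source` are all hypothetical, and the flow shapes are assumed from the descriptions in the question. It treats a dataset as virtualizable only when it has both an upstream and a downstream dataset and virtualization is allowed on it.

```python
def virtualized_datasets(edges, allow_virtualization):
    """Return the datasets that can stay unbuilt in this toy model.

    edges: list of (upstream, downstream) dataset name pairs.
    allow_virtualization: names of datasets flagged as virtualizable.
    """
    has_upstream = {dst for _, dst in edges}
    has_downstream = {src for src, _ in edges}
    # Only intermediate datasets (something before AND after them) qualify.
    intermediate = has_upstream & has_downstream
    return intermediate & set(allow_virtualization)


# Virtualization is allowed on both split outputs, as in the question.
flags = {"split_1", "split_2"}

# Assumed shape of the first flow: both splits feed 'stacked'.
first_flow = [
    ("source", "split_1"),
    ("source", "split_2"),
    ("split_1", "stacked"),
    ("split_2", "stacked"),
]

# Assumed shape of the second flow: only 'split_1' has a downstream dataset.
second_flow = [
    ("source", "split_1"),
    ("source", "split_2"),
    ("split_1", "split_1_prepared"),
]

print(sorted(virtualized_datasets(first_flow, flags)))   # ['split_1', 'split_2']
print(sorted(virtualized_datasets(second_flow, flags)))  # ['split_1']
```

In the second flow, "split_2" is a terminal output of the split recipe, so the model leaves it out of the virtualized set even though virtualization is allowed on it.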

Let me know if that doesn't make sense or if you have any additional questions.

Thanks,
Sarina

yashpuranik · January 2023

Thanks @SarinaS! This makes sense to me.
