Dataset virtualization

yashpuranik Partner, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2022, Neuron 2023 Posts: 69 Neuron

Hi All,

I am trying to understand how virtualization in DSS works. In the following example, SQL pipelines are enabled and virtualization is allowed for 'split_1' and 'split_2'. When building 'stacked' with smart reconstruction, 'split_1' and 'split_2' remain unbuilt (virtualized) as expected.

Screenshot 2023-01-25 200600.png

However, in the next example, 'split_2' is created explicitly when building 'split_1_prepared' with smart reconstruction. Is this a bug or expected behavior? And if this is expected behavior, why?

Screenshot 2023-01-25 201513.png



Operating system used: CentOS

Best Answer

  • Sarina
    Sarina Dataiker, Dataiku DSS Core Designer, Dataiku DSS Adv Designer Posts: 315 Dataiker
    Answer ✓

    Hi @yashpuranik,

    Thank you for your query and examples! Indeed, this is expected behavior.

    When using SQL pipelines, the initial "input" datasets and the final "output" datasets cannot be virtualized. So if a dataset is not intermediate (i.e. it doesn't sit in the flow between two other datasets), then it's not possible to virtualize it. Virtualization usually only makes sense when the data will eventually feed into a final dataset that is actually written. So the assumption is that all output datasets do need to be created. Rebuilding the flow will rebuild the split recipe, so any output datasets of the split recipe will be created. In your second example, nothing consumes "split_2", so it counts as an output dataset and gets built. If you truly weren't going to use the data in "split_2", then selecting the split option "drop data" instead of creating the dataset "split_2" would probably be the best option.

    Let me know if that doesn't make sense or if you have any additional questions.

    Thanks,
    Sarina
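
To make the rule in the answer concrete, here is a minimal sketch in plain Python. It is a toy model of the two flows from the question, not Dataiku code and not how DSS decides this internally. The `Recipe` tuple, the `virtualizable` helper, and the input dataset name "source" are all hypothetical (the real input name is only visible in the screenshots). The sketch assumes a dataset can be virtualized only if virtualization is allowed for it and it is both produced by one recipe and consumed by another.

```python
from collections import namedtuple

# Hypothetical toy model of a flow: each recipe reads some datasets and writes others.
Recipe = namedtuple("Recipe", ["name", "inputs", "outputs"])


def virtualizable(recipes, allowed):
    """Return the datasets that could stay unbuilt under the rule described above.

    A dataset qualifies only if it is intermediate (written by one recipe and
    read by another) AND virtualization has been allowed for it.
    """
    produced = {d for r in recipes for d in r.outputs}
    consumed = {d for r in recipes for d in r.inputs}
    intermediate = produced & consumed
    return intermediate & set(allowed)


# First example: both split outputs feed the stack recipe.
flow_1 = [
    Recipe("split", ["source"], ["split_1", "split_2"]),
    Recipe("stack", ["split_1", "split_2"], ["stacked"]),
]

# Second example: only split_1 is consumed downstream; split_2 is a final output.
flow_2 = [
    Recipe("split", ["source"], ["split_1", "split_2"]),
    Recipe("prepare", ["split_1"], ["split_1_prepared"]),
]

allowed = {"split_1", "split_2"}

print(sorted(virtualizable(flow_1, allowed)))  # ['split_1', 'split_2']
print(sorted(virtualizable(flow_2, allowed)))  # ['split_1']
```

In the second flow, "split_2" drops out of the result because no recipe reads it, which matches the behavior observed in the question: it is treated as an output dataset and is built.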
