Spark Pipeline Initial Dataset Creation

aw30 · July 2020

I just set up my first flow with Spark pipelines enabled and have a question about the initial setup of a dataset. When I add a recipe, I need to specify a dataset as its output. I then go back and check the "Virtualizable in build" box. When I run my flow, I can see that the intermediate dataset was not built, but the dataset still remains. Is there a step I am missing, so that these intermediate files aren't sitting out on storage and are truly just virtual?

Thank you for all the help and assistance!

dimitri · July 2020

Hi @aw30,

Enabling the "Virtualizable in build" option on a dataset allows it to be virtualized only when the job you run doesn't require it to be materialized. If the virtualizable dataset is the output of your job, it is considered required to be built, and so it won't be virtualized.

Also, enabling the "Virtualizable in build" option has no effect on data that already exists, so if the dataset has already been built and you want to clear its data, you have to do that separately.

Therefore, if you run the new recipes one by one while building a flow, all your datasets will be materialized, since each of them has in turn been the output of a job, even if it was set as virtualizable before each run. You can then clear the intermediate datasets and rerun the entire pipeline so that they are actually virtualized.

Thus, if you want to avoid building intermediate datasets while building the flow, create a first recipe and configure its output dataset as virtualizable, then add another recipe before running the first one. That way the dataset becomes an intermediate dataset that is not required to be built. If the recipes upstream and downstream of this dataset are both configured to run with the Spark engine, you have a Spark pipeline with a virtualized intermediate dataset that won't be materialized as long as no job requires it.
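To make the rule above concrete, here is a toy model in plain Python. It is only a sketch of the behaviour described in this post, not how DSS plans jobs internally: it assumes a simple linear flow, assumes every recipe runs on the Spark engine, and the function and dataset names are made up for illustration.

```python
def datasets_written(chain, virtualizable, target):
    """Toy model: which datasets of a linear flow get written when `target` is built.

    chain: dataset names in flow order (the source dataset is left out).
    virtualizable: set of dataset names with "Virtualizable in build" enabled.
    target: the dataset requested as the output of the job.
    """
    upstream = chain[: chain.index(target) + 1]
    # The job's output is always materialized; an upstream dataset is
    # skipped only when it is flagged as virtualizable.
    return [d for d in upstream if d == target or d not in virtualizable]


# Hypothetical three-recipe flow with both intermediate datasets flagged.
chain = ["prepared", "joined", "final"]
flags = {"prepared", "joined"}

# Running each new recipe as soon as it is created: every dataset is,
# at some point, the output of a job, so all of them end up on storage.
one_by_one = set()
for dataset in chain:
    one_by_one.update(datasets_written(chain, flags, dataset))

# Building only the last dataset: the intermediates stay virtual.
single_build = datasets_written(chain, flags, "final")

print(sorted(one_by_one))
print(single_build)
```

In this model the one-by-one approach writes all three datasets, while a single build of the last dataset writes only that one, which matches the two situations described above.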

I hope it helps.

dimitri · July 2020

Hi Anne,

You can clear the data of these datasets from the UI. In the flow, select one or more datasets (Shift + click) and use the "Clear data" option in the right-hand panel. The same option is also available from the dataset page.

Alternatively, you can use the API with the dataset.clear() method.
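As a rough sketch of the API route, the helper below clears several datasets in one go. It assumes the public Python client (dataikuapi), where a project handle exposes get_dataset(name) and the returned dataset handle exposes clear(). The helper name, host, API key, project key and dataset names are all placeholders, so adapt them to your instance.

```python
def clear_datasets(project, dataset_names):
    """Clear the stored data of each named dataset and return the names cleared.

    `project` is expected to behave like a dataikuapi project handle:
    project.get_dataset(name) returns an object with a clear() method.
    """
    cleared = []
    for name in dataset_names:
        # Removes the data only; the dataset itself stays in the flow.
        project.get_dataset(name).clear()
        cleared.append(name)
    return cleared


# Example usage against a real instance (all values are placeholders):
#
# import dataikuapi
# client = dataikuapi.DSSClient("https://dss.example.com:11200", "YOUR_API_KEY")
# project = client.get_project("MY_PROJECT")
# clear_datasets(project, ["prepared", "joined"])
```

After clearing, building only the final dataset of the pipeline should leave the virtualizable intermediate datasets empty.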

Have a great day!

aw30 · July 2020

Hi Dimitri,

Thank you for your explanation, it is very helpful! When you say "you can then clear", what action would I take to clear the dataset off storage?

Thank you again!

Anne

aw30 · July 2020

Thank you again!
