Read only few partitions from dataset in Spark recipe

fmonari
Level 2

Hi, 

I am trying to understand partitioned dataset in DSS.

I have created a dataset that is partitioned by day (one folder per day, Hive style: Day=2020-01-01). Now I would like to read into a SparkSQL recipe only the partitions after, say, 2020-01-01. I tried "where Day >= '2020-01-01'", but I get an error because the column is not in the schema.
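For context, this is roughly how Hive-style partition pruning works in general: the folder name itself carries the partition value, so a reader can decide which folders to scan before opening any data files. A minimal conceptual sketch in plain Python (the folder names and `prune_partitions` helper are illustrative, not a DSS or Spark API):

```python
def prune_partitions(folders, min_day):
    """Keep only Hive-style 'Day=YYYY-MM-DD' folders at or after min_day."""
    kept = []
    for name in folders:
        key, _, value = name.partition("=")
        # ISO dates compare correctly as strings, so a lexical test works
        if key == "Day" and value >= min_day:
            kept.append(name)
    return kept

folders = ["Day=2019-12-31", "Day=2020-01-01", "Day=2020-01-02"]
print(prune_partitions(folders, "2020-01-01"))
# → ['Day=2020-01-01', 'Day=2020-01-02']
```

This is also why the "Day" column is not in the data schema: the value lives in the directory structure rather than in the files themselves.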

Can anyone explain how to achieve this?

Thanks, 

Filippo

2 Replies
EliasH
Dataiker

Hi @fmonari ,

You can achieve this by specifying, in the run options of your SparkSQL recipe, which partitions you'd like to build.

[Screenshot: Screen Shot 2021-07-12 at 5.36.54 PM.png]

Your "Day" column was removed from the schema as expected: with file-based partitioning, DSS redispatches the partition value into the folder structure rather than keeping it as a column in the data.

Hope this helps, please let me know if you have any additional questions. 

Best,

Elias

fmonari
Level 2
Author

Hi @EliasH,

below are the run options that are available to me. I cannot see the same option as you. Is that because the output dataset needs to be partitioned as well? Is there a way to programmatically select the partitions?

Regards,

Filippo

[Screenshot: receipe_run_options.PNG]
