
Read only few partitions from dataset in Spark recipe

fmonari
Level 2

Hi, 

I am trying to understand partitioned datasets in DSS.

I have created a dataset that is partitioned by day (one folder per day, Hive-style: Day=2020-01-01). Now I would like to read into a SparkSQL recipe only the partitions after, say, 2020-01-01. I tried "where Day >= '2020-01-01'", but I get an error because the column is not in the schema.
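To illustrate what I mean (paths and format below are just placeholders): if I read the folder directly with plain Spark, the "Day" column is inferred from the Hive-style folder names, but inside the SparkSQL recipe it is not part of the dataset's schema.

```python
# Sketch with placeholder paths; folders look like /data/mydataset/Day=2020-01-01/
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Plain Spark partition discovery adds "Day" to the schema from the folder names
df = spark.read.parquet("/data/mydataset")

# ... so this filter works (and prunes partitions) outside DSS,
# but fails in the SparkSQL recipe because "Day" is not in the schema there:
df_recent = df.where("Day >= '2020-01-01'")
```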

Can anyone explain to me how to achieve this?

Thanks, 

Filippo

2 Replies
EliasH
Dataiker

Hi @fmonari ,

You can achieve this by specifying, in the run options of your SparkSQL recipe, which partitions you'd like to build upon.

(Screenshot: Screen Shot 2021-07-12 at 5.36.54 PM.png)

Your "Day" column was removed from the schema as expected because DSS redispatches partitioning according to your input columns during file-based partitioning.

Hope this helps. Please let me know if you have any additional questions.

Best,

Elias

fmonari
Level 2
Author

Hi @EliasH,

Here below are the run options that are available to me; I cannot see the same option as you. Is that because the output dataset needs to be partitioned as well? Is there a way to programmatically select the partitions?
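For instance, something along these lines is what I am hoping for (I am guessing at the API here, so treat the method names below as assumptions on my part):

```python
# Hypothetical sketch: selecting partitions from code before reading.
import dataiku

ds = dataiku.Dataset("my_partitioned_dataset")  # placeholder name

# Request only the partitions from 2020-01-02 onwards:
for day in ["2020-01-02", "2020-01-03", "2020-01-04"]:
    ds.add_read_partitions(day)

df = ds.get_dataframe()  # would contain only the requested partitions
```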

Regards,

Filippo

(Screenshot: receipe_run_options.PNG)
