Read only few partitions from dataset in Spark recipe

fmonari · July 2021

Hi,

I am trying to understand partitioned datasets in DSS.

I have created a dataset that is partitioned by day (one folder per day, Hive style: Day=2020-01-01). Now I would like to read into a SparkSQL recipe only the partitions from, let's say, 2020-01-01 onward. I tried "where Day >= '2020-01-01'", but I get an error because the column is not in the schema.

Can anyone explain to me how to achieve this?

Thanks,

Filippo

EliasH · July 2021

Hi @fmonari,

You can achieve this by specifying, in the run options of your SparkSQL recipe, which partitions you'd like to build from.

[Screenshot: Screen Shot 2021-07-12 at 5.36.54 PM.png]

Your "Day" column was removed from the schema as expected because DSS redispatches partitioning according to your input columns during file-based partitioning.
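If you need many consecutive days, typing each identifier by hand gets tedious. Below is a minimal sketch in plain Python that builds the list of day identifiers for a date range. The function name day_partitions and the dates are only illustrative, and I am assuming here that the partition field accepts a comma-separated list of YYYY-MM-DD identifiers, so please check that against your DSS version.

```python
from datetime import date, timedelta


def day_partitions(start, end):
    """Return day identifiers (YYYY-MM-DD) from start to end, inclusive.

    Hypothetical helper for illustration; it is not part of any DSS API.
    """
    if end < start:
        raise ValueError("end must not be before start")
    n_days = (end - start).days + 1
    return [(start + timedelta(days=i)).isoformat() for i in range(n_days)]


# Example range (illustrative dates only).
ids = day_partitions(date(2020, 1, 1), date(2020, 1, 5))

# Assumption: the partition field takes a comma-separated list of identifiers.
spec = ",".join(ids)
print(spec)
```

The resulting string could then be pasted wherever the partitions to read are specified.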

Hope this helps, please let me know if you have any additional questions.

Best,

Elias

fmonari · July 2021

Hi @EliasH,

Below are the run options that are available to me. I cannot see the same option as you. Is that because the output dataset needs to be partitioned as well? Is there a way to select the partitions programmatically?

Regards,

Filippo
