I am trying to understand partitioned datasets in DSS.
I have created a dataset that is partitioned by day (one folder per day, Hive style: Day=2020-01-01). Now I would like to read into a SparkSQL recipe only the partitions after, say, 2020-01-01. I tried "where Day >= '2020-01-01'", but I get an error because the column is not in the schema.
Can anyone explain to me how to achieve this?
Hi @fmonari ,
You can achieve this by specifying in the run options of your SparkSQL recipe which partitions you'd like to build upon.
Your "Day" column was removed from the schema as expected: with file-based partitioning, DSS stores the partition value in the folder path (Day=2020-01-01) rather than as a column in the data, so it is not available to filter on inside the recipe's SQL.
Hope this helps, please let me know if you have any additional questions.
Here below are the run options that are available to me. I cannot see the same option as you. Is that because the output dataset needs to be partitioned as well? Is there a way to programmatically select the partitions?
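On the programmatic side, here is a minimal sketch of one possible approach: build the list of day partition identifiers you want (everything after 2020-01-01, up to some end date you choose) in plain Python, then feed that list to the job. The `day_partitions` helper below is hypothetical, not part of the Dataiku API; the end date used in the example is an assumption.

```python
from datetime import date, timedelta

def day_partitions(start, end):
    """Hypothetical helper: list DSS-style day partition
    identifiers (YYYY-MM-DD), inclusive of both endpoints."""
    days = (end - start).days
    return [(start + timedelta(days=i)).isoformat() for i in range(days + 1)]

# Partitions strictly after 2020-01-01, up to an assumed end date:
parts = day_partitions(date(2020, 1, 2), date(2020, 1, 5))
print(",".join(parts))
```

The comma-separated result can be pasted into the partition field of the recipe's run options, or (if I understand the Python API correctly, worth double-checking in the Dataiku docs) passed from a Python recipe or scenario, e.g. via `dataiku.Dataset("my_dataset").add_read_partitions(",".join(parts))`.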