Read only a few partitions from a dataset in a Spark recipe

fmonari Registered Posts: 18 ✭✭✭✭

Hi,

I am trying to understand partitioned datasets in DSS.

I have created a dataset that is partitioned by day (one folder per day, Hive style: Day=2020-01-01). Now I would like to read into a SparkSQL recipe only some of the partitions, let's say those from 2020-01-01 onwards. I tried "where Day >= '2020-01-01'", but I get an error because the column is not in the schema.

Can anyone explain to me how to achieve this?

Thanks,

Filippo

Answers

  • EliasH
    EliasH Dataiker, Registered Posts: 34 Dataiker

    Hi @fmonari,

    You can achieve this by specifying, in the run options of your SparkSQL recipe, which partitions you'd like to build.

    [Screenshot: Screen Shot 2021-07-12 at 5.36.54 PM.png]

    Your "Day" column was removed from the schema as expected because DSS redispatches partitioning according to your input columns during file-based partitioning.

    Hope this helps, please let me know if you have any additional questions.

    Best,

    Elias

  • fmonari
    fmonari Registered Posts: 18 ✭✭✭✭

    Hi @EliasH,

    Below are the run options that are available to me. I cannot see the same option as you. Is that because the output dataset needs to be partitioned as well? Is there a way to programmatically select the partitions?

    Regards,

    Filippo

    receipe_run_options.PNG
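
On the question of selecting partitions programmatically, here is a minimal sketch in plain Python of the selection logic only. It assumes day partition identifiers in YYYY-MM-DD form, as in the Day=2020-01-01 folders described above. The helper name `partitions_on_or_after` and the sample list are hypothetical; in DSS the list of identifiers would have to come from the dataset itself (for example via the Python API's `Dataset.list_partitions()`, which should be checked against the documentation for your DSS version).

```python
from datetime import datetime

DAY_FORMAT = "%Y-%m-%d"  # assumed identifier format for day partitions


def partitions_on_or_after(partition_ids, cutoff):
    """Return the day partition identifiers on or after `cutoff`, sorted.

    Both the identifiers and the cutoff are strings such as "2020-01-01".
    """
    cutoff_day = datetime.strptime(cutoff, DAY_FORMAT).date()
    selected = [
        pid
        for pid in partition_ids
        if datetime.strptime(pid, DAY_FORMAT).date() >= cutoff_day
    ]
    return sorted(selected)


# Hypothetical partition list, standing in for whatever the dataset reports.
available = ["2020-01-02", "2019-12-31", "2020-01-01", "2019-12-30"]
wanted = partitions_on_or_after(available, "2020-01-01")
print(wanted)
```

The resulting list is only the set of identifiers to read. How it is handed to DSS depends on the context: inside a recipe the partitions read are normally decided by the recipe's partition dependencies and run options, so a selection like this is more likely to be useful from a notebook or a scenario step.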
