How to set up partitions to run a prepare recipe on Spark

gt
gt Registered Posts: 6 ✭✭✭

Hello,

I have data stored in S3 buckets, with each day's data in a date folder (%Y-%M-%D). However, I want to partition my data hourly based on the filename. Each file is named '%Y%M%D_%H...{guid}'. I am able to create hourly partitions using the pattern "/.*/%Y%M%D_%H.*". This creates the partitions, but throws an error when running a recipe on Spark.

I get the error below:

Invalid partitioning

Can't resolve the path of this partition to a valid folder: /.*/20220817_10.*.
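As I understand it, DSS substitutes the date tokens in the pattern for a given partition and then matches file paths against the resulting regex. Here is a rough Python sketch of that idea (a hypothetical illustration only, not DSS's actual resolution logic):

```python
import re

# Hypothetical sketch: substitute DSS-style date tokens in a
# partitioning pattern to get a concrete path regex for one partition.
def resolve_pattern(pattern: str, year: str, month: str, day: str, hour: str) -> str:
    substitutions = {"%Y": year, "%M": month, "%D": day, "%H": hour}
    for token, value in substitutions.items():
        pattern = pattern.replace(token, value)
    return pattern

# Resolving the hourly pattern for the 2022-08-17 10:00 partition:
resolved = resolve_pattern("/.*/%Y%M%D_%H.*", "2022", "08", "17", "10")
print(resolved)  # /.*/20220817_10.*

# A file path in a daily date folder matches the resolved regex:
print(bool(re.fullmatch(resolved, "/2022-08-17/20220817_10_abc123.csv")))  # True
```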

Thanks in advance for any help!


Answers

  • Zach
    Zach Dataiker, Dataiku DSS Core Designer, Dataiku DSS Adv Designer, Registered Posts: 153 Dataiker

Hi @gt,

Could you please double-check that the partitioning pattern is correct on both the input and output datasets of the recipe?

Based on the error message, I think you have an extra period at the end of the pattern for the output dataset, i.e. "/.*/%Y%M%D_%H.*." instead of "/.*/%Y%M%D_%H.*"

    If that doesn't resolve the issue, please post screenshots of the partitioning settings for both the input and output dataset. Also, does the error occur for every partition that's being built, or just for the 20220817_10 partition?

    Thanks,

    Zach

  • gt
    gt Registered Posts: 6 ✭✭✭

Hi @ZachM,

    Thank you for your response.
There is no extra period at the end of the pattern for the output dataset. The same pattern works when running with the DSS local stream engine, but fails only with Spark. The error occurs for all partitions.

    Below are the screenshots of the partition settings for both input and output datasets.

  • Zach
    Zach Dataiker, Dataiku DSS Core Designer, Dataiku DSS Adv Designer, Registered Posts: 153 Dataiker

Hi @gt,

    Thank you for providing the screenshots.

    Your partition settings look good to me, so I'm not sure what would be causing this error.

    Could you please open a support ticket so that we can further assist with this issue? You can open a ticket by going to https://support.dataiku.com/support/tickets/new

    In the description, please include a link to this community post, as well as a job diagnostic of the failing job.

    To create a job diagnostic, from the job page, click on Actions > Download job diagnosis.
    If the resulting file is too large to attach (> 15 MB), you can use https://dl.dataiku.com to send it to us. Please don't forget to send the link that is generated when you upload the file.

    Thanks

  • Henk
    Henk Dataiku DSS Core Designer, Registered Posts: 3

    Hello ZachM,

I experience exactly the same issue. The Sync recipe with the DSS engine runs fine; however, the Sync recipe with the Spark engine fails. Did the support team find any issues?

    Could the issue be spaces in the filename? For example:

    "root / folder / data / partition_date=20221030093000 / parquet_file_1234"

    with partition pattern: /partition_date=%Y%M%D.*

    causing the error that partition "/partition_date=20221030" can't be found?

    Best Regards,

    Henk

  • Zach
    Zach Dataiker, Dataiku DSS Core Designer, Dataiku DSS Adv Designer, Registered Posts: 153 Dataiker
    edited July 17

Hi @Henk,

    We determined that the issue is that when using the Spark engine with Parquet-formatted files, each partition must be in its own directory. In other words, the partitioning pattern must end with "/.*".

    It looks like your file structure already has each partition in its own directory, so you could fix it by using a partitioning pattern like this:

    /partition_date=%Y%M%D.*/.*
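As a quick illustration (a hypothetical sketch, not a DSS API), the requirement boils down to the pattern ending with "/.*", so that each partition resolves to a directory and files sit one level below it:

```python
# Hypothetical helper: check whether a partitioning pattern keeps each
# partition in its own directory, as the Spark engine requires for
# Parquet-formatted files. In practice this means the pattern must end
# with "/.*".
def spark_parquet_compatible(pattern: str) -> bool:
    return pattern.endswith("/.*")

# The original pattern matches files directly, so Spark rejects it:
print(spark_parquet_compatible("/partition_date=%Y%M%D.*"))     # False
# The corrected pattern resolves each partition to a directory:
print(spark_parquet_compatible("/partition_date=%Y%M%D.*/.*"))  # True
```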

    Thanks,

    Zach

  • gt
    gt Registered Posts: 6 ✭✭✭

Hi @ZachM,

I am working on a dataset partitioned similarly to your comment above.

The folders are in the format "/dt=%Y-%M-%D/.*". When I use the same format in the partitioning pattern, I see the error "Oops: Unexpected error occurred". Attached are the folder structure in S3 and the partitioning error.

  • gt
    gt Registered Posts: 6 ✭✭✭

Hi @ZachM,

    Below is the attachment with the actual partition I used.

    Best,

    Gowtham.

  • Zach
    Zach Dataiker, Dataiku DSS Core Designer, Dataiku DSS Adv Designer, Registered Posts: 153 Dataiker

Hi @gt,

    We'll need an instance diagnosis to further troubleshoot this error. Could you please create a new support ticket, and attach an instance diagnosis to the ticket?

Please reproduce the error right before creating the instance diagnosis so that it shows up in the logs. To create an instance diagnosis, go to Administration > Maintenance > Diagnostic tool. Note that you need to be an administrator of the DSS instance; otherwise, you'll need to ask your admin.

    If the resulting file is too large to attach (> 15 MB), you can use https://dl.dataiku.com to send it to us. Please don't forget to send the link that is generated when you upload the file.

    Thanks,

    Zach
