
How to set up partitions to run a Prepare recipe on Spark

gt
Level 2

Hello,

I have data stored in S3 buckets, with each day's data in a date folder (%Y-%M-%D). However, I want to partition my data hourly based on the filename. Each file is named '%Y%M%D_%H...{guid}'. I am able to create hourly partitions using the pattern "/.*/%Y%M%D_%H.*". This creates the partitions, but throws an error when running a recipe on Spark.

I get the error below:

 Invalid partitioning

Can't resolve the path of this partition to a valid folder: /.*/20220817_10.*.

Thanks in advance for any help!
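
For readers following along, here is a rough Python sketch of how a pattern like "/.*/%Y%M%D_%H.*" could map file paths to hourly partitions. It is not DSS code: the regex translation, the function names, the sample file names, and the YYYY-MM-DD-HH identifier format are all assumptions made for illustration.

```python
import re

# Hypothetical mapping from the time tokens used in this thread to regex groups.
TOKENS = {
    "%Y": r"(?P<year>\d{4})",
    "%M": r"(?P<month>\d{2})",
    "%D": r"(?P<day>\d{2})",
    "%H": r"(?P<hour>\d{2})",
}


def pattern_to_regex(pattern):
    """Translate a partitioning pattern into a compiled regex (illustrative only)."""
    pieces = []
    # Split on time tokens and the ".*" wildcard, keeping the separators.
    for part in re.split(r"(%[YMDH]|\.\*)", pattern):
        if part in TOKENS:
            pieces.append(TOKENS[part])
        elif part == ".*":
            pieces.append(".*")
        else:
            pieces.append(re.escape(part))  # literal text such as "/" or "_"
    return re.compile("^" + "".join(pieces) + "$")


def partition_id(pattern, path):
    """Return an identifier such as '2022-08-17-10', or None if the path does not match."""
    match = pattern_to_regex(pattern).match(path)
    if match is None:
        return None
    groups = match.groupdict()
    keys = [k for k in ("year", "month", "day", "hour") if groups.get(k)]
    return "-".join(groups[k] for k in keys)


PATTERN = "/.*/%Y%M%D_%H.*"
# Made-up file names that follow the '%Y%M%D_%H...' convention described above.
sample_paths = [
    "/2022-08-17/20220817_10_abc123.csv",
    "/2022-08-17/20220817_11_def456.csv",
    "/2022-08-17/readme.txt",
]
for path in sample_paths:
    print(path, "->", partition_id(PATTERN, path))
```

With this pattern the hour is taken from the file name rather than from a folder, so several hourly partitions end up sharing the same date directory.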

ZachM
Dataiker

Hi @gt ,

Could you please double-check that the partitioning pattern is correct on both the input and output datasets of the recipe?

Based on the error message, I think you have an extra period at the end of the pattern for the output dataset, e.g. /.*/%Y%M%D_%H.*. instead of /.*/%Y%M%D_%H.*

If that doesn't resolve the issue, please post screenshots of the partitioning settings for both the input and output dataset. Also, does the error occur for every partition that's being built, or just for the 20220817_10 partition?

 

Thanks,

Zach

gt
Level 2
Author

Hi @ZachM ,

Thank you for your response.
There is no extra period at the end of the pattern for the output dataset. The same recipe works when run with the DSS local stream engine, and fails only on Spark. The error occurs for all partitions.

Below are the screenshots of the partition settings for both input and output datasets.

ZachM
Dataiker

Hi @gt ,

Thank you for providing the screenshots.

Your partition settings look good to me, so I'm not sure what would be causing this error.

Could you please open a support ticket so that we can further assist with this issue? You can open a ticket by going to https://support.dataiku.com/support/tickets/new

In the description, please include a link to this community post, as well as a job diagnostic of the failing job.

To create a job diagnostic, from the job page, click on Actions > Download job diagnosis.
If the resulting file is too large to attach (> 15 MB), you can use https://dl.dataiku.com to send it to us. Please don't forget to send the link that is generated when you upload the file.

Thanks

Henk
Level 1

Hello ZachM,

I'm experiencing exactly the same issue. The Sync recipe runs fine with the DSS engine, but fails with the Spark engine. Did the support team find any issues?

Could the issue be spaces in the filename? For example:

    "root / folder data / partition_date=20221030093000 / parquet_file_1234"

with partition pattern: /partition_date=%Y%M%D.*

causing the error that partition "/partition_date=20221030" can't be found?

Best Regards,

Henk

ZachM
Dataiker

Hi @Henk,

We determined that the issue is that when using the Spark engine with Parquet-formatted files, each partition must be in its own directory. In other words, the partitioning pattern must end with "/.*".

It looks like your file structure already has each partition in its own directory, so you could fix it by using a partitioning pattern like this:

/partition_date=%Y%M%D.*/.*

 

Thanks,

Zach
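
Zach's rule (with Spark and Parquet, the pattern must end with "/.*" so that each partition sits in its own directory) can be sketched as a quick check. This is only a naming heuristic written for this thread, not a DSS API; the function name is hypothetical, and it only knows the time tokens %Y, %M, %D and %H that appear above.

```python
import re

# Time tokens used in the patterns discussed in this thread.
TIME_TOKEN = re.compile(r"%[YMDH]")


def partitions_have_own_directory(pattern):
    """Heuristic: True if the last path segment is the bare wildcard ".*"
    and the partition tokens appear in the directory part of the pattern."""
    directory, _, last_segment = pattern.rpartition("/")
    return last_segment == ".*" and TIME_TOKEN.search(directory) is not None


patterns = [
    "/.*/%Y%M%D_%H.*",              # hour encoded in the file name
    "/partition_date=%Y%M%D.*",     # no trailing "/.*"
    "/partition_date=%Y%M%D.*/.*",  # the pattern suggested above
    "/dt=%Y-%M-%D/.*",              # one folder per day
]
for p in patterns:
    print(p, "->", partitions_have_own_directory(p))
```

By this check, the first two patterns place the partition tokens in the last path segment, which matches the cases that failed on Spark in this thread, while the last two give every partition its own folder.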

gt
Level 2
Author

Hi @ZachM ,

I am working on a dataset partitioned similarly to the one in your comment above.

 

The folders are in the format "/dt=%Y-%M-%D/.*". When I use the same format in the partitioning settings, I see the error "Oops: Unexpected error occurred". Attached are the folder structure in S3 and the partitioning error.

gt
Level 2
Author

Hi @ZachM ,

Below is the attachment with the actual partition I used.

Best,

Gowtham.

ZachM
Dataiker

Hi @gt,

We'll need an instance diagnosis to further troubleshoot this error. Could you please create a new support ticket, and attach an instance diagnosis to the ticket?

Please reproduce the error right before creating the instance diagnosis so that it shows up in the logs. To create an instance diagnosis, go to Administration > Maintenance > Diagnostic tool. Note that you need to be an administrator of the DSS instance; otherwise, you'll need to ask your admin.

If the resulting file is too large to attach (> 15 MB), you can use https://dl.dataiku.com to send it to us. Please don't forget to send the link that is generated when you upload the file.

Thanks,

Zach
