
I used to be able to partition with spaces in the identifier, but now I cannot in a new instance

kathyqingyuxu

Hello,

I have a question on partitioning. I used to be able to partition with partition identifiers that contain a space, for example "Bob Smith" or "Anne Xu". However, we are moving to a new instance of DSS, and when we test the same partitioning script we run into errors. The errors state that the space in the partition value causes the Spark recipe to throw a log4j file-not-found exception. The root cause turns out to be that the folder name in the server user space contains a literal space, while the file is checked with a path containing the URL-encoded form (%20) instead of the space " ".

 

In both instances I am partitioning with a Spark SQL script and running on the Spark engine. The only difference between our old and new instances is that in the old instance the data is saved to HDFS, whereas in the new instance the data is saved to S3. I wanted to ask whether this is a limitation of the dataset type or something else.

 

For now, as a workaround, we can try using an underscore instead of the space. However, I would be interested in understanding why this is happening. Any help on the above is greatly appreciated, thanks!

1 Solution
kathyqingyuxu

We were able to figure out what the root cause was on our end! If anyone is interested:

 

The root cause came from YARN and how YARN's resource handler manages path names. Our old instance uses a static MapR cluster. The MapR cluster uses YARN, which by default does not encode spaces, so when it inherits the partition name in the log4j.properties path, the path correctly contains spaces. Our new instance, however, does not use YARN. We use EKS, which silently escapes spaces into their encoded form (%20). The result is a path that does not exist, so log4j.properties is not read properly.
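To illustrate the mismatch, here is a minimal Python sketch. It is not DSS, YARN, or EKS code. It only assumes that the directory on disk is named with a literal space while the lookup uses the percent-encoded name. The partition value and the temporary directory are placeholders made up for the example.

```python
import os
import tempfile
from urllib.parse import quote

# Hypothetical partition identifier containing a space.
partition_id = "Bob Smith"

with tempfile.TemporaryDirectory() as root:
    # The directory is created with the literal space in its name.
    real_dir = os.path.join(root, partition_id)
    os.makedirs(real_dir)
    real_file = os.path.join(real_dir, "log4j.properties")
    with open(real_file, "w") as fh:
        fh.write("log4j.rootLogger=INFO, console\n")

    # A component that percent-encodes the name looks for "Bob%20Smith".
    encoded_name = quote(partition_id)
    encoded_file = os.path.join(root, encoded_name, "log4j.properties")

    literal_exists = os.path.exists(real_file)
    encoded_exists = os.path.exists(encoded_file)

print(encoded_name, literal_exists, encoded_exists)
```

The file exists under the literal name but not under the encoded one, which matches the file-not-found symptom described above.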

 

In general, as a best practice on our end, we will encourage users not to use spaces in partition identifiers to avoid similar problems in the future.
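For anyone who wants to enforce this before the values reach the recipe, here is a minimal sketch in Python. `sanitize_partition_id` is a hypothetical helper written for this post, not a DSS function, and it assumes partition identifiers are plain strings.

```python
import re


def sanitize_partition_id(value: str) -> str:
    """Trim the value and replace each run of whitespace with one underscore."""
    return re.sub(r"\s+", "_", value.strip())


# Example identifiers taken from the original question.
ids = ["Bob Smith", "Anne Xu"]
safe_ids = [sanitize_partition_id(v) for v in ids]
print(safe_ids)
```

Note that this mapping is not reversible, and two distinct values (for example "Bob Smith" and "Bob_Smith") would end up in the same partition, so it is worth checking for collisions first.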



Ignacio_Toledo

Thanks for sharing the solution @kathyqingyuxu !
