I used to be able to partition with spaces in the identifier, but I no longer can in a new instance

kathyqingyuxu Neuron, Registered, Neuron 2022 Posts: 46 Neuron


I have a question on partitioning. I used to be able to partition with partition identifiers that contain a space. Example: "Bob Smith, Anne Xu". However, we are moving to a new instance of DSS, and when we test the same partitioning script we run into errors. The errors state that the space in the partition value for the Spark recipe throws a log4j file not found exception. The root cause turns out to be that the folder and file in the server user space are looked up with a path containing the percent-encoded form (%20) instead of a literal space " ".

In both instances I am partitioning using a Spark SQL script and running on the Spark engine. The only difference between our old and new instance is that in the old instance the data is saved to HDFS, whereas in the new instance the data is saved to S3. I wanted to ask whether this is a limitation of the dataset type, or something else.

For now, as a workaround, we can try using an underscore instead of the space. However, I would be interested in understanding why this is happening. Any help on the above is greatly appreciated, thanks!
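The underscore workaround mentioned above could be sketched roughly as follows. This is only an illustration in plain Python, not a DSS API: the helper name `sanitize_partition_id` is hypothetical, and it assumes "Bob Smith" and "Anne Xu" are two separate partition identifiers.

```python
import re


def sanitize_partition_id(value: str, replacement: str = "_") -> str:
    """Hypothetical helper: collapse each run of whitespace into one replacement character."""
    # Strip leading/trailing whitespace first so identifiers do not start or end with "_".
    return re.sub(r"\s+", replacement, value.strip())


# Assumed example identifiers, taken from the question above.
partitions = ["Bob Smith", "Anne Xu"]
safe_partitions = [sanitize_partition_id(p) for p in partitions]
print(safe_partitions)
```

Applying a rename like this consistently wherever the partition values are produced would keep the identifiers free of characters that need escaping in a path.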

Best Answer

  • kathyqingyuxu
    Answer ✓

    We were able to figure out what the root cause was on our end! If anyone is interested:

    The root cause came from YARN and how YARN's resource handler manages path names. Our old instance uses a static MapR cluster. The MapR cluster uses YARN, which by default does not percent-encode spaces, so when it inherits the partition name in the log4j.properties path, the path correctly contains spaces. In our new instance, however, we do not use YARN. We use EKS, which silently escapes spaces as %20. This results in a path that does not exist, so log4j.properties is not read properly.

    In general, as a best practice on our end, we will encourage users not to use spaces in partition identifiers, to avoid similar problems in the future.
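The mismatch described in this answer can be reproduced in miniature with the Python standard library. This is a sketch under assumptions: it uses a temporary local directory rather than a real YARN or EKS setup, and the folder name "Bob Smith" is just the example identifier from the question. It shows only that a percent-encoded path and the literal path are different strings, so a lookup with the encoded one fails.

```python
import os
import tempfile
from urllib.parse import quote

partition = "Bob Smith"  # example partition identifier containing a space

with tempfile.TemporaryDirectory() as root:
    # Create the folder with a literal space and put a config file in it.
    partition_dir = os.path.join(root, partition)
    os.makedirs(partition_dir)
    literal_path = os.path.join(partition_dir, "log4j.properties")
    with open(literal_path, "w") as f:
        f.write("log4j.rootLogger=INFO\n")

    # Build the same path with the space percent-encoded as %20.
    encoded_name = quote(partition)
    encoded_path = os.path.join(root, encoded_name, "log4j.properties")

    literal_exists = os.path.exists(literal_path)
    encoded_exists = os.path.exists(encoded_path)

print(encoded_name)
print(literal_exists, encoded_exists)
```

The file is found under the literal path but not under the encoded one, which matches the "file not found" symptom for log4j.properties.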


  • Ignacio_Toledo

    Thanks for sharing the solution @kathyqingyuxu
