When partitioning a dataset with the "Redispatch" option, do not remove partitioning dimension(s)

Tanguy

NOTE: this post deals with files-based partitioning

For some reason, Dataiku removes the partitioning dimension(s) from the dataset's columns when a dataset is partitioned (files-based) by a Sync recipe with the "Redispatch" option.

See for example this hands-on tutorial: "Dataiku DSS warns that a schema update is required. This is because redispatching removes the purchase_date column [= partitioning dimension] when our dataset is stored on a file system".

This behaviour can be annoying as it prevents:

  1. explicitly visualizing the partitions in the explore tab
  2. performing computations on the partitioning column (e.g., for time-based partitions, one may need to compute the time elapsed since the partition date)
  3. retrieving the partitioning dimension later in the flow when switching to an unpartitioned format (at least with Spark SQL; I noticed that AWS Athena, which is only recommended for querying datasets, does retrieve the partitioning dimension)
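
To make the behaviour concrete, here is a toy model in plain Python of what is described above. It is only a sketch of the observed effect, not how DSS implements redispatching: the functions `redispatch` and `restore_dimension` are hypothetical names, and the sample rows are made up around the `purchase_date` column from the tutorial. After the split, the partition value survives only as the partition identifier, so it has to be re-attached by hand when going back to an unpartitioned dataset.

```python
from collections import defaultdict


def redispatch(rows, dimension):
    """Group rows by the partitioning column and drop that column from each row.

    Toy stand-in for the files-based behaviour described in the post:
    the value is kept only as the partition identifier (the dict key).
    """
    partitions = defaultdict(list)
    for row in rows:
        value = row[dimension]
        partitions[value].append(
            {key: val for key, val in row.items() if key != dimension}
        )
    return dict(partitions)


def restore_dimension(partitions, dimension):
    """Re-attach the partition identifier to every row as a regular column."""
    return [
        {**row, dimension: value}
        for value, part_rows in partitions.items()
        for row in part_rows
    ]


# Made-up sample data, for illustration only.
rows = [
    {"purchase_date": "2023-01-01", "amount": 10},
    {"purchase_date": "2023-01-01", "amount": 25},
    {"purchase_date": "2023-01-02", "amount": 7},
]

partitions = redispatch(rows, "purchase_date")
restored = restore_dimension(partitions, "purchase_date")
```

In this model, the rows inside `partitions` no longer carry `purchase_date`, which is the situation behind the three points above; `restore_dimension` is the extra step the idea asks DSS to make unnecessary.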

An easy workaround is to duplicate the partitioning column(s) under a new name before partitioning, but this feels like a code smell.
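
For illustration, the workaround can be sketched in plain Python as follows. In a real flow the duplication would be done in a DSS recipe before the redispatch; here the dropping of the partitioning column is simulated by hand, and the helper names (`duplicate_column`, `days_since`), the copy name `purchase_date_copy`, the sample rows and the reference date are all assumptions made for the example.

```python
from datetime import date


def duplicate_column(rows, source, copy_name):
    """Return new rows carrying a copy of `source` under `copy_name`."""
    return [{**row, copy_name: row[source]} for row in rows]


def days_since(iso_day, today):
    """Number of days between an ISO date string (YYYY-MM-DD) and `today`."""
    return (today - date.fromisoformat(iso_day)).days


# Made-up sample data, for illustration only.
rows = [
    {"purchase_date": "2023-01-01", "amount": 10},
    {"purchase_date": "2023-01-02", "amount": 7},
]

# Step 1: duplicate the partitioning column under another name.
with_copy = duplicate_column(rows, "purchase_date", "purchase_date_copy")

# Step 2: simulate the redispatch dropping the partitioning column.
after_redispatch = [
    {key: val for key, val in row.items() if key != "purchase_date"}
    for row in with_copy
]

# Step 3: the copy is still available for computations on the partition date.
reference_day = date(2023, 1, 10)  # arbitrary reference date for the example
elapsed = [
    days_since(row["purchase_date_copy"], reference_day)
    for row in after_redispatch
]
```

The copy survives because only the column declared as the partitioning dimension is removed; the cost is a redundant column whose sole purpose is to work around the removal.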

6 votes

In the Backlog

Comments

  • June

    I upvoted this too. It is this behavior with partitions on the local filesystem that makes me always use my own SQL database instead. When you partition a SQL dataset, you can still see the partitioning column in the table.

  • Elie

    Thanks for your idea, @tanguy. Your idea meets the criteria for submission; we'll reach out should we require more information.

    If you’re reading this and think this would be a great capability to add to DSS, be sure to kudos the original post!

    Take care

  • Elie

    Thanks for submitting this idea and for sharing the context around why it would be useful to your team. You'll be pleased to hear this idea is in our backlog. It is a request we've received from customers, and we are determining the next steps for development. We can't provide a timeline at this point, but be sure to check back for updates! For everyone else, kudos the original post to signal that you're interested in Dataiku developing and releasing this feature!

    Take care
