When partitioning a dataset with the "Redispatch" option, do not remove partitioning dimension(s)

Tanguy · October 2022

NOTE: this post deals with files-based partitioning

For some reason, Dataiku removes the partitioning dimension(s) from the data when a files-based dataset is partitioned using a Sync recipe with the "Redispatch" option.

See for example this hands-on tutorial: "Dataiku DSS warns that a schema update is required. This is because redispatching removes the purchase_date column [= partitioning dimension] when our dataset is stored on a file system".

This behaviour can be annoying as it prevents:

explicitly visualizing the partitions in the explore tab
performing computations on the partitioning column (e.g., for time-based partitions, one may need to compute the time elapsed since the partition date)
retrieving the partitioning dimension later in the flow when switching to an unpartitioned format (at least with Spark SQL; I noticed that AWS Athena, which is only recommended for querying datasets, does retrieve the partitioning dimension)

An easy workaround is to duplicate the partitioning column(s) under a new name before redispatching, but this feels like a code smell.
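To make the workaround concrete, here is a minimal sketch in plain Python (it does not use Dataiku's API). The column name `purchase_date` comes from the tutorial quoted above; the `purchase_date_copy` name, the function names, and the sample rows are hypothetical, and `simulate_redispatch` only imitates the column removal described in this post rather than reproducing what DSS actually does.

```python
from datetime import date


def duplicate_partition_column(rows, source="purchase_date", copy="purchase_date_copy"):
    """Return new rows with the partitioning column copied under another name."""
    return [{**row, copy: row[source]} for row in rows]


def simulate_redispatch(rows, dimension="purchase_date"):
    """Group rows by partition value and drop the partitioning column.

    Assumption: this mimics the behaviour described in the post; it is not DSS code.
    """
    partitions = {}
    for row in rows:
        kept = {key: value for key, value in row.items() if key != dimension}
        partitions.setdefault(row[dimension], []).append(kept)
    return partitions


def days_since(partition_value, today):
    """Days elapsed between an ISO-formatted partition date and `today`."""
    return (today - date.fromisoformat(partition_value)).days


# Hypothetical sample data
rows = [
    {"purchase_date": "2022-10-01", "amount": 12.5},
    {"purchase_date": "2022-10-03", "amount": 7.0},
]

partitions = simulate_redispatch(duplicate_partition_column(rows))

# The original column is gone, but the copy is still available for computations.
for partition_rows in partitions.values():
    for row in partition_rows:
        row["days_elapsed"] = days_since(row["purchase_date_copy"], date(2022, 10, 31))
```

The copy survives only because it has a different name from the partitioning dimension, which is exactly why the workaround feels redundant: the same value ends up stored both in the partition identifier and in every row.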

June · October 2022

I upvoted this too. It is this behavior with partitions in the local filesystem that makes me always use my own SQL DB instead. When you partition a SQL dataset, you can still see the partitions in the table.

Elie · November 2022

Thanks for your idea, @tanguy. Your idea meets the criteria for submission; we'll reach out should we require more information.

If you’re reading this and think this would be a great capability to add to DSS, be sure to kudos the original post!

Take care

Elie · November 2022

Thanks for submitting this idea and for sharing the context around why it would be useful to your team. You'll be pleased to hear this idea is in our backlog. It is a request we've received from customers, and we are determining the next steps for development. We can't provide a timeline at this point, but be sure to check back for updates! For everyone else, kudos the original post to signal that you're interested in Dataiku developing and releasing this feature!

Take care

In the Backlog · Last Updated October 2022
