Recipes should not reinterpret storage types or meanings

Description

When applying a Prepare recipe to a dataset, Dataiku will often reinterpret storage types and meanings. To be clear, I am not referring to Dataiku guessing the storage type or meaning of a derived column created within the Prepare recipe. I am referring to existing columns, often previously configured with the correct storage type and meaning, being changed.

Impact

This has caused hours of frustration, rework, troubleshooting, and extra processing across our teams. Not all users are aware of this behavior, or remember it in the moment, which leads them to hunt down an issue that Dataiku created for them. For those who are aware of the behavior, it is incredibly frustrating to configure your dataset correctly at the start of a project and then have to re-check columns at various points to ensure Dataiku hasn't second-guessed your choices.

Suggested Fix

I believe there should be an environment- and project-level setting to disable this behavior. If that is not a viable option, there should at minimum be the ability to "lock" a storage type and meaning so that Dataiku will leave it as it has been set. Implementing all three would obviously give the most flexibility, but my preference would be a simple switch at the environment and project levels. While I'm mentioning the Prepare recipe here, I remember seeing documentation (which I cannot find now) stating that at least one other recipe has the same behavior; the implemented fix should work regardless of the recipe.

7 Comments
info-rchitect
Level 6

Absolutely agree, I have discontinued nearly all use of visual recipes because of this exact behavior.

ktgross15
Dataiker

Thank you for your feedback, this is in our backlog and we're investigating potential options here.

I will share any relevant updates here.

Status changed to: In Backlog

Turribeach

I am curious in what scenarios this happens. Does it happen on visual recipes? I haven't seen any issues around data types, but we mostly use Python recipes. Will setting the meaning prevent any changes? (Dataiku adds a lock icon when you set the meaning.)

justindavis_apd
Level 1

Could not agree more. This is incredibly frustrating behavior, and, as mentioned above, it causes re-development! To answer @Turribeach's question above: in our experience it occurs primarily with visual recipes.

In my experience it is less of an issue if you decide up front to lock the meaning of the field (ironically, not the storage type, which is what I would expect to be the primary way to resolve this).

I understand the issue here, but the existing behaviour is clearly by design (i.e. it's not a bug, it's a feature). If this Idea existed as a product feature and you enabled it on your project datasets, you would then have to manually manage every data type change through your whole flow. Would this not defeat the purpose of using a flexible data prep tool like Dataiku? Being able to propagate schema changes quickly is a key enabler for faster development of data pipelines.

If the issue is really that you want to know when a data type has changed, so that it doesn't happen without you noticing, wouldn't something like a custom Python metric that looks at the dataset's data types and triggers a check be a better approach? Thanks
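
For what it's worth, here is a minimal sketch of what such a check could look like, run inside DSS (for example as a scenario Python step). The dataset name and the expected storage types are hypothetical placeholders, not anything from this project:

```python
# Sketch of a schema-drift check: compare the current storage types of a
# dataset against the types you configured by hand, and fail loudly if a
# recipe has silently changed them. Dataset name and types are placeholders.
import dataiku

EXPECTED_TYPES = {
    "customer_id": "bigint",
    "signup_date": "date",
    "revenue": "double",
}

dataset = dataiku.Dataset("my_prepared_dataset")  # hypothetical dataset name
actual_types = {col["name"]: col["type"] for col in dataset.read_schema()}

drifted = {
    name: {"expected": expected, "actual": actual_types.get(name)}
    for name, expected in EXPECTED_TYPES.items()
    if actual_types.get(name) != expected
}

if drifted:
    # Raising here makes the scenario step fail, so the silent type change
    # surfaces as a visible error instead of propagating downstream.
    raise ValueError("Storage types changed by a recipe: %s" % drifted)
```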

ktgross15
Dataiker

To provide a bit more color here...

@Turribeach - this is specific only to the prepare recipe - you can see details on type inference here.

To explain a bit of the "why": it's by design, in order to allow the prepare recipe to support such a vast array of transformations across different input data sources. Some data sources, such as Excel, don't have data types, so when you import them into DSS the columns are all strings by default. So if you were to, for example, add a Formula step that adds numeric columns colA and colB, with values 3 and 2 respectively, then without type inference you'd get 3 + 2 = "32" (instead of 5), since these columns would be treated as strings.
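
As a plain-Python illustration of the same effect (this is just pandas outside DSS, with made-up values, not the prepare recipe's internals):

```python
# Columns imported without types arrive as strings, so "+" concatenates
# instead of adding; casting to a numeric type restores real arithmetic.
import pandas as pd

df = pd.DataFrame({"colA": ["3"], "colB": ["2"]})  # untyped import -> strings
print((df["colA"] + df["colB"]).iloc[0])           # prints "32"

typed = df.astype(int)                             # what type inference gives you
print((typed["colA"] + typed["colB"]).iloc[0])     # prints 5
```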

Most of the time, the type inference is a helpful feature to allow the prepare recipe to work as it does, but sometimes you may end up with unexpected data type changes, and I 100% hear the frustration here. 

It is something we're looking into, but given the nature of how embedded it is in the DNA of the prepare recipe, it requires extensive exploration to ensure that this is a safe change that doesn't introduce bugs, etc. 

We will let you know if there are any relevant updates to share, and thanks for all the feedback! 🙂

MichaelG
Community Manager
 
Status changed to: In the Backlog