Prepare recipe parse format

Hello everyone,

One feature that would be very useful for me and many of my colleagues is the ability to set the type of many columns at once. Currently this is possible for meanings:

capture1.PNG

But there is no impact on the metadata of the output dataset. It's an issue for us because our sources have poor completeness, with a lot of null values, so we often see this reinterpretation in a Prepare recipe (it seems that only the explorer sample, not the whole dataset, is used to create the schema):

Capture2.PNG

Don't hesitate to tell me if it's not clear, or if there is already a solution for this use case,

Thanks a lot,

10 Comments

@Tuong-Vi 

I'm not sure I understand this.

I recently discovered the feature that applies automatically inferred schema types to all columns. (Which saves me a bunch of time.)

InferType.jpg

That said, this does have problems with high numbers of missing values. Sometimes there is not enough data in the first 10,000-row sample to find the correct type. Sometimes the data appears only in the most recent rows (not in the first 10,000 rows from the data source, but perhaps only in the last 10,000).
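
To make the sampling problem concrete, here is a minimal sketch in plain Python. It is not how DSS infers types; `infer_type` is a hypothetical toy rule (every non-empty value parses as a number means double, otherwise string) and the column is made-up data. It only shows why a head sample full of blanks can disagree with the full data.

```python
from typing import Iterable, Optional


def infer_type(values: Iterable[Optional[str]]) -> str:
    """Toy inference rule (an assumption, not the DSS algorithm):
    'double' if every non-empty value parses as a float, otherwise 'string'.
    A sample with no usable values falls back to 'string'."""
    seen_value = False
    for value in values:
        if value is None or value.strip() == "":
            continue  # nulls and blanks carry no type information
        seen_value = True
        try:
            float(value)
        except ValueError:
            return "string"
    return "double" if seen_value else "string"


# Hypothetical amount column: blank for the first 10,000 rows, populated afterwards.
column = [""] * 10_000 + ["12.5", "0", "300.75"]

sample_type = infer_type(column[:10_000])  # what a first-10,000-rows sample sees
full_type = infer_type(column)             # what the whole column would support

print(sample_type, full_type)
```

With this toy rule the head sample gives string while the full column gives double, which is the kind of mismatch described above.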

--Tom

Tuong-Vi
Level 3

Thank you for your answer. To illustrate my use case, I have reproduced the issue with a mini flow:

flow.PNG

For the data_source file, there is no problem with "infer type schema" or with the set type functionality to force the data type:

capture.PNG

The source is OK with the format (all amounts = double), but when I create a Prepare recipe, some amounts are automatically reinterpreted as string:

prepare.PNG

 

I have tried defaulting but it doesn't work. At this step, the only way is to change the type manually, column by column, which is not simple with 50, 100 or 300 amount columns to manage...

This is a particular issue, but quite common for financial/accounting use cases because null and 0 values can have distinct meanings. That's why a set type function in a Prepare recipe (as in the file source dataset) would be useful for mass actions on data types.

Maybe, if I change the type of explorer, it will refresh the schema (but for me it's not 100% reliable)?

explorer.PNG

 

@Tuong-Vi 

As a Neuron, I suspect that you are on a recent version of DSS.

I have definitely seen some of the kinds of changes you are calling out in visual recipes. Maybe not as bad as it was back when I started with DSS. (Or maybe I understand how to deal with schema problems a little bit better.)

--Tom

Tuong-Vi
Level 3

Hello,

We're using V8.0.2 (but I have already seen this issue with V5). At Generali, the sources are Hive tables, so data formats are (theoretically) well managed before ingestion into DSS.

I did some tests on a local machine with a PostgreSQL table and a CSV file to be sure I could reproduce this case. It makes sense that the more complete the data is, the more stable the schemas stay across the DSS flow. I guess null/blank values have to be handled by users before running DSS flows.

I'm aware I can fix it quickly with a cast in an SQL recipe, but I'm still looking for a graphical way to deal with this subject, for our clickers/beginner end users 🙂

(I have posted a similar issue there.)
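
For anyone taking the SQL cast route in the meantime, typing 50 to 300 casts by hand is tedious, so here is a rough sketch that generates the statement. Everything in it is an assumption for illustration: the helper `build_cast_select`, the table name `data_source_prepared`, the `amount_N` column names, and the PostgreSQL-style double-quoted identifiers (Hive would need backticks and its own type names).

```python
from typing import Dict


def build_cast_select(table: str, columns: Dict[str, str]) -> str:
    """Build a SELECT that casts each listed column to the requested SQL type.
    Identifiers are double-quoted, which assumes PostgreSQL-style quoting."""
    casts = [
        f'CAST("{name}" AS {sql_type}) AS "{name}"'
        for name, sql_type in columns.items()
    ]
    return "SELECT\n  " + ",\n  ".join(casts) + f'\nFROM "{table}"'


# Hypothetical amount columns that should all stay double.
amount_columns = {f"amount_{i}": "DOUBLE PRECISION" for i in range(1, 4)}

query = build_cast_select("data_source_prepared", amount_columns)
print(query)
```

The generated text can then be pasted into an SQL recipe. It only covers the listed columns, so any other columns to keep would have to be added to the SELECT as well.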

AshleyW
Dataiker

Hi, 

Thanks for the suggestion @Tuong-Vi; we've heard similar requests. If I've understood the thread, you're looking for a visual-friendly way to prevent the Prepare recipe from inferring data types, so that it always 'inherits' those types from the input dataset?

Ashley

Status changed to: Gathering Input

AshleyW
Dataiker

Hi @Tuong-Vi , let me know if I've correctly understood your request. Thanks!

ecathell
Level 1

@AshleyW It would be great if Dataiku would respect our type and meaning choices through an entire pipeline. We've been quite frustrated at Dataiku's insistence on changing text fields to booleans and decimal fields to scientific notation, and at the need to reparse dates every time they go through a Prepare recipe.

Aureltito
Level 1

Hi,

I agree. I don't understand why Dataiku changes the format from Double to String after a Prepare recipe. I checked, and every row is filled with an integer.

Prepare Recipe changing format.PNG

So why is Dataiku changing this format? How can we avoid this behaviour?

Best regards,

MichaelG
Community Manager
Status changed to: Gathering Input
 
ktgross15
Dataiker

Hello all!

I wanted to share an exciting update we just released as part of V12 which should relieve this frustration. 

In all DSS versions prior to V12, the default behavior is to infer column types for all dataset formats (as is probably obvious from the conversation above).

V12 has a new default behavior for all new Prepare recipes (existing recipes will not be changed): data types are inferred for loosely-typed input dataset formats only (e.g. CSV) and locked for strongly-typed ones (e.g. SQL, Parquet). This means that storage types will no longer change for SQL or Parquet inputs.

We also now have an admin setting (Administration > Settings > Misc) in the UI to change this behavior if you so choose.

See details in our reference docs and release notes.

Let me know if you have any questions!

Katie

Status changed to: Released
