
Prepare recipe parse format

Hello everyone,

One feature that would be very useful for me and many of my colleagues is the ability to set the type of many columns at once. Currently this is possible for meanings:

capture1.PNG

But this has no impact on the metadata of the output dataset. This is an issue for us because our sources have poor completeness, with many null values, so we often see this kind of type reinterpretation in a Prepare recipe (it seems that only the explore sample, not the whole dataset, is used to create the schema):

Capture2.PNG

Don't hesitate to tell me if this is not clear, or if there is already a solution for this use case.

Thanks a lot,

6 Comments
tgb417
Neuron

@Tuong-Vi 

I'm not sure I understand this.

I recently discovered the feature that applies inferred schema types to all columns (which saves me a lot of time).

InferType.jpg

That said, this does have problems with high numbers of missing values. Sometimes the first 10,000-row sample does not contain enough data to find the correct type. Sometimes the data is present only in the most recent rows (for example, only in the last 10,000 rows from the data source, not in the first 10,000).
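To make the sampling problem concrete, here is a minimal sketch in plain Python. It is not how DSS infers types internally; the `infer_type` helper and the example column are hypothetical and only assume that a type is guessed from the non-empty values seen in a sample.

```python
from typing import Iterable, Optional


def infer_type(values: Iterable[Optional[str]]) -> str:
    """Guess a column type from raw text values (hypothetical helper).

    Returns 'double' if every non-empty value parses as a float.
    Returns 'string' if any value fails to parse, or if the sample
    contains no non-empty value at all.
    """
    seen_value = False
    for value in values:
        if value is None or value.strip() == "":
            continue  # empty cells carry no type information
        seen_value = True
        try:
            float(value)
        except ValueError:
            return "string"
    return "double" if seen_value else "string"


# Hypothetical amount column: empty for the first 10,000 rows,
# filled in only on the most recent rows.
column = [""] * 10_000 + ["12.5", "0", "300.75", "41", "7.2"]

sample_type = infer_type(column[:10_000])  # only the head sample is inspected
full_type = infer_type(column)             # the whole column is inspected

print(sample_type, full_type)
```

In this sketch the head sample contains nothing but empty cells, so the guess falls back to string, while a full scan finds only numeric values.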

Tuong-Vi
Neuron

Thank you for your answer. To illustrate my use case, I have reproduced the issue with a mini flow:

flow.PNG

For the file data_source, there is no problem with "infer type schema" or with the set type functionality to force the data type:

capture.PNG

The source has the correct types (all amounts = double), but when I create a Prepare recipe, some amounts are automatically reinterpreted as string:

prepare.PNG

 

I have tried defaulting but it doesn't work. At this step, the only way is to change the type manually, column by column, which is not simple with 50, 100 or 300 amounts to manage...

This is a specific issue, but quite common in financial/accounting use cases, because null and 0 values can have distinct meanings. That's why a set type function in a Prepare recipe (as in the file source dataset) would be useful for mass actions on data types.

Maybe if I change the type of sample in the explorer, it will refresh the schema (but for me it's not 100% reliable)?

explorer.PNG

 

tgb417
Neuron

@Tuong-Vi 

As a Neuron, I suspect that you are on a recent version of DSS.

I have definitely seen some of the kinds of changes you are describing in visual recipes. Maybe it's not as bad as it was back when I started with DSS. (Or maybe I understand how to deal with schema problems a little bit better.)

Tuong-Vi
Neuron

Hello,

We're using V8.0.2 (but I have already seen this issue with V5). At Generali, the sources are Hive tables, so data formats are (theoretically) well managed before ingestion into DSS.

I did some tests on a local machine with a PostgreSQL table and a CSV file to be sure I could reproduce this case. It makes sense that the more complete the data is, the more stable the schemas stay across the DSS flow. I guess null/blank values have to be handled by users before running DSS flows.

I'm aware I can fix it quickly with a cast in an SQL recipe, but I'm still looking for a graphical way to deal with this, for our clickers/beginner end users 🙂
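For readers who are comfortable with code, the "inherit types from the input" idea can be sketched in plain Python. This is only an illustration under assumptions: schemas are represented as lists of `{"name", "type"}` dicts, the `inherit_types` helper and the column names are hypothetical, and nothing here calls DSS itself.

```python
from typing import Dict, List

Column = Dict[str, str]


def inherit_types(input_schema: List[Column], output_schema: List[Column]) -> List[Column]:
    """Return a copy of output_schema where every column that also exists
    in input_schema takes the input column's type (hypothetical helper).
    Columns created by the recipe itself keep their inferred type."""
    input_types = {col["name"]: col["type"] for col in input_schema}
    return [
        {**col, "type": input_types.get(col["name"], col["type"])}
        for col in output_schema
    ]


# Hypothetical schemas: amount_2 was reinterpreted as string in the output.
input_schema = [
    {"name": "policy_id", "type": "string"},
    {"name": "amount_1", "type": "double"},
    {"name": "amount_2", "type": "double"},
]
output_schema = [
    {"name": "policy_id", "type": "string"},
    {"name": "amount_1", "type": "double"},
    {"name": "amount_2", "type": "string"},      # mostly-null column, wrongly inferred
    {"name": "amount_total", "type": "double"},  # new column added by the recipe
]

fixed = inherit_types(input_schema, output_schema)
for col in fixed:
    print(col["name"], col["type"])
```

The design choice is to match columns by name, so renamed or newly created columns are left untouched rather than guessed at.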

(I have posted a similar issue there)

AshleyW
Dataiker
Status changed to: Needs Info

Hi, 

Thanks for the suggestion @Tuong-Vi; we've heard similar requests. If I've understood the thread, you're looking for a visual-friendly way to prevent the Prepare recipe from inferring data types, so that it always 'inherits' those types from the input dataset?

Ashley

AshleyW
Dataiker

Hi @Tuong-Vi , let me know if I've correctly understood your request. Thanks!