Prepare recipe parse format

Tuong-Vi
Tuong-Vi Partner, Dataiku DSS Core Designer, Dataiku DSS & SQL, Dataiku DSS ML Practitioner, Neuron 2020, Dataiku DSS Adv Designer, Registered, Neuron 2021, Neuron 2022 Posts: 33 Partner

Hello everyone,

One feature that would be very useful for me and many of my colleagues is the ability to set the type of many columns at once. Currently this is possible for meanings:

capture1.PNG

But this has no impact on the metadata of the output dataset. This is an issue for us because our sources have poor completeness, with many null values, so we often see this reinterpretation in a Prepare recipe (it seems that only the explorer sample, not the whole dataset, is used to create the schema):

Capture2.PNG

Don't hesitate to tell me if it's not clear, or if there is already a solution for this use case,

Thanks a lot,

1 vote

Released · Last Updated

Comments

  • tgb417
    tgb417 Dataiku DSS Core Designer, Dataiku DSS & SQL, Dataiku DSS ML Practitioner, Dataiku DSS Core Concepts, Neuron 2020, Neuron, Registered, Dataiku Frontrunner Awards 2021 Finalist, Neuron 2021, Neuron 2022, Frontrunner 2022 Finalist, Frontrunner 2022 Winner, Dataiku Frontrunner Awards 2021 Participant, Frontrunner 2022 Participant, Neuron 2023 Posts: 1,601 Neuron

    @Tuong-Vi

    I'm not sure I understand this.

    I recently discovered the feature that applies automatic schema types to all columns. (This saves me a lot of time.)

    InferType.jpg

    That said, this does have problems with high numbers of missing values. Sometimes the first 10,000-row sample does not contain enough data to find the correct type. Sometimes the data appears only in the most recent rows (not in the first 10,000 rows from the data source, perhaps only in the last 10,000).
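The sampling problem described above can be sketched in plain Python. This is only an illustration, not DSS's actual inference logic: the `infer_type` helper, the fallback to `string` when a sample has no values, and the example column are all assumptions made for the sketch.

```python
def infer_type(values):
    """Guess a column type from the non-empty values in a sample (illustrative only)."""
    non_empty = [v for v in values if v not in (None, "")]
    if not non_empty:
        # Assumption: with nothing to inspect, fall back to the most permissive type.
        return "string"
    try:
        for v in non_empty:
            float(v)
    except ValueError:
        return "string"
    return "double"


SAMPLE_SIZE = 10_000

# Hypothetical column: empty for the first 10,000 rows, amounts only in the latest rows.
column = [""] * SAMPLE_SIZE + ["12.5", "7.0", "3.25"]

sample_type = infer_type(column[:SAMPLE_SIZE])  # what a head sample sees
full_type = infer_type(column)                  # what a full scan would see
print(sample_type, full_type)
```

Under these assumptions, the head sample and the full column disagree on the type, which is the kind of mismatch discussed in this thread.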

  • Tuong-Vi

    Thank you for your answer. To illustrate my use case, I have reproduced the issue with a mini flow:

    flow.PNG

    For the file data_source, there is no problem with "infer type schema" or with the "set type" function to force the data type:

    capture.PNG

    The source has the correct format (all amounts are double), but when I create a Prepare recipe, some amounts are automatically reinterpreted as string:

    prepare.PNG

    I have tried defaulting, but it doesn't work. At this step, the only way is to change the type manually, column by column, which is not simple with 50, 100 or 300 amount columns to manage...

    This is a specific issue, but quite common in financial/accounting use cases, because null and 0 values can have distinct meanings. That's why a "set type" function in a Prepare recipe (as in a file source dataset) would be useful for changing data types in bulk.

    Maybe if I change the type of explorer, it will refresh the schema (but for me it's not 100% reliable)?

    explorer.PNG
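To make the "mass set type" request above concrete, here is a minimal Python sketch of the operation on a schema. The schema shape (a dictionary with a `columns` list of `name`/`type` entries), the `force_column_types` helper, and the column names are all hypothetical and chosen for illustration; this is not a DSS feature or API.

```python
import copy


def force_column_types(schema, column_names, new_type):
    """Return a copy of `schema` with `new_type` applied to the listed columns."""
    wanted = set(column_names)
    missing = wanted - {c["name"] for c in schema["columns"]}
    if missing:
        raise KeyError(f"columns not found in schema: {sorted(missing)}")
    updated = copy.deepcopy(schema)  # leave the caller's schema untouched
    for column in updated["columns"]:
        if column["name"] in wanted:
            column["type"] = new_type
    return updated


# Hypothetical schema where some amounts were reinterpreted as string.
schema = {"columns": [
    {"name": "contract_id", "type": "string"},
    {"name": "amount_1", "type": "string"},
    {"name": "amount_2", "type": "double"},
    {"name": "amount_3", "type": "string"},
]}

amount_columns = [c["name"] for c in schema["columns"] if c["name"].startswith("amount_")]
fixed = force_column_types(schema, amount_columns, "double")
print([(c["name"], c["type"]) for c in fixed["columns"]])
```

Selecting columns by a name prefix is just one way to pick the 50, 100 or 300 amount columns in a single pass.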

  • tgb417

    @Tuong-Vi

    As a Neuron, I suspect that you are on a recent version of DSS.

    I have definitely seen some of the kinds of changes you are calling out in visual recipes. Maybe not as bad as it was back when I started with DSS. (Or maybe I understand how to deal with Schema problems a little bit better.)

  • Tuong-Vi

    Hello,

    We're using V8.0.2 (but I had already seen this issue with V5). At Generali, the sources are Hive tables, so data formats are (theoretically) well managed before ingestion into DSS.

    I ran some tests on a local machine with a PostgreSQL table and a CSV file to be sure I could reproduce this case. It makes sense that the more complete the data is, the more stable schemas will stay across the DSS flow. I guess null/blank values have to be handled by users before running DSS flows.

    I'm aware I can fix it quickly with a cast in an SQL recipe, but I'm still looking for a graphical way to deal with this, for our clickers/beginner end users.

    (I have posted a similar issue there)
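For readers who do take the SQL cast route mentioned above, writing the casts by hand gets tedious with hundreds of amount columns. Here is a small Python sketch that generates the `SELECT` text. The `build_cast_select` helper, the table name and the column names are hypothetical; the default `DOUBLE PRECISION` type and the double-quoted identifiers assume PostgreSQL and would need adapting for Hive.

```python
def quote_identifier(name):
    """Quote an identifier with standard SQL double quotes."""
    return '"' + name.replace('"', '""') + '"'


def build_cast_select(table, columns, cast_columns, sql_type="DOUBLE PRECISION"):
    """Build a SELECT that casts `cast_columns` and passes the other columns through."""
    cast_set = set(cast_columns)
    select_items = []
    for name in columns:
        quoted = quote_identifier(name)
        if name in cast_set:
            select_items.append(f"CAST({quoted} AS {sql_type}) AS {quoted}")
        else:
            select_items.append(quoted)
    return "SELECT\n  " + ",\n  ".join(select_items) + f"\nFROM {quote_identifier(table)}"


# Hypothetical table and columns.
columns = ["contract_id", "amount_1", "amount_2"]
query = build_cast_select("data_source_prepared", columns, ["amount_1", "amount_2"])
print(query)
```

The generated text can then be pasted into an SQL recipe, so the column list only has to be maintained in one place.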

  • Ashley
    Ashley Dataiker, Alpha Tester, Dataiku DSS Core Designer, Registered, Product Ideas Manager Posts: 163 Dataiker

    Hi,

    Thanks for the suggestion @Tuong-Vi; we've heard similar requests. If I've understood the thread, you're looking for a visual-friendly way to prevent the Prepare recipe from inferring data types where you want it to always 'inherit' those types from the input dataset?

    Ashley

  • Ashley

    Hi @Tuong-Vi, let me know if I've correctly understood your request. Thanks!

  • ecathell
    ecathell Dataiku DSS Core Designer, Registered Posts: 2 ✭✭✭

    @AshleyW
    It would be great if Dataiku would respect our type and meaning choices through an entire pipeline. We've been quite frustrated at Dataiku's insistence on changing text fields to booleans, decimal fields to scientific notation, and the need to reparse dates every time they go through a Prepare recipe.

  • Aureltito
    Aureltito Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Registered Posts: 1

    Hi,

    I agree. I don't understand why Dataiku is changing the format from Double to String after a Prepare recipe. I checked and every row is filled with an integer.

    Prepare Recipe changing format.PNG

    So why is Dataiku changing this format? How can we avoid this behaviour?

    Best regards,

  • Katie
    Katie Dataiker, Registered, Product Ideas Manager Posts: 110 Dataiker

    Hello all!

    I wanted to share an exciting update we just released as part of V12 which should relieve this frustration.

    In all DSS versions prior to V12, the default behavior is to infer column types for all dataset formats (as is probably obvious from the conversation above).

    V12 has a new default behavior for all new Prepare recipes (existing recipes will not be changed), which is to infer data types for loosely-typed input dataset formats only (e.g. CSV) and lock them for strongly-typed ones (e.g. SQL, Parquet). This means that storage types will no longer change for SQL or Parquet datasets.

    We also now have an admin setting (Administration > Settings > Misc) in the UI to change this behavior if you so choose.

    See details in our reference docs & release notes.
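As a reading aid, the V12 default described above can be written as a tiny decision rule. This Python sketch only restates the rule using the formats named in this post; the function, the constant names and the error on other formats are assumptions for illustration, not DSS code.

```python
# Only the formats named in the post; other formats are deliberately not classified.
LOOSELY_TYPED_FORMATS = {"csv"}
STRONGLY_TYPED_FORMATS = {"sql", "parquet"}


def output_storage_type(input_format, input_type, inferred_type):
    """Pick the output storage type for a column under the rule described above."""
    fmt = input_format.lower()
    if fmt in STRONGLY_TYPED_FORMATS:
        return input_type      # locked: keep the type from the input dataset
    if fmt in LOOSELY_TYPED_FORMATS:
        return inferred_type   # inferred: use the type guessed from the data
    raise ValueError(f"format not covered by this sketch: {input_format}")


print(output_storage_type("SQL", "double", "string"))  # type is kept
print(output_storage_type("CSV", "string", "double"))  # type is inferred
```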

    Let me know if you have any questions!

    Katie
