DataIku messes up storage types in visual recipes.

Thomas_K Registered Posts: 15 ✭✭✭✭

Basically, I want DataIku to stop changing storage types just because it thinks it knows better than me. These people seem to have the same problem.

I have large tables with ~30 million rows. For some columns, the underlying column type is string: even though almost all of the rows (including all of those in the sample) are numeric, the definition in the database documentation is string. In rare cases, there is actually a letter in there, which crashes my recipe. I know that these columns contain strings, and I don't want DataIku to convert them to bigint.

How do I stop DataIku from doing this without manually changing the column type? I am looking for a per-project global setting, since this happens with basically every visual recipe I am using.

IMO, optimally, DataIku should never do this by itself - it can't know what the table will hold in the future, and any users who have no idea about data storage types will be confused as to what is wrong with their recipe. Instead, it should suggest the change to the user with a nice explanation and let them approve it manually. It's better to waste a bit of storage space and compute power than to create potentially hard-to-detect problems by secretly converting data types.

Answers

  • AdrienL
    AdrienL Dataiker, Alpha Tester Posts: 196 Dataiker
    In the visual preparation, you can have arbitrary manipulation of data (search & replace, formula, python code…), which is why DSS has to do type inference. Other visual recipes can compute the actual schema based on the resulting type of what is configured in the recipe, but there is no simple solution for visual preparation.

    In the column view, you also have mass actions on columns, including setting the column type. You have the same kind of tool in the dataset's schema screen. That is admittedly manual, but faster than doing it column by column.

    For an automated solution, using the public API or the internal python API, you can make a simple script that sets the string type for all columns of a given dataset, and package it in a macro for example. Then when you edit your visual preparation recipe, if it warns you that the output schema is not the same as the inferred schema, you can click Ignore so that it doesn't override the output dataset's schema, or re-run your macro afterwards before running the recipe.
  • Katie
    Katie Dataiker, Registered, Product Ideas Manager Posts: 110 Dataiker

    Hello all future readers of this post!

    I wanted to share an exciting update we just released as part of V12 which should help with this frustration.

    In all DSS versions prior to V12, the default behavior is to infer column types for all dataset formats. V12 has a new default behavior for all new prepare recipes (existing recipes will not be changed), which is to infer data types for loosely-typed input dataset formats (e.g. CSV) and lock for strongly-typed ones (e.g. SQL, Parquet). We also now have an admin setting (Administration > Settings > Misc) in the UI to change this behavior if you so choose.

    See details in our reference docs & release notes.

    Katie
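
The scripted approach suggested in AdrienL's answer (set every column of a dataset to string through the API) could look roughly like the sketch below. It is an illustration under assumptions, not an official Dataiku snippet: the helper names `force_string_columns` and `apply_to_dataset`, the `only` parameter, and the host, API key, project key and dataset name you pass in are all hypothetical. It also assumes the schema returned by the public `dataikuapi` client is a dict with a `columns` list of `{"name", "type"}` entries.

```python
import copy


def force_string_columns(schema, only=None):
    """Return a copy of a DSS-style schema dict with column types set to 'string'.

    `only` is an optional collection of column names (hypothetical parameter);
    when it is None, every column is switched to string.
    """
    result = copy.deepcopy(schema)  # leave the caller's schema untouched
    for column in result.get("columns", []):
        if only is None or column["name"] in only:
            column["type"] = "string"
    return result


def apply_to_dataset(host, api_key, project_key, dataset_name, only=None):
    """Read a dataset's schema via the public API, force string types, write it back.

    Assumes the `dataikuapi` package is installed and the API key may edit the dataset.
    """
    # Imported lazily so force_string_columns can be used without a DSS instance.
    import dataikuapi

    client = dataikuapi.DSSClient(host, api_key)
    dataset = client.get_project(project_key).get_dataset(dataset_name)
    dataset.set_schema(force_string_columns(dataset.get_schema(), only))
```

Wrapped in a macro or run from a notebook, something like `apply_to_dataset(host, api_key, "MYPROJECT", "my_output_dataset")` would be re-run after editing the prepare recipe, as described in the answer. Passing `only={"customer_code"}` restricts the change to the columns known to be mostly numeric strings.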
