Schema changes from double to integer all the time.

Wuser92
Wuser92 Registered Posts: 20 ✭✭✭✭
I have a column with round floats (e.g. only 1.0, 100.0, etc.) and manually defined it as type "double" in a visual recipe. However, once the dataset is loaded into another visual recipe, the visual recipe automatically changes all the column types back to "integer" causing a failure everytime there appear a non-round float (e.g. 1.5).

How can one avoid that (visual) recipes automatically cast numbers to the format they think is appropriate, even though they don't know what data might run through them later on?

Answers

  • Clément_Stenac
    Clément_Stenac Dataiker, Dataiku DSS Core Designer, Registered Posts: 753 Dataiker
    Hi,

    This behavior cannot be disabled. However, if you have "1.0" it will detect as double, not integer - so I guess you actually have "1". In that case, you can go in the dataset settigns and make sure the "normalize doubles" parameter is set in the format params. This will ensure that "1" will appear as a proper "1.0" double and won't be integerified.
  • Wuser92
    Wuser92 Registered Posts: 20 ✭✭✭✭
    Thanks! How can one avoid that if you e.g. copy a string column that only contains numbers, that the copied column is also of type string and not casted to integer or float?
  • batchmeister
    batchmeister Registered Posts: 3 ✭✭✭✭
    I would also like to know the answer to this.
  • Damo
    Damo Partner, Dataiku DSS Core Designer, Dataiku DSS & SQL, Dataiku DSS Core Concepts, Dataiku DSS Adv Designer, Registered Posts: 9 Partner

    I'm also facing the same issue. any suggestions?

    E.g. Copy a string column that only contains numbers, that the copied column is also of type string and not casted to integer or float?

    For the time being, I've converted the field to string and concatenated required number of values (omitting .0 ) in the python recipe and loaded to output dataset.

  • Wido
    Wido Registered Posts: 2 ✭✭✭

    I'm also facing this issue in a python recipe. Although the column is typed as string (Object) and I defined the output schema to be string: Text, after running the recipe this column gets converted into numbers.

    The business issue here is that this column contains zero-padded values which we need to keep.

    So how can one define and keep the output format as defined?

  • Jurre
    Jurre Dataiku DSS Core Designer, Dataiku DSS & SQL, Dataiku DSS Core Concepts, Registered, Dataiku DSS Developer, Neuron 2022 Posts: 114 ✭✭✭✭✭✭✭

    Welcome @Wido
    !

    I have the same business issue with those zero-padded values, but have not encountered problems with keeping them intact. Different is that i don't (yet) use python recipes when handling those, so hopefully someone else can help you with that. For a visual-oriented solution there is an example below.

    Example : copying a zero-padded value to a new column . I use a visual prepare step, the formula-option and simply state strval("column_name") , the value is copies 'as is' to the new column including preceeding zero's.. When using val("column_name"), numval("column_name") or just column_name the padding gets stripped.

    As my coworkers have an Excel-mania i do add a dummy-prefixletter to those values before exporting data out of dataiku, to ensure that excel sees those values as a stringvalue and won't start stripping them.

  • tgb417
    tgb417 Dataiku DSS Core Designer, Dataiku DSS & SQL, Dataiku DSS ML Practitioner, Dataiku DSS Core Concepts, Neuron 2020, Neuron, Registered, Dataiku Frontrunner Awards 2021 Finalist, Neuron 2021, Neuron 2022, Frontrunner 2022 Finalist, Frontrunner 2022 Winner, Dataiku Frontrunner Awards 2021 Participant, Frontrunner 2022 Participant, Neuron 2023 Posts: 1,598 Neuron

    @Jurre
    ,

    I’ve not tried this. However, I was thinking about your export to excel question below. Rather than using a dummy letter have you thought of pretending a single quote. to the 0 prefixed number. Something like:

    Postal code
    ‘08904
    ‘07204


    MS Excel sees the single quote as a prefix that means that the following is text and should not be seen as a number. I don’t know if this will work from within Dataiku. But it might be worth a try.

  • Jurre
    Jurre Dataiku DSS Core Designer, Dataiku DSS & SQL, Dataiku DSS Core Concepts, Registered, Dataiku DSS Developer, Neuron 2022 Posts: 114 ✭✭✭✭✭✭✭

    Your suggestion is great, surely preferable over what i do with those prefixes now @tgb417
    Tom, i didn't know excel could be forced in this way. Sadly i can't use it : this prefixletter is a fixed and longstanding procedural thing to ensure everybody involved has a crystal clear picture of what a certain value represents. A dirty solution for a "between display and chair"-challenge so to speak.

  • Wido
    Wido Registered Posts: 2 ✭✭✭

    Thanks for your thoughts and suggestions. I have found a solution, that works for me.

    The template code for python recipe is not very helping here. The input and output schemas get overwritten using that code. What you need to do:

    input_df = dataset.get_dataframe(infer_with_pandas=False)

    By default pandas infers the schema and overwrites what you have defined as input.

    For writing the dataframe the template suggests to use
    new_dataset.write_with_schema(my_dataframe)

    However the following will use the schema as defined:
    new_dataset.write_dataframe(output_df)

    To be honest the documentation is very poor at this point. A lot if talk around the topic but no clear API specification. But it solves my issue now. Hopefully it can help others as well.

  • Diwei
    Diwei Registered Posts: 5 ✭✭✭✭

    Hi TOM, ' will remain in the cells in EXCEL, click each cell then enter, it will recognize it as TEXT, then remove '. Howeve, not possible to click and enter in each cell.

  • Diwei
    Diwei Registered Posts: 5 ✭✭✭✭
    edited July 17

    Could u please explain how you did that?, i have the error message. Thanks

    test = dataiku.Dataset("test")
    test.write_dataframe(test_df)

    Exception: An error occurred during dataset write (d2nJvJdtaK): RuntimeException: Forbidden state transition : ERROR -> STREAMING

  • tgb417
    tgb417 Dataiku DSS Core Designer, Dataiku DSS & SQL, Dataiku DSS ML Practitioner, Dataiku DSS Core Concepts, Neuron 2020, Neuron, Registered, Dataiku Frontrunner Awards 2021 Finalist, Neuron 2021, Neuron 2022, Frontrunner 2022 Finalist, Frontrunner 2022 Winner, Dataiku Frontrunner Awards 2021 Participant, Frontrunner 2022 Participant, Neuron 2023 Posts: 1,598 Neuron

    I've run into the same problem V11.0.3.

    The dku.write_dataframe()

    Is still not keeping my variables stable.

    Right now I'm finding my variables changed back to strings. And then in a subsequent step when I try to do a join my types don't match.

    The underlying data source is a PostgreSQL database.

  • tgb417
    tgb417 Dataiku DSS Core Designer, Dataiku DSS & SQL, Dataiku DSS ML Practitioner, Dataiku DSS Core Concepts, Neuron 2020, Neuron, Registered, Dataiku Frontrunner Awards 2021 Finalist, Neuron 2021, Neuron 2022, Frontrunner 2022 Finalist, Frontrunner 2022 Winner, Dataiku Frontrunner Awards 2021 Participant, Frontrunner 2022 Participant, Neuron 2023 Posts: 1,598 Neuron

    I've discovered a similar problem.

    I'm using the API connect plugin run by a scenario. Sometime the plugin can not connect to the remote API. When this happens from time to time my column with the integer API response code is converted to Float rather than the typical Int. Causing the Recipe to fail. Not fully clear why this is occurring. But this is another example where uncontrolled "duck typing" is causing problems.

    cc: @AlexB

  • Katie
    Katie Dataiker, Registered, Product Ideas Manager Posts: 106 Dataiker

    Hello all past & future readers of this post,

    I wanted to share with you all an exciting update we just released as part of V12 which should help with this frustration.

    In all DSS versions prior to V12, the default behavior is to infer column types for all dataset formats. V12 has a new default behavior for all new prepare recipes (existing recipes will not be changed), which is to infer data types for loosely-typed input dataset formats (e.g. CSV) and lock for strongly-typed ones (e.g. SQL, Parquet). We also now have an admin setting (Administration > Settings > Misc) in the UI to change this behavior if you so choose.

    See detail in our reference docs & release notes.

    Let me know if you have any questions!

    Katie

  • Noah
    Noah Registered Posts: 43 ✭✭✭✭

    how do I access this normalize doubles feature?

Setup Info
    Tags
      Help me…