Change of storage types after building a Python recipe

Jesus · Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer · Registered Posts: 8 ✭✭✭✭

Today I finished a Python recipe I built to filter some data by different characteristics. After I ran it, in the newly created dataset all columns with numerical values had changed from int to double, adding a decimal (.0) to every value. The problem doesn't end there: after feeding this new dataset into previously built recipes, I receive the error "OUTPUT_DATA_BAD_INT", and after running those recipes, some of the columns (the numerical columns whose storage type changed) lose all their values. I cannot change the storage type of the columns after running my Python recipe; it seems to be fixed and gives errors no matter how I try to change it.

Best Answer

  • HarizoR · Dataiker, Alpha Tester · Registered Posts: 138
    edited July 2024 · Answer ✓

    Hi,

    Thanks for the added context, especially the part about missing values, which is actually crucial here.

    The undesired int -> float conversion is actually performed by pandas itself, not by Dataiku. More specifically, when pandas encounters a numerical column that contains missing values, it automatically treats it as float (see the pandas documentation for more details).
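
    As a quick illustration of this pandas behaviour (a minimal sketch; the column name is just an example), an integer column silently becomes float64 as soon as a missing value appears:

    import pandas as pd
    import numpy as np

    # Without missing values, pandas keeps the integer dtype.
    df = pd.DataFrame({"count": [1, 2, 3]})
    print(df["count"].dtype)  # int64

    # One missing value is enough to force the whole column to float64,
    # which is why values like 1 are displayed as 1.0.
    df_nan = pd.DataFrame({"count": [1, 2, np.nan]})
    print(df_nan["count"].dtype)  # float64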

    To circumvent that in DSS, you can add a few arguments to the get_dataframe() method used to read data from your input Dataset:

    input_df = input_dataset.get_dataframe(infer_with_pandas=False, bool_as_str=True)
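
    In context, the read step of the recipe could look like this (a sketch; "my_input_dataset" is a placeholder for your actual input dataset name):

    import dataiku

    input_dataset = dataiku.Dataset("my_input_dataset")

    # infer_with_pandas=False builds the DataFrame from the dataset's own
    # schema instead of letting pandas guess the dtypes; bool_as_str=True
    # reads boolean columns as strings so they are not coerced either.
    input_df = input_dataset.get_dataframe(infer_with_pandas=False, bool_as_str=True)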

    Hope this helps!

    Best,

    Harizo

Answers

  • HarizoR · Dataiker, Alpha Tester · Registered Posts: 138

    Hi Jesus,

    To fix your Flow, we need to start by investigating the Python recipe.

    By default, the schema of the output Dataset is defined by the schema of the pandas DataFrame you write to it; that's what the write_with_schema() method of the dataiku.Dataset class does. If you suspect that at some point the type of your integer columns was changed to double-precision floats, I'd recommend inspecting the output DataFrame, e.g. by switching to the notebook editing mode of your recipe and looking at the output of your_dataframe.dtypes. From there, you should make sure that all columns that need to remain integers are marked as int64, as in the sketch below.
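
    For instance (a minimal sketch; the DataFrame and column names are placeholders), the inspection and the cast back to integers could look like this:

    # Inspect the dtypes of the DataFrame you are about to write.
    print(output_df.dtypes)

    # Cast any column that should stay integer back to int64.
    # Note: this only works once the column contains no missing values.
    output_df["my_int_column"] = output_df["my_int_column"].astype("int64")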

    Once you have validated that, you should also set the dropAndCreate argument of the write_with_schema() method to True so that the faulty output schema of the Dataset is cleared before the build (see the snippet below).
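
    The corresponding write step could look like this (a sketch; "my_output_dataset" is a placeholder name):

    import dataiku

    output_dataset = dataiku.Dataset("my_output_dataset")

    # dropAndCreate=True drops the existing (faulty) schema and recreates
    # it from the DataFrame's dtypes before writing.
    output_dataset.write_with_schema(output_df, dropAndCreate=True)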

    From there, you should be able to re-build the downstream parts of your Flow without issues.

    Hope this helps,

    Best,

    Harizo

  • Jesus · Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer · Registered Posts: 8 ✭✭✭✭

    Thank you for the help, and sorry for not being specific or precise in my question. The problem was that after my Python recipe finished, the class of my integer columns had changed to float (without my having done so, and without those columns even being used anywhere in the code). I don't know why this happens, because the problem persists in other Python recipes I have designed, and it results in data storage failures and, later, in the loss of all data in those columns due to inconsistencies in the Flow. I was able to solve it by converting all the numerical columns back to integer (after first filling all the NaN gaps with numerical values, which is a bit tedious if you have a lot of them; luckily that is not my case). Is there any other solution to avoid this problem? I think it is some internal procedure of Dataiku that converts numerical columns with NaN into double.
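
    For reference, the workaround described above looks roughly like this (the column name is a placeholder):

    # Fill the missing values with a sentinel number first, then cast the
    # column back to integers so the .0 decimals disappear.
    df["my_int_column"] = df["my_int_column"].fillna(0).astype("int64")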

    Thank you for your help.
