
Change of storage types after building a python recipe

Solved!
Jesus
Level 2
Today I finished a Python recipe that I built to filter some data by different characteristics. After I ran it, all the numerical columns in the newly created dataset had changed from int to double, with a decimal (.0) added to every value. The problem doesn't end there: when I used this new dataset in previously built recipes, I received the error "OUTPUT_DATA_BAD_INT", and after running those recipes some of the columns lost all their values (the numerical columns whose storage type had changed). I cannot change the storage type of the columns after running my Python recipe; it seems to be fixed, and I get errors no matter how I try to change it.

HarizoR
Dataiker

Hi Jesus,

To fix your Flow, we need to start by investigating the Python recipe.

By default, the schema of its output Dataset is defined by the schema of the pandas DataFrame you write to it. That's what the write_with_schema() method of the dataiku.Dataset class does. If you suspect that at some point the type of your integer columns may have been changed to double-precision floats, I'd recommend inspecting the output DataFrame, e.g. by switching to the notebook editing mode of your recipe and looking at the output of your_dataframe.dtypes. From there you should make sure that all columns that need to remain in the integer format are marked as int64.
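As a rough illustration of that dtype check, here is a minimal pandas-only sketch. The helper name integer_like_float_columns and the sample DataFrame are hypothetical stand-ins for your recipe's real output, and the cast at the end assumes the flagged columns contain no missing values.

```python
import pandas as pd


def integer_like_float_columns(df):
    """Return float64 columns that hold only whole numbers and no missing values."""
    suspects = []
    for name in df.select_dtypes(include="float64").columns:
        column = df[name]
        # NaN % 1 is NaN, so a column with gaps never passes this check.
        if column.notna().all() and (column % 1 == 0).all():
            suspects.append(name)
    return suspects


# Hypothetical stand-in for the DataFrame your recipe is about to write.
output_df = pd.DataFrame({
    "customer_id": [1.0, 2.0, 3.0],   # integers that were upcast to float
    "quantity": [1.0, None, 3.0],     # has a gap, so it cannot become int64
    "price": [9.99, 12.5, 3.0],       # genuinely fractional
    "label": ["a", "b", "c"],
})

print(output_df.dtypes)

suspects = integer_like_float_columns(output_df)
print(suspects)

# Cast only the safe columns back to int64 before writing the dataset.
output_df = output_df.astype({name: "int64" for name in suspects})
print(output_df.dtypes)
```

Columns such as quantity above are left alone on purpose: they still contain missing values, so they need a decision (fill, drop, or keep as float) before any integer cast.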

Once you have validated that assertion, you should also set the dropAndCreate argument of the write_with_schema() method to True so that the faulty Dataset's output schema can be cleared before the build.

From there, you should be able to re-build the downstream parts of your Flow without issues.

Hope this helps,

Best,

Harizo

Jesus
Level 2
Author

Thank you for the help, and sorry for not being specific or precise with my question. The problem was that by the end of my recipe's Python code, the type of my integer columns had changed to float, without my having done so and without those columns being used anywhere in the code. I don't know why this happens, and the problem also occurs in other Python recipes I have designed. It results in data storage failures and, later, in the loss of all data in those columns due to inconsistencies in the Flow. I was able to solve it by converting all the numerical columns to integer, after first filling all the NaN gaps with numerical values, which is a bit tedious if you have a lot of them (luckily not my case). Is there any other solution to avoid this problem? I think it is more likely some internal Dataiku procedure that converts numerical columns with NaN into double.

Thank you for your help
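For readers who want to reproduce the fill-then-cast workaround described above, a minimal pandas sketch might look like the following. The helper name fill_and_cast_to_int, the sample data, and the -1 sentinel are all illustrative assumptions, not code from the original recipe.

```python
import pandas as pd


def fill_and_cast_to_int(df, columns, fill_value=0):
    """Fill gaps in the given columns with fill_value, then cast them to int64."""
    fills = {name: fill_value for name in columns}
    casts = {name: "int64" for name in columns}
    # fillna with a dict only touches the listed columns.
    return df.fillna(fills).astype(casts)


# Hypothetical data: "quantity" was upcast to float because of one missing value.
df = pd.DataFrame({
    "quantity": [1.0, None, 3.0],
    "note": ["x", None, "z"],
})

fixed = fill_and_cast_to_int(df, ["quantity"], fill_value=-1)
print(fixed.dtypes)
print(fixed)
```

Keep in mind that the fill value becomes real data downstream, so a sentinel such as -1 only makes sense if it cannot be confused with a legitimate value.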

HarizoR
Dataiker

Hi,

Thanks for the added context, especially the missing values part which is actually crucial here 🙂

The undesired int -> float conversion is actually performed by pandas itself and not by Dataiku. More specifically, when pandas encounters a numerical column with missing values, it automatically treats it as float, because a missing value is stored as NaN, which is a float (see the pandas documentation for more details).
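The pandas behaviour described above can be reproduced outside DSS with a few lines. This is only a sketch with made-up values; the nullable "Int64" dtype shown at the end is a pandas feature, and whether it round-trips cleanly through write_with_schema() in your DSS version is an assumption you would need to verify.

```python
import pandas as pd

# A complete integer column keeps an integer dtype.
complete = pd.Series([1, 2, 3])
print(complete.dtype)

# One missing value is enough for pandas to upcast the column to float64.
with_gap = pd.Series([1, None, 3])
print(with_gap.dtype)
print(with_gap.tolist())

# pandas' nullable integer dtype (capital "I") keeps whole numbers
# and represents the gap as <NA> instead of NaN.
nullable = with_gap.astype("Int64")
print(nullable.dtype)
print(nullable)
```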

To circumvent that in DSS, you can add a few arguments to the get_dataframe() method used to read data from your input Dataset:

input_df = input_dataset.get_dataframe(infer_with_pandas=False, bool_as_str=True)

Hope this helps!

Best,

Harizo