get_dataset loading strings as floats
I have a dataset with US Zip Codes in it, which are obviously very similar to integers. I need to do some processing in python on them, and have built a notebook to do so. However when I call:
my_dataset= dataiku.Dataset("my_dataset")
my_dataset_df = my_dataset_df.get_dataframe()
I find that sometimes my Zip Codes get interpreted as floats, and if I iterate over them (e.g. trying to find the count of each value) I get both 81819 and 81819.0 in my list, which is not desired.
I tried:
my_dataset= dataiku.Dataset("my_dataset")
my_dataset_df = my_dataset_df.get_dataframe(infer_with_pandas=False)
This would force it to use the Dataiku column types; however, I have some other integer columns in my data with NULL values. I could put a dummy value in them (e.g. 0 or -1), but I would prefer not to: the NULL value accurately reflects that I don't know what that value is (a duration between two events, where in some cases the second event has not happened yet).
I checked the docs for a way to force the type of a column on load. I think I could use iter_dataframes_forced_types to read in the column with a forced type, but then I don't see an easy way to write the updated column back.
Answers
Turribeach
Hi, please post code snippets using code blocks (the </> icon) so that the code is properly formatted and can be copied and pasted. As you know, Python enforces strict indentation, so without a code block the indentation is lost, which makes it harder for people trying to reproduce your issue to use your snippets. Also, please make sure your code actually works; both of your code snippets are incorrect:
my_dataset = dataiku.Dataset("my_dataset")
my_dataset_df = my_dataset.get_dataframe()
In your snippet you used my_dataset_df.get_dataframe(), which doesn't exist. The correct call is my_dataset.get_dataframe().
Now to your question. Irrespective of whether you decide to use infer_with_pandas=False or not, the key to your problem is to make sure that you:
- Use drop_and_create=True in your write method call, i.e. ds_output.write_with_schema(df_input, True), to make sure the output dataset's schema can be cleared before the build
- Set the relevant data types in your dataframe and deal with any nulls as required before writing (see the sketch below)
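For the zip code column specifically, a minimal sketch of those two points could look like the following (the dataset and column names are illustrative, not taken from the original post):

import dataiku

ds_input = dataiku.Dataset("my_dataset")
ds_output = dataiku.Dataset("my_dataset_prepared")  # illustrative output dataset

df = ds_input.get_dataframe()

# Zip codes are labels, not numbers: turn the float-inferred values back into
# 5-character strings (zfill restores any leading zeros lost during inference).
# This assumes the zip column has no missing values; adapt if it does.
df["zip_code"] = df["zip_code"].astype(int).astype(str).str.zfill(5)

# The second argument (True) is drop_and_create, so the output schema is
# cleared and rebuilt from the dataframe's dtypes.
ds_output.write_with_schema(df, True)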
With regard to nulls in integer columns, since version 0.24 pandas has gained the ability to hold integer dtypes with missing values: https://stackoverflow.com/a/54194908/10491951
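As a small illustration in a recent pandas version (the column name duration is an assumption):

import numpy as np
import pandas as pd

# A numeric column containing missing values is read back as float64 by default...
duration = pd.Series([5.0, np.nan, 12.0])
print(duration.dtype)          # float64

# ...but it can be converted to the nullable integer dtype, so the NULL stays
# a missing value instead of being replaced with a dummy like 0 or -1.
duration_int = duration.astype("Int64")
print(duration_int.dtype)      # Int64
print(duration_int.tolist())   # [5, <NA>, 12]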
Here is more information on how to write just the schema or just the data to the output dataset, in case you want to do things in a different way:
https://doc.dataiku.com/dss/latest/code_recipes/python.html#writing-the-output-schema
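A rough sketch of that route, based on the write_schema_from_dataframe and get_writer/write_dataframe calls of the dataiku Dataset API (dataset names are again illustrative):

import dataiku

df = dataiku.Dataset("my_dataset").get_dataframe()
ds_output = dataiku.Dataset("my_dataset_prepared")

# Write only the schema, derived from the dataframe's columns and dtypes.
ds_output.write_schema_from_dataframe(df)

# Then write the rows separately through a writer.
with ds_output.get_writer() as writer:
    writer.write_dataframe(df)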
PS: The undesired string -> float conversion is actually performed by pandas itself and not Dataiku.
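A quick way to see this, outside Dataiku entirely (column names are illustrative):

import io
import pandas as pd

# Plain CSV read, no Dataiku involved: pandas' type inference does the conversion.
csv = "zip_code,duration\n81819,5\n02134,\n,7\n"
df = pd.read_csv(io.StringIO(csv))

print(df.dtypes)
# zip_code    float64   <- one missing zip upcasts the column, so 81819 prints as 81819.0
# duration    float64   <- the same upcast hits the integer duration column
print(df["zip_code"].tolist())   # [81819.0, 2134.0, nan] -- the leading zero of 02134 is also gone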