Hello everyone,
I am encountering a problem with the get_dataframe() method when using a Python recipe.
Using:
dataset = dataiku.Dataset("my_dataset")
dataset_df = dataset.get_dataframe()
works as intended: since infer_with_pandas is True by default, the engine uses the types detected by pandas rather than the dataset schema defined in DSS.
However, I need it to use the dataset schema instead.
So I set infer_with_pandas = False, but it raises an error:
ValueError: Integer column has NA values in column
Since I want to write a function that can be reused across several datasets for automation, I can't handle the columns one by one for each dataset.
I think forcing all the columns of my dataset to string would solve this problem, but I wasn't able to find a way to do it.
If anyone can help, it would be very appreciated.
Thanks
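For reference, the error can be reproduced with plain pandas, independent of Dataiku (the column names below are made up for illustration):

```python
import io
import pandas as pd

csv_data = "id,score\n1,10\n2,\n3,30\n"  # "score" has a missing value

# Forcing a plain numpy integer dtype on a column containing NAs fails:
try:
    pd.read_csv(io.StringIO(csv_data), dtype={"score": "int64"})
except ValueError as exc:
    print(exc)  # "Integer column has NA values in column 1"

# Pandas' nullable Int64 extension dtype accepts missing values:
df = pd.read_csv(io.StringIO(csv_data), dtype={"score": "Int64"})
print(df["score"].isna().sum())  # 1 missing value, column stays integer
```

This is why the recipe fails: with infer_with_pandas = False, an integer column with empty cells cannot be represented as a numpy int64.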
Hi,
What is the data type of the column you mentioned in the input dataset of the recipe? Is it integer? If so, you need to manually change the type in the Explore tab of this dataset and then use infer_with_pandas = False inside the recipe. You should also check whether the data type was changed in earlier steps of your pipeline, possibly by other recipes. I hope that helps!
-Konstantina
Hello, thanks for your answer.
It is indeed an integer, and I did use infer_with_pandas = False in my Python recipe.
But as I said, I need to automate this; any manual action would not fit my need.
Is this column created/renamed through a visual recipe at some point in your flow? You might find this article useful.
In some projects it is; in others it's not!
Since I need something replicable everywhere, I'm looking for a global solution.
Thanks for the link!
I agree that Dataiku needs better support for creating pandas dataframes from datasets. A simple fix on Dataiku's side could be to enable the use of pd.convert_dtypes (i.e. the nullable extension dtype columns) via a checkbox or a parameter of the dataiku.Dataset.get_dataframe method.
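To illustrate what convert_dtypes would buy here, in plain pandas (the column names are invented for the example):

```python
import numpy as np
import pandas as pd

# A column with a missing value is inferred as float64, and text as object:
df = pd.DataFrame({"count": [1.0, np.nan, 3.0], "label": ["a", None, "b"]})
print(df.dtypes)  # count: float64, label: object

# convert_dtypes switches to the nullable extension dtypes,
# so integer columns can keep their missing values:
converted = df.convert_dtypes()
print(converted.dtypes)  # count: Int64, label: string
```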
A workaround for you might be to call dataiku.Dataset.read_schema first and then make the appropriate nullable data type conversions.
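A sketch of that workaround, assuming read_schema returns a list of {"name", "type"} dicts (the DSS-to-pandas mapping and the cast_with_schema helper are my own invention, to be adapted to your datasets):

```python
import pandas as pd

# Hypothetical mapping from DSS storage types to pandas nullable dtypes;
# extend it as needed for your schemas.
DSS_TO_PANDAS = {
    "tinyint": "Int8",
    "smallint": "Int16",
    "int": "Int32",
    "bigint": "Int64",
    "float": "Float32",
    "double": "Float64",
    "boolean": "boolean",
    "string": "string",
}

def cast_with_schema(df, schema):
    """Cast df columns to nullable pandas dtypes, driven by a DSS-style
    schema: a list of {"name": ..., "type": ...} dicts."""
    dtypes = {
        col["name"]: DSS_TO_PANDAS[col["type"]]
        for col in schema
        if col["name"] in df.columns and col["type"] in DSS_TO_PANDAS
    }
    return df.astype(dtypes)

# In a recipe this would look roughly like:
#   dataset = dataiku.Dataset("my_dataset")
#   df = cast_with_schema(dataset.get_dataframe(), dataset.read_schema())
# Standalone illustration with a fake schema and dataframe:
schema = [{"name": "id", "type": "bigint"}, {"name": "score", "type": "double"}]
df = pd.DataFrame({"id": [1.0, None, 3.0], "score": [0.5, 1.5, None]})
out = cast_with_schema(df, schema)
print(out.dtypes)  # id: Int64, score: Float64
```

Because the target dtypes are nullable, the cast succeeds even when integer columns contain missing values, which is exactly the case that breaks infer_with_pandas = False.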
Thanks for your answer!
I totally agree with you; a convert_dtypes option would be handy.
I tried read_schema but couldn't find a way to automate the conversion of the types that don't match my expected output.