Hello everyone,
I am encountering a problem with the get_dataframe() method when using a Python recipe.
Using:
dataset = dataiku.Dataset("my_dataset")
dataset_df = dataset.get_dataframe()
works as intended: since infer_with_pandas is True by default, the engine uses the types detected by pandas rather than the dataset schema defined in DSS.
However, I need it to use the dataset schema instead.
So I set infer_with_pandas = False, but it raises an error:
ValueError: Integer column has NA values in column
Since I want to write a function that can be reused across several datasets for automation, I can't handle the columns one by one for each dataset.
I think forcing all the columns of my dataset to string would solve this problem, but I wasn't able to find a way to do it.
If anyone can help, it would be very appreciated.
Thanks
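For reference, the error can be reproduced with plain pandas, independent of Dataiku (the column names below are made up for illustration):

```python
import io
import pandas as pd

csv_data = "id,score\n1,10\n2,\n3,30\n"  # "score" has a missing value

# Forcing a plain numpy integer dtype on a column containing NAs fails:
try:
    pd.read_csv(io.StringIO(csv_data), dtype={"score": "int64"})
except ValueError as exc:
    print(exc)  # "Integer column has NA values in column 1"

# Pandas' nullable Int64 extension dtype accepts missing values:
df = pd.read_csv(io.StringIO(csv_data), dtype={"score": "Int64"})
print(df["score"].isna().sum())  # 1 missing value, column stays integer
```

This is why the recipe fails: with infer_with_pandas = False, an integer column with empty cells cannot be represented as a numpy int64.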
Hi,
What is the data type of the column you mentioned in the input dataset of the recipe? Is it integer? If so, you need to manually change the type in the Explore tab of this dataset and then use infer_with_pandas = False inside the recipe. You should also check whether the data type was changed in earlier steps of your pipeline, possibly by other recipes. I hope that helps!
-Konstantina
Hello, thanks for your answer.
It is indeed an integer, and I did use infer_with_pandas = False in my Python recipe.
But as I said, I need to automate this; any manual action would not fit my need.
Is this column created/renamed through a visual recipe at some point in your flow? You might find this article useful.
In some projects it is; in others it's not!
Since I need something replicable everywhere, I'm looking for a global solution.
Thanks for the link!
I agree that Dataiku needs better support for creating pandas dataframes from datasets. A simple fix on Dataiku's side could be to enable the use of pd.convert_dtypes (i.e. the nullable extension dtype columns) via a checkbox or a parameter of the dataiku.Dataset.get_dataframe method.
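To illustrate what convert_dtypes would buy here, in plain pandas (the column names are invented for the example):

```python
import numpy as np
import pandas as pd

# A column with a missing value is inferred as float64, and text as object:
df = pd.DataFrame({"count": [1.0, np.nan, 3.0], "label": ["a", None, "b"]})
print(df.dtypes)  # count: float64, label: object

# convert_dtypes switches to the nullable extension dtypes,
# so integer columns can keep their missing values:
converted = df.convert_dtypes()
print(converted.dtypes)  # count: Int64, label: string
```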
A workaround for you might be to call dataiku.Dataset.read_schema first and then make the appropriate nullable data type conversions.
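A sketch of that workaround, assuming read_schema returns a list of {"name", "type"} dicts (the DSS-to-pandas mapping and the cast_with_schema helper are my own invention, to be adapted to your datasets):

```python
import pandas as pd

# Hypothetical mapping from DSS storage types to pandas nullable dtypes;
# extend it as needed for your schemas.
DSS_TO_PANDAS = {
    "tinyint": "Int8",
    "smallint": "Int16",
    "int": "Int32",
    "bigint": "Int64",
    "float": "Float32",
    "double": "Float64",
    "boolean": "boolean",
    "string": "string",
}

def cast_with_schema(df, schema):
    """Cast df columns to nullable pandas dtypes, driven by a DSS-style
    schema: a list of {"name": ..., "type": ...} dicts."""
    dtypes = {
        col["name"]: DSS_TO_PANDAS[col["type"]]
        for col in schema
        if col["name"] in df.columns and col["type"] in DSS_TO_PANDAS
    }
    return df.astype(dtypes)

# In a recipe this would look roughly like:
#   dataset = dataiku.Dataset("my_dataset")
#   df = cast_with_schema(dataset.get_dataframe(), dataset.read_schema())
# Standalone illustration with a fake schema and dataframe:
schema = [{"name": "id", "type": "bigint"}, {"name": "score", "type": "double"}]
df = pd.DataFrame({"id": [1.0, None, 3.0], "score": [0.5, 1.5, None]})
out = cast_with_schema(df, schema)
print(out.dtypes)  # id: Int64, score: Float64
```

Because the target dtypes are nullable, the cast succeeds even when integer columns contain missing values, which is exactly the case that breaks infer_with_pandas = False.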
Thanks for your answer!
I totally agree with you; a convert_dtypes option would be handy.
I tried read_schema but couldn't find a way to automate the conversion of the types that don't match my expected output.