Forcing dtype with get_dataframe

Sv3n-Sk4
Level 3

Hello everyone,

I am encountering a problem with the get_dataframe() method when using a Python recipe.

Using:

import dataiku

dataset = dataiku.Dataset("my_dataset")
dataset_df = dataset.get_dataframe()

 

This works as intended. Because infer_with_pandas is True by default, the engine uses the types detected by pandas rather than the dataset schema as detected in DSS.

However, I need it to use the dataset schema instead.

So I set infer_with_pandas = False, but that raises an error:

ValueError: Integer column has NA values in column

Because I want to write a function that can be reused across several different datasets for automation, I can't handle each dataset column by column.

I think forcing all columns of my dataset to string would solve this problem, but I wasn't able to find a way to do it.

If anyone can help, it would be much appreciated.

Thanks

6 Replies
konathan
Level 3

Hi,

What is the data type of the specific column you mention in your post, in the input dataset of the recipe? Is it Integer? If so, you need to manually change the type in the Explore tab of this dataset and use infer_with_pandas = False inside the recipe. You should also check whether the data type was changed in earlier steps of your pipeline, for instance by other recipes. I hope that helps!

-Konstantina

Sv3n-Sk4
Level 3
Author

Hello, thanks for your answer.

It is indeed an integer, and I used infer_with_pandas = False in my Python recipe.

But as I said, I need to automate it. Any manual action would not meet my need.

konathan
Level 3

Is this column created/renamed through a visual recipe at some point in your flow? You might find this article useful.

Sv3n-Sk4
Level 3
Author

In some projects it is, in others it's not!

Since I need something replicable everywhere, I need to find a global solution.

Thanks for the link!

AsishM
Level 2

I agree that Dataiku needs to add better support when creating pandas dataframes from datasets. Maybe a simpler fix for Dataiku could be to enable the use of pandas' DataFrame.convert_dtypes (and use the nullable extension types for columns) via a checkbox or a parameter in the dataiku.Dataset.get_dataframe method.
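To illustrate what convert_dtypes does on the pandas side, here is a minimal sketch with a made-up dataframe (the column names and values are only for illustration). Note that it infers the nullable types from the values, not from the DSS schema, so on its own it may not match the schema exactly.

```python
import pandas as pd

# A frame as pandas would typically infer it: the integer column with a
# missing value has been promoted to float64.
df = pd.DataFrame({"count": [1.0, None, 3.0], "label": ["a", None, "c"]})

# convert_dtypes() switches columns to nullable extension dtypes when the
# values allow it, so the whole-number floats can become nullable integers.
converted = df.convert_dtypes()

print(df.dtypes)
print(converted.dtypes)
```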

A workaround for you might be to use dataiku.Dataset.read_schema first and then make the appropriate nullable data type conversions.
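A rough sketch of that workaround, under a few assumptions: read_schema() is assumed to return a list of column dicts with "name" and "type" keys, the mapping from DSS type names to pandas nullable dtypes is my own guess at a reasonable one, and apply_dss_schema is a hypothetical helper name, not part of the Dataiku API.

```python
import pandas as pd

# Assumed mapping from DSS storage types to pandas nullable dtypes.
DSS_TO_PANDAS = {
    "tinyint": "Int8",
    "smallint": "Int16",
    "int": "Int32",
    "bigint": "Int64",
    "float": "Float32",
    "double": "Float64",
    "boolean": "boolean",
    "string": "string",
}


def apply_dss_schema(df, schema_columns):
    """Return a copy of df with columns cast according to a DSS-style schema.

    schema_columns is assumed to be a list of dicts such as
    {"name": "id", "type": "bigint"}.
    """
    out = df.copy()
    for col in schema_columns:
        name, dss_type = col["name"], col["type"]
        if name not in out.columns:
            continue  # column absent from the dataframe: nothing to cast
        if dss_type == "date":
            out[name] = pd.to_datetime(out[name], errors="coerce")
        elif dss_type in DSS_TO_PANDAS:
            out[name] = out[name].astype(DSS_TO_PANDAS[dss_type])
        # any other type (array, map, object, ...) is left untouched
    return out


# In a recipe this might be used as follows (not run here):
#   dataset = dataiku.Dataset("my_dataset")
#   df = apply_dss_schema(dataset.get_dataframe(), dataset.read_schema())
```

Because the loop is driven by the schema, the same function can be applied to any dataset without naming columns by hand. A cast will still raise if a column holds values that do not fit the declared type.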

Sv3n-Sk4
Level 3
Author

Thanks for your answer!

I totally agree with you and think a convert_dtypes option would be handy.

I tried read_schema but didn't find any way to automate the conversion of the types that don't match what I expect.
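For the idea from the original question, reading every column as a string, one possible sketch relies on Dataset.iter_dataframes_forced_types. I am assuming here that it accepts (names, dtypes, parse_date_columns) as its first arguments plus a chunksize keyword, and that passing False disables date parsing; please check this against the API reference for your DSS version. read_all_as_string is a hypothetical helper name.

```python
import pandas as pd


def read_all_as_string(dataset, chunksize=10000):
    """Read a dataset with every column forced to str.

    `dataset` is expected to behave like a dataiku.Dataset: read_schema()
    returns a list of {"name": ..., "type": ...} dicts (assumption), and
    iter_dataframes_forced_types() yields dataframe chunks.
    """
    schema = dataset.read_schema()
    names = [col["name"] for col in schema]
    # Force every column to str, whatever type the DSS schema declares.
    dtypes = {name: str for name in names}
    chunks = list(
        dataset.iter_dataframes_forced_types(
            names, dtypes, False, chunksize=chunksize
        )
    )
    if not chunks:
        return pd.DataFrame(columns=names)
    return pd.concat(chunks, ignore_index=True)


# In a recipe this might be called as (not run here):
#   df = read_all_as_string(dataiku.Dataset("my_dataset"))
```

Since nothing is parsed as an integer, the "Integer column has NA values" check should not be reached, and the string columns can then be converted in a second step if needed.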
