Forcing dtype with get_dataframe

Sv3n-Sk4
Sv3n-Sk4 Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 32 ✭✭✭✭
edited July 16 in Using Dataiku

Hello everyone,

I am encountering a problem with the get_dataframe() method in a Python recipe.

Using:

import dataiku

dataset = dataiku.Dataset("my_dataset")
dataset_df = dataset.get_dataframe()

works as intended. Since infer_with_pandas is True by default, the engine uses the types detected by pandas rather than the dataset schema defined in DSS.

However, I need it to use the dataset schema instead.

So I set infer_with_pandas = False, but that raises an error:

ValueError: Integer column has NA values in column
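
For context, the ValueError quoted in this post is consistent with a limitation of NumPy-backed integer columns, which cannot hold missing values. The following is a minimal pandas-only sketch of that limitation (the sample values are made up, and it does not reproduce Dataiku's internal code path):

```python
import numpy as np
import pandas as pd

# An integer-like column with one missing value, as pandas would infer it (float64).
values = pd.Series([1.0, np.nan, 3.0])

# Casting to the plain NumPy integer dtype fails because NaN has no integer form.
try:
    values.astype("int64")
    plain_int_ok = True
except ValueError:
    plain_int_ok = False

# The nullable extension dtype "Int64" (capital I) keeps the missing value as <NA>.
nullable = values.astype("Int64")
```

This is why strictly applying an integer schema to a column with empty cells fails, while letting pandas infer the type silently falls back to float.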

Since I want to write a function that works across several different datasets for automation, I can't handle the columns one by one for each dataset.

I think forcing all columns of my dataset to string would solve this problem, but I wasn't able to find a way to do it.

If anyone can help, it would be much appreciated.

Thanks

Answers

  • Konstantina
    Konstantina Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 25 ✭✭✭✭✭

    Hi,

    What is the data type of the column mentioned in your post, in the recipe's input dataset? Is it Integer? If so, you need to manually change the type in the Explore tab of this dataset and use infer_with_pandas = False inside the recipe. You should also check whether the data type was changed in earlier steps of your pipeline, probably by other recipes. I hope that helps!

    -Konstantina

  • AsishM
    AsishM Registered Posts: 4

    I agree that Dataiku needs better support for creating pandas dataframes from datasets. A simpler fix on Dataiku's side could be to enable pandas' DataFrame.convert_dtypes (and the nullable extension dtypes) via a checkbox or a parameter of the dataiku.Dataset.get_dataframe method.
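
    As a rough illustration of what convert_dtypes does on its own, here is a pandas-only sketch. The column names and values are invented for the example, and the dataframe stands in for one already loaded with the default type inference:

```python
import numpy as np
import pandas as pd

# Stand-in for a dataframe loaded with pandas type inference:
# the integer column became float64 because of the missing value.
df = pd.DataFrame({"count": [1.0, np.nan, 3.0], "name": ["a", None, "c"]})

# convert_dtypes picks the best nullable extension dtype for each column.
converted = df.convert_dtypes()
```

    Note that this infers types from the data itself and does not consult the DSS schema, so it only helps when the two agree.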

    A workaround for you might be to call dataiku.Dataset.read_schema first and then make the appropriate nullable data type conversions.
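
    One way this workaround might look is sketched below. It assumes the schema is a list of dicts with "name" and "type" keys, which is the shape read_schema is expected to return. The helper name apply_dss_schema and the type mapping are hypothetical, and the dataframe and schema are stand-ins so the example runs outside DSS:

```python
import pandas as pd

# Assumed mapping from DSS storage types to pandas nullable dtypes.
# Types not listed here (dates, floats, arrays, ...) are left untouched.
DSS_TO_PANDAS = {
    "tinyint": "Int8",
    "smallint": "Int16",
    "int": "Int32",
    "bigint": "Int64",
    "boolean": "boolean",
    "string": "string",
}


def apply_dss_schema(df, schema):
    """Cast columns of df to nullable dtypes based on a DSS-style schema."""
    dtypes = {
        col["name"]: DSS_TO_PANDAS[col["type"]]
        for col in schema
        if col["name"] in df.columns and col["type"] in DSS_TO_PANDAS
    }
    return df.astype(dtypes)


# Stand-in for dataset.get_dataframe(): the NA forced the id column to float64.
df = pd.DataFrame({"id": [1.0, None, 3.0], "label": ["a", None, "c"]})
# Stand-in for dataset.read_schema().
schema = [{"name": "id", "type": "bigint"}, {"name": "label", "type": "string"}]

typed_df = apply_dss_schema(df, schema)
```

    In a recipe, the same helper could be called as apply_dss_schema(dataset.get_dataframe(), dataset.read_schema()), which needs no per-dataset handling. One caveat: because pandas reads the integer column as float64 first, very large integers (beyond 2**53) may lose precision before the cast.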

  • Sv3n-Sk4
    Sv3n-Sk4 Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 32 ✭✭✭✭

    Hello, thanks for your answer.

    It is indeed an integer, and I used infer_with_pandas = False in my Python recipe.

    But as I said, I need to automate this. Any manual action would not meet my need.

  • Sv3n-Sk4
    Sv3n-Sk4 Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 32 ✭✭✭✭

    Thanks for your answer!

    I totally agree with you and think a convert_dtypes option would be handy.

    I tried read_schema but didn't find any way to automate the conversion of the types that don't match what I expect.

  • Konstantina
    Konstantina Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 25 ✭✭✭✭✭

    Is this column created/renamed through a visual recipe at some point in your flow? You might find this article useful.

  • Sv3n-Sk4
    Sv3n-Sk4 Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 32 ✭✭✭✭

    In some projects it is, in others it's not!

    As I need something replicable everywhere, I need to find a global solution.

    Thanks for the link!
