When reading dataiku dataset, can we read empty string as empty string instead of float NaN?

Blossom
Blossom Registered Posts: 8 ✭✭✭✭

Hello community,

I'm developing dataiku recipes. I used a recipe to transform json response to dataframe and write the dataframe as output to dataiku. Pandas will fill not-exsiting-field for some rows with Float NaN automatically, which is not a problem.

But when we read the csv file in dataiku, pandas will treat empty string as Float NaN as well, which has an impact to our work flow. I know that in pandas, there is parameter with which we can treat empty string as empty string. But it seems like this parameter doesn't exist in dataiku?

pd.read_csv('test.csv', keep_default_na=False)

Solutions that I have tried:

1. Verify the schema and make sure that the schema is correct.

2. infer_with_pandas=False

3. Save the dataset in other format like parquet

But none of them works.

Do you have some suggestions on what I should do?

Regards

Answers

  • Sarina
    Sarina Dataiker, Dataiku DSS Core Designer, Dataiku DSS Adv Designer, Registered Posts: 317 Dataiker

    Hi @Blossom
    ,

    I don't think that I've been able to reproduce exactly what you are describing. Would you mind providing some more information on your setup and the issue that you are running into in your workflow? Please attach some screenshots as well outlining the unexpected results that are a issue in your workflow if possible as well.

    How are you reading the CSV file in DSS? Is it from a Python recipe? Where are you seeing the NaN values in DSS?

    Here's what I tried. Reading in a csv file with pandas.read_csv() and no parameters passed, works as you described:

    Screen Shot 2021-05-11 at 3.33.25 PM.png

    If I write this dataframe to a dataset in DSS, this is what it looks like though, which seems expected to me:

    Screen Shot 2021-05-11 at 3.40.09 PM.png

    In your workflow does something unexpected happen at this step?

    In my Python notebook, if I instead add the parameter to fill NaN values with an empty string, this also seemed to work as described and seems expected to me, though it does convert the dtypes for my columns to objects:

    Screen Shot 2021-05-11 at 3.36.32 PM.png

    So I am not sure exactly where the unexpected results are coming in. If you can illustrate your use case a little more that would be helpful!

    Thank you,
    Sarina 

  • Blossom
    Blossom Registered Posts: 8 ✭✭✭✭

    Hello @SarinaS

    Thanks for you reply.

    Take your photo as an example because this looks like the dataset that I have in DSS, with empty string values. We have checked our original dataframe before writing it to DSS, until here the type is correct.

    dataiku.PNG

    But after writing this dataset to DSS, and when I try to read this DSS dataset back to Dataframe, it will read these empty strings as NaN. Here the picture shows when this error produces.

    dataiku2.PNGFor our work, some of the values are indeed NaN, but others should be empty string. I would like to distinguish them. Can you try by your side to see if you have the same error, please ?

  • Ignacio_Toledo
    Ignacio_Toledo Dataiku DSS Core Designer, Dataiku DSS Core Concepts, Neuron 2020, Neuron, Registered, Dataiku Frontrunner Awards 2021 Finalist, Neuron 2021, Neuron 2022, Frontrunner 2022 Finalist, Frontrunner 2022 Winner, Dataiku Frontrunner Awards 2021 Participant, Frontrunner 2022 Participant, Neuron 2023 Posts: 415 Neuron

    Hello @Blossom
    , @SarinaS
    ,

    I can replicate this behavior when reading back from a dataiku dataset to a pandas dataframe. The main problem is for the "string" type columns, were you actually get a nan instead of an empty string.

    I've fixed this previously by replacing the 'nan' values in the string columns of a pandas dataframe with empty strings ( df.column_name.fillna('', inplace=True) ), but still it is a strange behavior for string columns. No problems for me when is a float column, but for the string ones is kind of confusing.

    Hope this helps to the discussion.

  • Alexandru
    Alexandru Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,237 Dataiker

    @Blossom

    Starting with DSS 9.0.4 keep_default_na and na_values parameters were added to get_dataframe() and iter_dataframes()

    https://doc.dataiku.com/dss/latest/python-api/datasets-reference.html#dataiku.Dataset.get_dataframe

    Hope that helps with your case.

  • jcre
    jcre Registered Posts: 2 ✭✭✭

    Thanks for this answer I stumbled upon, keep_default_na seems to solve a problem of mine.

    Could we get the doc updated for these new parameters ? (float_precision, na_values, keep_default_na). For now, they are referenced in the doc but don't have a description like other parameters. I'd like to be sure it does what I want (and only what I want) before using it in actual projects.

Setup Info
    Tags
      Help me…