When reading dataiku dataset, can we read empty string as empty string instead of float NaN?

Blossom
Level 1
When reading dataiku dataset, can we read empty string as empty string instead of float NaN?

Hello  community,

I'm developing dataiku recipes. I used a recipe to transform json response to dataframe and write the dataframe as output to dataiku. Pandas will fill not-exsiting-field for some rows with Float NaN automatically, which is not a problem. 

But when we read the csv file in dataiku, pandas will treat empty string as Float NaN as well, which has an impact to our work flow. I know that in pandas, there is parameter with which we can treat empty string as empty string. But it seems like this parameter doesn't exist in dataiku?

pd.read_csv('test.csv', keep_default_na=False)

 Solutions that I have tried:

1. Verify the schema and make sure that the schema is correct.

2. infer_with_pandas=False

3. Save the dataset in other format like parquet

But none of them works.

Do you have some suggestions on what I should do?

 

Regards

0 Kudos
5 Replies
SarinaS
Dataiker

Hi @Blossom ,

I don't think that I've been able to reproduce exactly what you are describing.  Would you mind providing some more information on your setup and the issue that you are running into in your workflow? Please attach some screenshots as well outlining the unexpected results that are a issue in your workflow if possible as well. 

How are you reading the CSV file in DSS? Is it from a Python recipe? Where are you seeing the NaN values in DSS? 

Here's what I tried.  Reading in a csv file with pandas.read_csv() and no parameters passed, works as you described:

Screen Shot 2021-05-11 at 3.33.25 PM.png

If I write this dataframe to a dataset in DSS, this is what it looks like though, which seems expected to me: 

Screen Shot 2021-05-11 at 3.40.09 PM.pngโ€ƒ

In your workflow does something unexpected happen at this step?  

In my Python notebook, if I instead add the parameter to fill NaN values with an empty string, this also seemed to work as described and seems expected to me, though it does convert the dtypes for my columns to objects: 

Screen Shot 2021-05-11 at 3.36.32 PM.png

So I am not sure exactly where the unexpected results are coming in.  If you can illustrate your use case a little more that would be helpful!

Thank you,
Sarinaโ€ƒ

โ€ƒ

Blossom
Level 1
Author

Hello @SarinaS 

Thanks for you reply.

Take your photo as an example because this looks like the dataset that I have in DSS, with empty string values. We have checked our original dataframe before writing it to DSS, until here the type is correct.

dataiku.PNG

 

But after writing this dataset to DSS, and when I try to read this DSS dataset back to Dataframe, it will read these empty strings as NaN. Here the picture shows when this error produces.

dataiku2.PNGFor our work, some of the values are indeed NaN, but others should be empty string. I would like to distinguish them. Can you try by your side to see if you have the same error, please ?

 

0 Kudos
Ignacio_Toledo

Hello @Blossom@SarinaS,

I can replicate this behavior when reading back from a dataiku dataset to a pandas dataframe. The main problem is for the "string" type columns, were you actually get a nan instead of an empty string.

I've fixed this previously by replacing the 'nan' values in the string columns of a pandas dataframe with empty strings ( df.column_name.fillna('', inplace=True) ), but still it is a strange behavior for string columns. No problems for me when is a float column, but for the string ones is kind of confusing.

Hope this helps to the discussion.

0 Kudos
AlexT
Dataiker

@Blossom 

Starting with DSS 9.0.4 keep_default_na and na_values parameters were added to get_dataframe() and iter_dataframes()

https://doc.dataiku.com/dss/latest/python-api/datasets-reference.html#dataiku.Dataset.get_dataframe

Hope that helps with your case. 

jcre
Level 1

Thanks for this answer I stumbled upon, keep_default_na seems to solve a problem of mine.

Could we get the doc updated for these new parameters ? (float_precision, na_values, keep_default_na). For now, they are referenced in the doc but don't have a description like other parameters. I'd like to be sure it does what I want (and only what I want) before using it in actual projects.

0 Kudos