Read dataframe with datatype

davidmakovoz
davidmakovoz Dataiku DSS Core Designer, Dataiku DSS & SQL, Dataiku DSS ML Practitioner, Dataiku DSS Core Concepts, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2022, Neuron 2023 Posts: 67 Neuron
edited July 16 in Using Dataiku

I have a dataset with a column 'Serial Number' whose data type is string and whose meaning is Text (see attached).

When I read it in a notebook

import dataiku
mydataset = dataiku.Dataset(dataset_name)
df_f3 = mydataset.get_dataframe()
df_f3['Serial Number'].dtypes

I get dtype('int64')

And it's too late to convert it to string, because the original values have leading 0's which are lost when the values are read as integers.
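The loss of leading zeros can be reproduced in plain pandas, independent of Dataiku. This is only a minimal sketch: the sample values are made up, and `pd.read_csv` stands in for whatever reader is used under the hood.

```python
import io

import pandas as pd

# Hypothetical sample data standing in for the 'Serial Number' column.
csv_text = "Serial Number\n00123\n04567\n"

# With type inference, pandas parses the values as integers and drops the zeros.
inferred = pd.read_csv(io.StringIO(csv_text))

# With an explicit string dtype, the values are kept exactly as written.
as_text = pd.read_csv(io.StringIO(csv_text), dtype={"Serial Number": str})

print(inferred["Serial Number"].tolist())
print(as_text["Serial Number"].tolist())
```

Once the column has been parsed as integers, casting it back with `astype(str)` cannot restore the zeros, which is why the type has to be fixed at read time.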

How can I force it to read the column as a string? I tried

df_f3 = mydataset.get_dataframe(infer_with_pandas=False)

but this failed for an unrelated reason, in a different column:

ValueError: Integer column has NA values in column 47

I'm using DSS Version 9.0.7

Answers

  • Catalina
    Catalina Dataiker, Dataiku DSS Core Designer, Registered Posts: 135 Dataiker
    edited July 17

    Hi @davidmakovoz,

    If you want to keep the leading zeros of the original values, you should indeed use

    df_f3 = mydataset.get_dataframe(infer_with_pandas=False) 

    In your case it is most likely failing because there are empty cells in the other column. pandas cannot store empty values in an integer column; it normally converts such a column to double and uses NaN for the empty values.

    You should check if there are empty values in the other column and replace the empty values with an integer like 0.
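The behaviour described in this answer can be illustrated with plain pandas. The sketch below uses made-up data and a hypothetical `count` column. It shows type inference falling back to float64, a forced plain integer dtype raising a `ValueError`, and two possible ways around it. Note that the answer suggests fixing the empty values in the dataset itself; the `fillna` call here only demonstrates the same idea on a dataframe.

```python
import io

import pandas as pd

# Hypothetical data: the second row has an empty 'count' cell.
csv_text = "id,count\na,1\nb,\nc,3\n"

# Inferred types: the empty cell becomes NaN, so the column is float64.
inferred = pd.read_csv(io.StringIO(csv_text))

# Forcing a plain integer dtype fails, because int64 cannot store NaN.
try:
    pd.read_csv(io.StringIO(csv_text), dtype={"count": "int64"})
    strict_error = None
except ValueError as exc:
    strict_error = exc

# Same idea as the suggestion above: replace empty values with 0, then cast.
filled = inferred["count"].fillna(0).astype("int64")

# Alternative (pandas 0.24 or later): the nullable integer dtype keeps the gap.
nullable = inferred["count"].astype("Int64")

print(inferred["count"].dtype, strict_error)
print(filled.tolist())
```

Replacing empties with 0 changes the meaning of the data if 0 is also a legitimate value, so the nullable `Int64` dtype may be the safer choice where the pandas version allows it.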

  • JulienD
    JulienD Registered Posts: 3
    edited July 17

    Hi

    I have a similar problem, but with a column of type array:

    import dataiku
    ds = dataiku.Dataset("my_input")
    next((c for c in ds.cols if c['name'] == 'my_col'))
    # {u'arrayContent': {u'name': u'', u'type': u'string'},
    #  u'name': u'my_col',
    #  u'type': u'array'}

    But when I convert it to dataframe with

    `df = ds.get_dataframe()` # or `ds.get_dataframe(infer_with_pandas=False)`

    the dtype of the my_col column is object, and when I process it with something like

    df['type_of_col'] = df['my_col'].apply(type)

    It returns <type 'str'>

    I need to apply a function that expects a list of strings. How can I preserve that column type when going from the Dataiku dataset to the pandas dataframe?
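One possible workaround for the array question, sketched under an assumption the thread does not confirm: that each cell of the array column reaches pandas as a JSON-encoded string such as `["a", "b"]`. The dataframe and the `parse_array` helper below are hypothetical. If the cells use a different text format, the decoding step would need to change.

```python
import json

import pandas as pd

# Hypothetical frame mimicking the situation described: each cell is a string.
df = pd.DataFrame({"my_col": ['["a", "b"]', "[]", None]})


def parse_array(cell):
    """Decode a JSON array string into a list; return None for missing cells."""
    if cell is None or pd.isna(cell):
        return None
    return json.loads(cell)


# Keep the raw column and add a decoded one holding real Python lists.
df["my_col_list"] = df["my_col"].apply(parse_array)

print(df["my_col_list"].tolist())
```

The decoded column still has dtype object, since pandas has no native list dtype, but each non-missing cell is now a Python list that can be passed to a function expecting a list of strings.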
