Disable dataset column type detection

NikolayK · ‎10-26-2022

After a coded Python recipe in my flow, DSS always detects a column type (meaning) as integer, even though the column is an identifier and contains numbers mixed with strings: "1332", "1363", "954", "S16", "SKO", "BRI".

I tried to set the column type in the recipe

df1 = df1.astype({"Location ID": str})

or force the column type on the output dataset, to no avail: after the next execution it's integer again.

My problem is that empty values are replaced with 'nan' if the column meaning is int.

MiguelangelC · ‎10-26-2022

Hi Nikolay,

By default, DSS scans a number of rows in the dataset and assigns the meaning that best validates the data. At times, meanings are not correctly assigned, and truthfully, it can get bothersome to manually correct them.

Given we are talking about meanings for a code recipe output, an option is to specify in the script the meanings of the output columns. This is exemplified in the documentation: https://doc.dataiku.com/dss/latest/python-api/meanings.html#assigning-a-meaning-to-a-column

A working example for the assignment of the 'Country' meanings you mentioned could be the following:

Let's say I have an input dataset like this (where the autodetected meaning wrongly concludes that the column is a list of countries):

In a code recipe we could specify meanings by doing:

Please note that lines 13-15 are only needed if you plan on building the whole dataset schema. If the dataset already exists (that is, this is not its first build), only the change to the meanings is necessary.

Additionally, we use 'write_dataframe' instead of 'write_with_schema'; otherwise DSS will still try to infer the meanings.
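
The code in this reply was posted as a screenshot, so what follows is only a minimal sketch of the approach described above, not the original snippet, and its line numbers do not match the ones mentioned earlier. The helper names `set_meanings` and `write_with_meanings`, the dataset name "my_output" and the meaning ID "Text" are assumptions chosen for illustration; check the meaning IDs available on your own instance. The `dataiku` import sits inside the function because the package is only available inside DSS.

```python
def set_meanings(schema, meanings):
    """Set the 'meaning' of the named columns in a schema dict, in place.

    `schema` is expected to look like {"columns": [{"name": ..., "type": ...}, ...]},
    which is the shape returned by the public API's get_schema().
    """
    by_name = {col["name"]: col for col in schema["columns"]}
    for name, meaning in meanings.items():
        if name not in by_name:
            raise KeyError(f"column not in schema: {name!r}")
        by_name[name]["meaning"] = meaning
    return schema


def write_with_meanings(df, dataset_name, meanings, first_build=False):
    """Hypothetical recipe helper: pin the meanings, then write the rows."""
    import dataiku  # only importable inside DSS

    output = dataiku.Dataset(dataset_name)
    if first_build:
        # Only needed when the output dataset has no schema yet.
        output.write_schema_from_dataframe(df)

    # Edit the schema through the public API and save the forced meanings.
    handle = dataiku.api_client().get_default_project().get_dataset(dataset_name)
    schema = handle.get_schema()
    handle.set_schema(set_meanings(schema, meanings))

    # write_dataframe rather than write_with_schema, as explained above.
    output.write_dataframe(df)
```

Inside a recipe this could be called as `write_with_meanings(df1, "my_output", {"Location ID": "Text"})`. Keeping the schema edit in a small pure function makes it possible to check it outside DSS on a plain dict.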

Finally, the output dataset would show the meanings we want:


NikolayK · ‎10-26-2022

Similarly, if there are values "US" in a column, the meaning is detected as Country, while all other values have nothing to do with it, e.g. "S20", "S26".

NikolayK · ‎10-26-2022

Hi @MiguelangelC,

Thank you for such a detailed answer! Very insightful.

In my case, write_dataframe() seems to do the trick, provided I manually set the required meaning on the dataset.
