Disable dataset column type detection

Solved!
NikolayK
Level 3

After a coded Python recipe in my flow, DSS always detects a column type (meaning) as integer, even though the column is an identifier and contains numbers mixed with strings: "1332", "1363", "954", "S16", "SKO", "BRI".

I tried to set the column type in the recipe

df1 = df1.astype({"Location ID": str})

or force the column type on the output dataset, to no avail: after the next execution it's integer again.

My problem is that empty values are replaced with 'nan' if the column meaning is int.
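To illustrate the symptom with plain pandas (made-up data): when a numeric-looking column is inferred as float, empty cells come back as NaN and later stringify as 'nan', while forcing text and disabling NA parsing keeps both the identifiers and the empty cells intact:

```python
import io
import pandas as pd

csv = "Location ID,Name\n1332,A\n1363,B\n,C\n954,D\n"

# Inferred dtype: the numeric-looking column becomes float64,
# the empty cell turns into NaN, and casting to str yields 'nan'
inferred = pd.read_csv(io.StringIO(csv)).astype({"Location ID": str})
print(inferred["Location ID"].tolist())  # ['1332.0', '1363.0', 'nan', '954.0']

# Forcing text and keeping empty strings preserves the identifiers as-is
forced = pd.read_csv(io.StringIO(csv), dtype=str, keep_default_na=False)
print(forced["Location ID"].tolist())    # ['1332', '1363', '', '954']
```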

1 Solution
MiguelangelC
Dataiker

Hi Nikolay,

By default, DSS scans a number of rows in the dataset and assigns the meaning that best validates the data. At times, meanings are not correctly assigned, and truthfully, it can get bothersome to manually correct them.

Given we are talking about meanings for a code recipe output, an option is to specify in the script the meanings of the output columns. This is exemplified in the documentation: https://doc.dataiku.com/dss/latest/python-api/meanings.html#assigning-a-meaning-to-a-column

A working example of assigning the 'Country' meaning you mentioned could be the following:

Let's say I have an input dataset like this (where the autodetected meaning wrongfully determines we have a list of countries):

[Screenshot: input.PNG]

In a code recipe we could specify meanings by doing:

[Screenshot: Code.PNG]

Please note that lines 13-15 are only needed if you plan on building the whole dataset schema. If the dataset already exists (that is, this is not its first build), only the meanings change is necessary.

Additionally, we use 'write_dataframe' instead of 'write_with_schema'; otherwise DSS will still try to infer the meanings.
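In case the screenshot does not render, a rough sketch of such a recipe could look like the following. Dataset and column names are placeholders, the exact settings API may differ between DSS versions, and if this is the dataset's first build the schema would need to be set first, as noted above (see the documentation link for the canonical example):

```python
import dataiku

# Read the input without letting pandas re-infer the column types
input_ds = dataiku.Dataset("input")
df = input_ds.get_dataframe(infer_with_pandas=False)

# Force the desired meaning on the output dataset's schema via the public API
client = dataiku.api_client()
project = client.get_default_project()
settings = project.get_dataset("output").get_settings()
for column in settings.get_raw()["schema"]["columns"]:
    if column["name"] == "Country":
        column["meaning"] = "Text"  # instead of the auto-detected 'Country'
settings.save()

# write_dataframe (not write_with_schema), so DSS does not re-infer meanings
output_ds = dataiku.Dataset("output")
output_ds.write_dataframe(df)
```

This only runs inside a DSS code recipe, where the `dataiku` package and the project context are available.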

Finally, the output dataset would show the meanings we want:

[Screenshot: output.PNG]

3 Replies
NikolayK
Level 3
Author

Similarly, if a column contains values like "US", the meaning is detected as Country, even though all the other values, e.g. "S20" or "S26", have nothing to do with countries.


NikolayK
Level 3
Author

Hi @MiguelangelC,

Thank you for such a detailed answer! Very insightful.

In my case, write_dataframe() seems to do the trick, provided I manually set the required meaning on the dataset.
