Disable dataset column type detection

Solved!
NikolayK
Level 3

After a coded Python recipe in my flow, DSS always detects a column type (meaning) as integer, even though the column is an identifier and contains numbers mixed with strings: "1332", "1363", "954", "S16", "SKO", "BRI".

I tried to set the column type in the recipe

df1 = df1.astype({"Location ID": str})

or force the column type on the output dataset, to no avail: after the next execution it's integer again.

My problem is that empty values are replaced with 'nan' if the column meaning is int.
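To illustrate the symptom with plain pandas (made-up data): when a numeric-looking column is inferred as float, empty cells come back as NaN and later stringify as 'nan', while forcing text and disabling NA parsing keeps both the identifiers and the empty cells intact:

```python
import io
import pandas as pd

csv = "Location ID,Name\n1332,A\n1363,B\n,C\n954,D\n"

# Inferred dtype: the numeric-looking column becomes float64,
# the empty cell turns into NaN, and casting to str yields 'nan'
inferred = pd.read_csv(io.StringIO(csv)).astype({"Location ID": str})
print(inferred["Location ID"].tolist())  # ['1332.0', '1363.0', 'nan', '954.0']

# Forcing text and keeping empty strings preserves the identifiers as-is
forced = pd.read_csv(io.StringIO(csv), dtype=str, keep_default_na=False)
print(forced["Location ID"].tolist())    # ['1332', '1363', '', '954']
```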

1 Solution
MiguelangelC
Dataiker

Hi Nikolay,

By default, DSS scans a number of rows in the dataset and assigns the meaning that best validates the data. At times, meanings are not correctly assigned, and truthfully, it can get bothersome to manually correct them.

Given we are talking about meanings for a code recipe output, an option is to specify in the script the meanings of the output columns. This is exemplified in the documentation: https://doc.dataiku.com/dss/latest/python-api/meanings.html#assigning-a-meaning-to-a-column

A working example of assigning the 'Country' meaning you mentioned could be the following:

Let's say I have an input dataset like this (where the autodetected meaning wrongfully determines we have a list of countries):

[Screenshot: input.PNG]

In a code recipe we could specify meanings by doing:

[Screenshot: Code.PNG]

Please note that lines 13-15 are only needed if you plan on building the whole dataset schema. If the dataset already exists (that is, this is not its first build), only the meanings change is necessary.

Additionally, we use 'write_dataframe' instead of 'write_with_schema'; otherwise DSS will still try to infer the meanings.
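In case the screenshot does not render, a rough sketch of such a recipe could look like the following. Dataset and column names are placeholders, the exact settings API may differ between DSS versions, and if this is the dataset's first build the schema would need to be set first, as noted above (see the documentation link for the canonical example):

```python
import dataiku

# Read the input without letting pandas re-infer the column types
input_ds = dataiku.Dataset("input")
df = input_ds.get_dataframe(infer_with_pandas=False)

# Force the desired meaning on the output dataset's schema via the public API
client = dataiku.api_client()
project = client.get_default_project()
settings = project.get_dataset("output").get_settings()
for column in settings.get_raw()["schema"]["columns"]:
    if column["name"] == "Country":
        column["meaning"] = "Text"  # instead of the auto-detected 'Country'
settings.save()

# write_dataframe (not write_with_schema), so DSS does not re-infer meanings
output_ds = dataiku.Dataset("output")
output_ds.write_dataframe(df)
```

This only runs inside a DSS code recipe, where the `dataiku` package and the project context are available.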

Finally, the output dataset would show the meanings we want:

[Screenshot: output.PNG]

3 Replies
NikolayK
Level 3
Author

Similarly, if a column contains values like "US", the meaning is detected as Country, even though all the other values, e.g. "S20" or "S26", have nothing to do with countries.


NikolayK
Level 3
Author

Hi @MiguelangelC,

Thank you for such a detailed answer! Very insightful.

In my case, write_dataframe() seems to do the trick, provided I manually set the required meaning on the dataset.
