Disable dataset column type detection
After a Python code recipe in my flow, DSS always detects the column type (meaning) as integer, even though the column is an identifier and contains numbers mixed with strings: "1332", "1363", "954", "S16", "SKO", "BRI".
I tried to set the column type in the recipe:
df1 = df1.astype({"Location ID": str})
or to force the column type on the output dataset, to no avail: after the next execution it is integer again.
My problem is that empty values are replaced with the literal string 'nan' if the column meaning is int.
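For context, a minimal pandas sketch of why the empty values turn into 'nan' (the column name is from the question; the data is invented):

```python
import numpy as np
import pandas as pd

# Mixed identifier column with an empty (NaN) value
df1 = pd.DataFrame({"Location ID": ["1332", "1363", np.nan, "S16"]})

# Naive cast: the NaN is stringified into the literal 'nan'
as_str = df1["Location ID"].astype(str)
print(as_str.tolist())  # ['1332', '1363', 'nan', 'S16']

# Pandas' nullable string dtype keeps missing values missing instead
as_string = df1["Location ID"].astype("string")
print(as_string.isna().tolist())  # [False, False, True, False]
```

So the 'nan' strings come from casting, not from DSS itself; they appear whenever NaN cells pass through astype(str).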
Best Answer
Miguel Angel, Dataiker (Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer), Registered Posts: 118
Hi Nikolay,
By default, DSS scans a sample of rows in the dataset and assigns the meaning that best validates the data. At times meanings are not assigned correctly, and admittedly, correcting them manually can get tedious.
Given we are talking about meanings for a code recipe output, an option is to specify in the script the meanings of the output columns. This is exemplified in the documentation: https://doc.dataiku.com/dss/latest/python-api/meanings.html#assigning-a-meaning-to-a-column
A working example for assigning the 'Country' meaning you mentioned could be the following:
Let's say I have an input dataset like this (where the autodetected meaning wrongfully determines we have a list of countries):
In a code recipe we could specify meanings by doing:
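The documentation page linked above boils down to editing the 'meaning' key of the output dataset's schema through the API client. A minimal sketch of that idea (the helper function and the dataset name 'output_ds' are my own illustration, not from the thread):

```python
def set_column_meaning(schema, column_name, meaning):
    """Set the 'meaning' key on one column of a DSS-style schema dict
    (the structure found under settings.get_raw()['schema'])."""
    for col in schema["columns"]:
        if col["name"] == column_name:
            col["meaning"] = meaning
    return schema

# Inside a DSS code recipe this would be used roughly like this
# (dataset name is hypothetical):
#
#   import dataiku
#   client = dataiku.api_client()
#   settings = client.get_default_project().get_dataset("output_ds").get_settings()
#   set_column_meaning(settings.get_raw()["schema"], "Location ID", "Text")
#   settings.save()

# Pure-dict demonstration of the helper:
schema = {"columns": [{"name": "Location ID", "type": "string"}]}
set_column_meaning(schema, "Location ID", "Text")
print(schema["columns"][0]["meaning"])  # Text
```

The forced meaning then survives rebuilds, since DSS only re-infers meanings when the schema is rewritten.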
Please note that lines 13-15 are only needed if you plan on building the whole dataset schema. If the dataset already exists (that is, it is not its first build), only the meanings change is necessary.
Additionally, we use 'write_dataframe' instead of 'write_with_schema', otherwise DSS will still try to infer the meanings.
Finally, the output dataset would show the meanings we want:
Answers
Similarly, if a column contains values like "US", the meaning is detected as Country, even though all the other values have nothing to do with countries, e.g. "S20", "S26".
Hi @MiguelangelC, thank you for such a detailed answer! Very insightful.
In my case, write_dataframe() seems to do the trick, provided I manually set the required meaning on the dataset first.