Disable dataset column type detection

NikolayK Partner, Registered Posts: 14

After a coded Python recipe in my flow, DSS always detects a column type (meaning) as integer, even though the column is an identifier and contains numbers mixed with strings: "1332", "1363", "954", "S16", "SKO", "BRI".

I tried to set the column type in the recipe:

df1 = df1.astype({"Location ID": str})

and to force the column type on the output dataset, to no avail: after the next execution it is integer again.

My problem is that empty values are replaced with the string 'nan' if the column meaning is int.
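For what it's worth, the 'nan' issue is easy to reproduce in plain pandas, independent of DSS (a minimal sketch; the values mirror my column):

```python
import pandas as pd

# When the column is read as numeric, the non-numeric IDs become missing,
# and astype(str) turns those missing values into the literal string 'nan'
s = pd.Series([1332.0, 954.0, float("nan")])
print(s.astype(str).tolist())   # → ['1332.0', '954.0', 'nan']

# pandas' nullable string dtype keeps missing values as <NA> instead
print(s.astype("string").isna().tolist())   # → [False, False, True]
```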


Best Answer

  • Miguel Angel
    Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 118
    Answer ✓

    Hi Nikolay,

    By default, DSS scans a number of rows in the dataset and assigns the meaning that best validates the data. At times, meanings are not assigned correctly, and admittedly it can be tedious to correct them manually.

    Given we are talking about meanings for a code recipe output, one option is to specify the meanings of the output columns in the script itself. This is exemplified in the documentation: https://doc.dataiku.com/dss/latest/python-api/meanings.html#assigning-a-meaning-to-a-column
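    To make the mechanism concrete: the dataset settings expose the schema as a list of column dicts, and assigning a meaning amounts to setting a "meaning" key on the right dict. The sketch below uses a plain list as a stand-in for settings.get_raw()["schema"]["columns"]; the column names and the "Text" meaning id are illustrative:

```python
# Stand-in for settings.get_raw()["schema"]["columns"]: a dataset schema is
# a list of column dicts, and a meaning is just a "meaning" key on a column.
# Column names and the "Text" meaning id are illustrative.
schema_columns = [
    {"name": "Location ID", "type": "string"},
    {"name": "Value", "type": "double"},
]

def set_column_meaning(columns, column_name, meaning):
    """Assign a meaning to one column of a schema, in place."""
    for col in columns:
        if col["name"] == column_name:
            col["meaning"] = meaning

set_column_meaning(schema_columns, "Location ID", "Text")
print(schema_columns[0])
# → {'name': 'Location ID', 'type': 'string', 'meaning': 'Text'}
```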

    A working example of this kind of meaning assignment (here, fixing a wrongly detected 'Country' meaning) could be the following:

    Let's say I have an input dataset like this, where the autodetected meaning wrongly determines we have a list of countries:

    [screenshot of the input dataset, misdetected as 'Country']
    In a code recipe we could specify meanings by doing:
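    The original code screenshot is not reproduced here; below is a hypothetical sketch of what such a recipe could look like. The dataset names ("input_dataset", "output_dataset"), the column name "mycolumn" and the "Text" meaning id are placeholders for your own flow, and the dataiku import is deferred into the function because it is only available inside DSS:

```python
def build_output():
    import dataiku  # available inside a DSS code recipe

    df = dataiku.Dataset("input_dataset").get_dataframe()

    # ... transformations ...

    out = dataiku.Dataset("output_dataset")

    # Only needed when building the whole dataset schema (first build)
    out.write_schema_from_dataframe(df)

    # Pin the meaning on the output column so the next run
    # does not re-detect it
    client = dataiku.api_client()
    project = client.get_project(dataiku.default_project_key())
    settings = project.get_dataset("output_dataset").get_settings()
    for col in settings.get_raw()["schema"]["columns"]:
        if col["name"] == "mycolumn":
            col["meaning"] = "Text"
    settings.save()

    # write_dataframe, not write_with_schema, so the meanings are kept
    out.write_dataframe(df)
```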


    Please note that the schema-writing step is only needed if you plan on building the whole dataset schema. If the dataset already exists, that is, if this is not its first build, only the meanings change is necessary.

    Additionally, we use 'write_dataframe' instead of 'write_with_schema'; otherwise DSS would still try to infer the meanings.

    Finally, the output dataset would show the meanings we want:

    [screenshot of the output dataset with the corrected meaning]
