BOM of a CSV detected as part of the first column's header

david1002liu
david1002liu Registered Posts: 12 ✭✭✭

When I use UTF-8 encoding, Dataiku parses '\ufeff' (the byte order mark) as part of my first column's header. After some research, one possible solution is to use the UTF-8-sig encoding; however, Dataiku does not support it.

Screenshot 2021-08-17 130131.png

Screenshot 2021-08-17 130222.png

Any help would be appreciated!

Answers

  • JuanE
    JuanE Dataiker, Registered Posts: 45 Dataiker

    Hello david1002liu,

    As you say, utf-8-sig is not recognized and that’s why you see that warning. Which DSS version and notebook code environment are you using? I cannot replicate this behaviour with DSS version 9.0.2 (when I read a CSV file with a BOM, it is ignored).

    Having said that, do note that according to the Unicode standard, “use of a BOM is neither required nor recommended for UTF-8” (see Section 2.6), so alternatively you could look into removing it from the CSV file before reading it into DSS.

    Best regards,

    Juan Eiros Zamora

    Technical Support Engineer, Dataiku

  • pmasiphelps
    pmasiphelps Dataiker, Dataiku DSS Core Designer, Registered Posts: 33 Dataiker

    Hi,

    Alternatively, you can begin your flow with a Prepare recipe containing a "Rename column" processor. Rename the first column to the same name (type it out). This should remove the byte order mark from the output dataset.

    Best,

    Pat
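
For the suggestion above of removing the BOM from the CSV file before reading it into DSS, here is a minimal Python sketch that would run outside DSS, using only the standard library. The helper names `strip_utf8_bom` and `read_header` are hypothetical and are not part of any Dataiku API. It assumes the file is small enough to load into memory and that it is UTF-8 encoded.

```python
import codecs
import csv
from pathlib import Path


def strip_utf8_bom(path):
    """Remove a leading UTF-8 BOM (bytes EF BB BF) from the file, in place.

    Returns True if a BOM was found and removed, False otherwise.
    Hypothetical helper for illustration; loads the whole file into memory.
    """
    path = Path(path)
    data = path.read_bytes()
    if not data.startswith(codecs.BOM_UTF8):
        return False  # no BOM, leave the file untouched
    path.write_bytes(data[len(codecs.BOM_UTF8):])
    return True


def read_header(path):
    """Return the CSV header row, ignoring a leading BOM if one is present."""
    # The "utf-8-sig" codec decodes UTF-8 and drops a leading BOM when reading.
    with open(path, newline="", encoding="utf-8-sig") as f:
        return next(csv.reader(f))
```

Calling `strip_utf8_bom` on the file before uploading it would leave a plain UTF-8 file; `read_header` only shows how Python's own `utf-8-sig` codec handles the same file without modifying it.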
