Normalize text without lowercase

UserBird
UserBird Dataiker, Alpha Tester Posts: 535 Dataiker
The "Simplify text" processor (and a handful of others) has a "Normalize text" option that "transforms to lowercase, removes accents and performs Unicode normalization (Café -> cafe)." Has anyone figured out a way to remove accents and perform Unicode normalization without changing the case? Café -> Cafe
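
For reference, the behaviour being asked for can be sketched in standard CPython with the `unicodedata` module. This is only an illustration of the general technique (decompose, then drop combining marks); it is not how DSS implements "Normalize text", and whether it runs in a given DSS processor is not established here. The function name `strip_accents` and the choice of NFKD are assumptions.

```python
import unicodedata


def strip_accents(text):
    """Drop combining marks after NFKD decomposition; case is left untouched."""
    # NFKD splits "é" into "e" + U+0301 (combining acute accent).
    decomposed = unicodedata.normalize("NFKD", text)
    # unicodedata.combining() returns 0 for base characters, non-zero for marks.
    return u"".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_accents(u"Café"))  # Cafe
```

Letters that have no Unicode decomposition, such as Ł or ø, pass through this function unchanged, so they would need an explicit replacement table.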

Answers

  • Clément_Stenac
    Clément_Stenac Dataiker, Dataiku DSS Core Designer Posts: 753 Dataiker
    Hi,

    This is not possible with the Simplify text processor. You could do it with a custom Python processor.
  • UserBird
    UserBird Dataiker, Alpha Tester Posts: 535 Dataiker
    Is a "custom Python processor" different from the plain "Python function" processor? I couldn't get the latter to work with the unicodedata module (as described at https://stackoverflow.com/a/16467505/612166). I suspect that's a limitation of the Jython executor.

    What normalization form does DSS use, anyhow?
  • Mattsco
    Mattsco Administrator, Dataiker Posts: 125 Administrator
    Hi,
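    You can do a character-by-character replacement in the Python processor using these two strings, where each character in spec maps to the character at the same position in norm: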
    spec = u"²ÀÁÂÃÄÅàáâãäåĀāĂăĄąÇçĆćĈĉĊċČčÐðĎďĐđÈÉÊËèéêëĒēĔĕĖėĘęĚěĜĝĞğĠġĢģĤĥĦħÌÍÎÏìíîïĨĩĪīĬĭĮįİıĴĵĶķĸĹĺĻļĽľĿŀŁłÑñŃńŅņŇňŉŊŋÒÓÔÕÖØòóôõöøŌōŎŏŐőŔŕŖŗŘřŚśŜŝŞşŠšſŢţŤťŦŧÙÚÛÜùúûüŨũŪūŬŭŮůŰűŲųŴŵÝýÿŶŷŸŹźŻżŽž"
    norm = u"2AAAAAAaaaaaaAaAaAaCcCcCcCcCcDdDdDdEEEEeeeeEeEeEeEeEeGgGgGgGgHhHhIIIIiiiiIiIiIiIiIiJjKkkLlLlLlLlLlNnNnNnNnnNnOOOOOOooooooOoOoOoRrRrRrSsSsSsSssTtTtTtUUUUuuuuUuUuUuUuUuUuWwYyyYyYZzZzZz"
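
The post stops at the two strings, so here is a minimal sketch of how a position-for-position replacement like this could be applied. The strings below are shortened for readability (substitute the full `spec` and `norm` from the post). The helper name `replace_accents`, the column name `name`, and the `process(row)` wrapper in cell mode are assumptions for illustration, not something confirmed in this thread.

```python
# -*- coding: utf-8 -*-
# Shortened, illustrative versions of the two strings from the post.
spec = u"ÀÁÂÃÄÅàáâãäåÇçÈÉÊËèéêëÌÍÎÏìíîïÑñÒÓÔÕÖØòóôõöøÙÚÛÜùúûüÝýÿ"
norm = u"AAAAAAaaaaaaCcEEEEeeeeIIIIiiiiNnOOOOOOooooooUUUUuuuuYyy"

# The two strings must line up one-to-one, otherwise zip() silently truncates.
assert len(spec) == len(norm)

# Translation table: code point of each accented character -> plain replacement.
TABLE = {ord(s): n for s, n in zip(spec, norm)}


def replace_accents(text):
    """Replace every character found in spec by its counterpart in norm."""
    return text.translate(TABLE)


def process(row):
    # Assumed entry point for a "Python function" step in cell mode;
    # "name" is a hypothetical column.
    return replace_accents(row["name"])
```

Because the table only touches the characters listed in `spec`, case is preserved: `replace_accents(u"Café")` gives `Cafe`, and anything not listed is left as is.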