Discover this year's submissions to the Dataiku Frontrunner Awards and give kudos to your favorite use cases and success stories!READ MORE

Normalize text without lowercase

UserBird
Dataiker
Dataiker
Normalize text without lowercase
The "Simplify text" processor (and a handful of others) have a "Normalize text" option that "transforms to lowercase, removes accents and performs Unicode normalization (Café -> cafe)." Anyone figure out a way to remove accents and perform unicode normalization but not change the case? Café -> Cafe
0 Kudos
3 Replies
Clément_Stenac
Dataiker
Dataiker
Hi,

This is not possible via the Simplify text processor. You could do it with a custom Python processor
0 Kudos
UserBird
Dataiker
Dataiker
Author
Is a "custom Python processor" different than just the "Python function" processor? I couldn't get the latter to work with the unicodedata module as described here: https://stackoverflow.com/a/16467505/612166 I suspect that's a limitation of the Jython executor?

What normalization form does DSS use, anyhow?
0 Kudos
Mattsco
Dataiker
Dataiker
Hi,
You can use a replace in the python processor with this:
spec = u"²ÀÁÂÃÄÅàáâãäåĀāĂ㥹ÇçĆćĈĉĊċČčÐðĎďĐđÈÉÊËèéêëĒēĔĕĖėĘęĚěĜĝĞğĠġĢģĤĥĦħÌÍÎÏìíîïĨĩĪīĬĭĮįİıĴĵĶķĸĹĺĻļĽľĿŀŁłÑñŃńŅņŇňʼnŊŋÒÓÔÕÖØòóôõöøŌōŎŏŐőŔŕŖŗŘřŚśŜŝŞşŠšſŢţŤťŦŧÙÚÛÜùúûüŨũŪūŬŭŮůŰűŲųŴŵÝýÿŶŷŸŹźŻżŽž"
norm = u"2AAAAAAaaaaaaAaAaAaCcCcCcCcCcDdDdDdEEEEeeeeEeEeEeEeEeGgGgGgGgHhHhIIIIiiiiIiIiIiIiIiJjKkkLlLlLlLlLlNnNnNnNnnNnOOOOOOooooooOoOoOoRrRrRrSsSsSsSssTtTtTtUUUUuuuuUuUuUuUuUuUuWwYyyYyYZzZzZz"
Mattsco
0 Kudos

Labels

?
Labels (4)
A banner prompting to get Dataiku