
Normalize text without lowercase

UserBird
Dataiker
The "Simplify text" processor (and a handful of others) has a "Normalize text" option that "transforms to lowercase, removes accents and performs Unicode normalization (Café -> cafe)." Has anyone figured out a way to remove accents and perform Unicode normalization without changing the case? Café -> Cafe
Clément_Stenac
Dataiker
Hi,

This is not possible via the Simplify text processor. You could do it with a custom Python processor.
UserBird
Dataiker
Author
Is a "custom Python processor" different from the "Python function" processor? I couldn't get the latter to work with the unicodedata module as described here: https://stackoverflow.com/a/16467505/612166 I suspect that's a limitation of the Jython executor?

What normalization form does DSS use, anyhow?
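
For reference, here is a minimal sketch of the kind of unicodedata-based approach in question, which removes accents while keeping case. It assumes an interpreter where the standard unicodedata module is fully available (for example CPython); the helper name strip_accents is hypothetical, and NFKD is only one possible choice of normalization form, since the form DSS itself uses is not stated in this thread.

```python
# -*- coding: utf-8 -*-
import unicodedata


def strip_accents(text):
    """Remove accents from a unicode string without changing its case.

    Hypothetical helper: assumes the standard unicodedata module works
    in the interpreter that runs it.
    """
    # NFKD splits an accented character into base character + combining marks
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop the combining marks (non-zero combining class), keep everything else
    return u"".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_accents(u"Café"))
```

Letters that have no Unicode decomposition, such as Ø or Ł, pass through this function unchanged, so they would still need an explicit replacement table.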
Mattsco
Dataiker
Hi,
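You can do a character-by-character replace in the Python processor with these two strings, where each character in spec is replaced by the character at the same position in norm: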
spec = u"²ÀÁÂÃÄÅàáâãäåĀāĂ㥹ÇçĆćĈĉĊċČčÐðĎďĐđÈÉÊËèéêëĒēĔĕĖėĘęĚěĜĝĞğĠġĢģĤĥĦħÌÍÎÏìíîïĨĩĪīĬĭĮįİıĴĵĶķĸĹĺĻļĽľĿŀŁłÑñŃńŅņŇňʼnŊŋÒÓÔÕÖØòóôõöøŌōŎŏŐőŔŕŖŗŘřŚśŜŝŞşŠšſŢţŤťŦŧÙÚÛÜùúûüŨũŪūŬŭŮůŰűŲųŴŵÝýÿŶŷŸŹźŻżŽž"
norm = u"2AAAAAAaaaaaaAaAaAaCcCcCcCcCcDdDdDdEEEEeeeeEeEeEeEeEeGgGgGgGgHhHhIIIIiiiiIiIiIiIiIiJjKkkLlLlLlLlLlNnNnNnNnnNnOOOOOOooooooOoOoOoRrRrRrSsSsSsSssTtTtTtUUUUuuuuUuUuUuUuUuUuWwYyyYyYZzZzZz"
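
One possible way to apply such a pair of strings is sketched below. It uses shortened, illustrative subsets of spec and norm rather than the full strings, builds a translation table from them, and applies it with translate, which leaves case untouched. The process(row) entry point and the "name" column are assumptions made for illustration, not something confirmed in this thread, and the row is treated as a plain dict.

```python
# -*- coding: utf-8 -*-
# Shortened, illustrative subsets of the spec/norm strings (not the full lists)
spec = u"ÀÉÈàéèçÇØøñÑ"
norm = u"AEEaeecCOonN"

# Both strings must line up position by position
assert len(spec) == len(norm)

# Map each accented code point to its unaccented replacement
TABLE = dict((ord(s), n) for s, n in zip(spec, norm))


def replace_accents(text):
    # translate() swaps only the mapped characters, so case is preserved
    return text.translate(TABLE)


def process(row):
    # Assumed entry point; "name" is a hypothetical column
    value = row.get("name")
    if value is None:
        return None
    return replace_accents(value)


print(process({"name": u"Café"}))
```

Characters missing from the table are left as they are, so the table has to cover every accented character expected in the data.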