
Normalize text without lowercase

UserBird
Dataiker
The "Simplify text" processor (and a handful of others) has a "Normalize text" option that "transforms to lowercase, removes accents and performs Unicode normalization (Café -> cafe)." Has anyone figured out a way to remove accents and perform Unicode normalization without changing the case? Café -> Cafe
Clément_Stenac
Hi,

This is not possible via the Simplify text processor. You could do it with a custom Python processor.
UserBird
Dataiker
Author
Is a "custom Python processor" different from the "Python function" processor? I couldn't get the latter to work with the unicodedata module as described here: https://stackoverflow.com/a/16467505/612166. I suspect that's a limitation of the Jython executor?

What normalization form does DSS use, anyhow?
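
For reference, the unicodedata approach mentioned above can be sketched as follows in standard CPython. This is a minimal sketch, not a confirmed DSS recipe: the function name strip_accents_keep_case is hypothetical, the choice of NFKD is an assumption (this thread does not say which form DSS uses), and whether the module behaves the same under the Jython executor is not established here.

```python
import unicodedata

def strip_accents_keep_case(text):
    # Decompose each character into a base character plus combining marks.
    # NFKD is an assumption; NFD would keep compatibility characters such as "²" unchanged.
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop the combining marks (accents) and keep everything else, case included.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_accents_keep_case("Café"))  # prints "Cafe"
```

Letters that have no Unicode decomposition, such as Ø or ł, pass through this function unchanged, so a lookup table is still needed if those must be mapped too.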
Mattsco
Dataiker
Hi,
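You can do a character-by-character replace in the Python processor with these two strings, where each character in spec maps to the character at the same position in norm: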
spec = u"²ÀÁÂÃÄÅàáâãäåĀāĂăĄąÇçĆćĈĉĊċČčÐðĎďĐđÈÉÊËèéêëĒēĔĕĖėĘęĚěĜĝĞğĠġĢģĤĥĦħÌÍÎÏìíîïĨĩĪīĬĭĮįİıĴĵĶķĸĹĺĻļĽľĿŀŁłÑñŃńŅņŇňŉŊŋÒÓÔÕÖØòóôõöøŌōŎŏŐőŔŕŖŗŘřŚśŜŝŞşŠšſŢţŤťŦŧÙÚÛÜùúûüŨũŪūŬŭŮůŰűŲųŴŵÝýÿŶŷŸŹźŻżŽž"
norm = u"2AAAAAAaaaaaaAaAaAaCcCcCcCcCcDdDdDdEEEEeeeeEeEeEeEeEeGgGgGgGgHhHhIIIIiiiiIiIiIiIiIiJjKkkLlLlLlLlLlNnNnNnNnnNnOOOOOOooooooOoOoOoRrRrRrSsSsSsSssTtTtTtUUUUuuuuUuUuUuUuUuUuWwYyyYyYZzZzZz"
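
The post stops at the two strings, so here is one way they might be applied. This is a hedged sketch, not the author's code: strip_accents is a hypothetical helper, "name" is a hypothetical column, and the process(row) entry point is an assumption about how the Python function processor calls the code in cell mode. The strings are shortened excerpts to keep the example readable; in practice use the full spec and norm from the post.

```python
# -*- coding: utf-8 -*-
# Shortened excerpts of spec/norm for illustration; both strings must have the same length.
spec = u"ÀÁÂàáâÇçÈÉÊèéêÑñ"
norm = u"AAAaaaCcEEEeeeNn"

# Build the lookup once: accented character -> unaccented replacement.
table = dict(zip(spec, norm))

def strip_accents(text):
    # Replace characters found in spec; leave all others, including their case, untouched.
    return u"".join(table.get(ch, ch) for ch in text)

def process(row):
    # Assumption: row behaves like a dict of column name -> cell value.
    value = row.get("name")  # "name" is a hypothetical column
    return strip_accents(value) if value is not None else None
```

Because the table only touches characters listed in spec, case is preserved and anything not listed passes through unchanged.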