The prepare recipe in Dataiku is very powerful. It's by far the fastest way to script complex data transformations. But sometimes the built-in processors are not quite enough. Whether I use Dataiku Formula Language, Python, or SQL, I often need to create custom steps in my recipes. While I can always create a separate code recipe to do the processing I'm looking for, it's usually more convenient to add code steps to a prepare recipe, between other data preparation steps. This works well enough if the transformation is specific to one dataset, but sometimes I need to run the same transformation against many datasets. For this, it'd be very helpful to be able to create custom processors.
I'd like to be able to write some code that can hook into the script step UI elements (for example, the column selector and configuration radio buttons), then publish it so it's available to everyone on my team, listed in the processors library in the prepare recipe. That way, I can use it everywhere. It'd also be great if these custom processors could be added via a plugin or otherwise installed from the internet, so useful processors could be contributed back to the larger community and made accessible to everyone. This would make the prepare script far more powerful in enterprises, where processors specific to a company's needs could be built and distributed throughout the organization. Among the many use cases, I could imagine processors hooking into internal APIs (one of my main uses of Python processors today) and simplifying otherwise complex tasks.
I'd also like to be able to save specific configurations of existing processors for reuse. For example, if I write a regex that extracts patterns specific to my company's data, I'd love to be able to save it to a library and recall it later. I'd love the same feature in find and replace. For the tokenize text processor, I'd like to be able to define and publish a custom tokenizer. And so on.
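To make the regex idea concrete, here is a minimal sketch in plain Python of what I mean by a reusable pattern library. Nothing here is a Dataiku API: `PATTERN_LIBRARY`, `extract`, `apply_to_rows`, and the two example patterns are all hypothetical names I made up for illustration, and rows are just modeled as dictionaries.

```python
import re

# Hypothetical shared library of named patterns.
# The names and regexes are illustrative assumptions, not real company formats.
PATTERN_LIBRARY = {
    "order_id": re.compile(r"\bORD-(\d{6})\b"),
    "employee_code": re.compile(r"\b([A-Z]{2}\d{4})\b"),
}


def extract(pattern_name, text):
    """Return the first captured group for the named pattern, or None."""
    match = PATTERN_LIBRARY[pattern_name].search(text or "")
    return match.group(1) if match else None


def apply_to_rows(rows, column, pattern_name, output_column):
    """Return new row dicts with output_column holding the extracted value."""
    return [
        {**row, output_column: extract(pattern_name, row.get(column))}
        for row in rows
    ]


# The same saved pattern can be recalled by name against any dataset.
rows = [{"note": "Shipped ORD-123456 today"}, {"note": "no id here"}]
result = apply_to_rows(rows, "note", "order_id", "order_id")
```

The point is that the pattern is defined once, under a name, and recalled wherever it's needed, instead of being pasted into each recipe by hand. A saved processor configuration would ideally work the same way from the UI.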