regex columns in a custom Preparation processor

NN · ‎02-09-2021

Hey Everyone,

I want to create a custom Prepare processor and was reading the documentation section that explains how to edit multiple columns:
https://doc.dataiku.com/dss/latest/plugins/reference/preparation.html#output-multiple-columns

In the example shown there, the input column is a single column picked from the dataset's column list.
Is there any way I can use a regex (entered by the user) to derive the list of columns to be edited?

CoreyS · ‎02-11-2021

Hi @NN! Can you provide further details in the thread (for example, your DSS version) to help users find a solution? Also, can you let us know whether you've tried any fixes already? This should lead to a quicker response from the community.

NN · ‎02-12-2021

Hi,

I was able to resolve my primary requirement using the examples shared by Dataiku. The question I ask below is not critical, but it would be good to learn if someone can guide me.

I am on Dataiku 8.01, trying to create a custom processor for the Prepare recipe.

The aim is that the user provides a regex for column names.
If we find a given value (example: 1) in a matching column, we replace it with another value (example: 10).

In my processor.json I have "mode": "ROW".

The processor.py looks something like this:

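import re

def process(row):
    # Compile the user-supplied regex for column names (case-insensitive)
    r = re.compile(params.get('user_regex'), re.IGNORECASE)
    # Keep only the columns whose name matches the regex
    newlist = list(filter(r.match, row.keys()))
    for col in newlist:
        if row[col] == "1":
            row[col] = "10"
        elif row[col] == "30":
            row[col] = "300"
    # In row mode the full row is returned
    return row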
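
For anyone who wants to try the column-matching logic outside DSS, here is a minimal standalone sketch of the same idea. The `params` dict, the `^flag_` pattern, and the sample column names are assumptions made up for the local test; in the plugin, `params` would come from the processor settings. The replacement values are the ones from the example above.

```python
import re

# Hypothetical stand-in for the plugin's `params`; hard-coded for a local test.
params = {"user_regex": r"^flag_"}

# Replacement table using the example values from the post.
REPLACEMENTS = {"1": "10", "30": "300"}

# Compile the pattern once instead of on every row.
COLUMN_PATTERN = re.compile(params["user_regex"], re.IGNORECASE)


def process(row):
    """Replace known values in every column whose name matches the regex."""
    matching_columns = [col for col in row if COLUMN_PATTERN.match(col)]
    for col in matching_columns:
        if row[col] in REPLACEMENTS:
            row[col] = REPLACEMENTS[row[col]]
    # Row mode: the whole row is returned, modified or not.
    return row


# Made-up sample row: only the two flag_* columns should change.
sample = {"id": "1", "flag_a": "1", "FLAG_b": "30", "other": "30"}
result = process(dict(sample))
```

Using a dictionary for the replacements keeps the loop short when more value pairs are added, and `re.match` only anchors at the start of the column name, so the user's regex needs a trailing `$` if a full-name match is wanted.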

I first ran a Find and replace processor in a Prepare recipe (using the multiple columns option). It highlights only the cells that are modified, as you can see in the second image below, and the note also shows 2 rows modified.

However, when I run my custom processor (the third image below), it shows all 5 rows as modified, though the value has changed in only 2 cells.

While this works almost perfectly for my need, my question is: can I take it a step further and make it behave like the Find and replace processor, highlighting only the specific cells, rows, or even just the columns that are modified instead of the entire data?
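
As a local sanity check that is independent of the DSS display, one could compare each row before and after processing to count what really changed. This is only a sketch run outside DSS: `demo_process`, `changed_cells`, and the five sample rows are hypothetical names and data invented for illustration, and it does not affect what the Prepare recipe highlights.

```python
# Example replacement values from the post.
REPLACEMENTS = {"1": "10", "30": "300"}


def demo_process(row):
    """Hypothetical stand-in for the custom processor, limited to column 'a'."""
    if row.get("a") in REPLACEMENTS:
        row["a"] = REPLACEMENTS[row["a"]]
    return row


def changed_cells(before, after):
    """Return the column names whose value differs between two row dicts."""
    return sorted(col for col in after if before.get(col) != after[col])


# Made-up data: five rows, two of which contain a value to replace.
rows = [
    {"a": "1", "b": "x"},
    {"a": "2", "b": "x"},
    {"a": "30", "b": "x"},
    {"a": "4", "b": "x"},
    {"a": "5", "b": "x"},
]

# Process a copy of each row so the original stays available for comparison.
modified = [changed_cells(row, demo_process(dict(row))) for row in rows]
modified_row_count = sum(1 for cols in modified if cols)
```

Here every row passes through the function, yet the comparison reports changes in only two of them.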

MehdiH · ‎04-12-2021

Hi @NN ,

Great job on your custom Python processor!

Unfortunately, because every row goes through the Python processor, all rows are highlighted even if their content is not modified: in "row" mode the Python step returns an entire row for each input row, so as of today the display assumes the whole row has been modified.

Cheers!

NN · ‎04-12-2021

Thanks @MehdiH

That makes sense.
