regex columns in a custom Preparation processor

NN
NN Neuron, Registered, Neuron 2022, Neuron 2023 Posts: 145 Neuron

Hey everyone,

I want to create a custom Prepare processor and was reading the documentation section on editing multiple columns:
https://doc.dataiku.com/dss/latest/plugins/reference/preparation.html#output-multiple-columns

In the example shared, the input is a single column chosen from the dataset's columns.
Is there any way I can use a regex (entered by the user) to derive the list of columns to be edited?

Best Answer

  • MehdiH
    MehdiH Dataiker, Dataiku DSS Core Designer, Dataiku DSS Core Concepts Posts: 21 Dataiker
    Answer ✓

    Hi @NN,

    Great job on your custom Python processor!

    Unfortunately, because every row goes through the Python processor, all rows are highlighted even if their content is not modified. In "row" mode the Python step returns an entire row for each input row, so as of today the display assumes the whole row has been modified.
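
    Since the preview cannot tell which cells changed, one way to check outside DSS is to compare each row before and after the function runs. This is only a minimal sketch in plain Python: `changed_cells`, `replace_ones`, and the sample rows are hypothetical names for illustration and are not part of the DSS API.

```python
def changed_cells(process_fn, row):
    """Return {column: (old, new)} for cells that process_fn actually changed."""
    before = dict(row)             # snapshot, since process_fn may edit in place
    after = process_fn(dict(row))  # run on a copy so the caller's row is untouched
    return {
        col: (before.get(col), after.get(col))
        for col in set(before) | set(after)
        if before.get(col) != after.get(col)
    }


def replace_ones(row):
    # Toy row-mode function: rewrites "1" to "10" in every column.
    for col in row:
        if row[col] == "1":
            row[col] = "10"
    return row


rows = [{"a": "1", "b": "x"}, {"a": "2", "b": "y"}]
modified = [changed_cells(replace_ones, r) for r in rows]
```

    Here `modified` holds one dict per row; an empty dict means the row came back unchanged even though the function returned it.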

    Cheers!

Answers

  • CoreyS
    CoreyS Dataiker Alumni, Dataiku DSS Core Designer, Dataiku DSS Core Concepts, Registered Posts: 1,150 ✭✭✭✭✭✭✭✭✭

    Hi @NN! Can you provide further details in this thread (for example, your DSS version) to help users find a solution? Also, can you let us know whether you've tried any fixes already? This should lead to a quicker response from the community.

  • NN
    NN Neuron, Registered, Neuron 2022, Neuron 2023 Posts: 145 Neuron
    edited July 17

    Hi,

    I was able to resolve my primary requirement using the examples shared by Dataiku. The questions I ask below are not critical, but it would be good to learn if someone can guide me.

    I am on Dataiku 8.01, trying to create a custom processor for the Prepare recipe.

    The aim is that the user provides a regex for column names.
    If we find a given value (for example, 1) in a matching column, we replace it with another value (for example, 10).

    In my processor.json, "mode" is set to "ROW".

    The processor.py will be something like the following:

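    import re

    def process(row):
        # `params` is assumed to be supplied by DSS to the processor
        r = re.compile(params.get('user_regex'), re.IGNORECASE)
        # Keep only the columns whose name matches the user-supplied regex
        # (re.match anchors at the start of the name)
        newlist = [col for col in row.keys() if r.match(col)]
        for col in newlist:
            if row[col] == "1":
                row[col] = "10"
            elif row[col] == "30":
                row[col] = "300"
        # In ROW mode the whole row is returned
        return row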
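
    To try the same idea outside DSS, here is a self-contained sketch. The `params` dict below is a hypothetical stand-in for the one DSS supplies, the `^flag_` pattern and the sample row are made up, and `COLUMN_PATTERN` and `REPLACEMENTS` are illustrative names; the two replacement pairs are the ones from the example above.

```python
import re

# Hypothetical stand-in for the `params` dict that DSS supplies to the processor.
params = {"user_regex": "^flag_"}

# Compile once instead of on every row.
COLUMN_PATTERN = re.compile(params.get("user_regex"), re.IGNORECASE)

# Old value -> new value, using the pairs from the example in this post.
REPLACEMENTS = {"1": "10", "30": "300"}


def process(row):
    # Rewrite known values in every column whose name matches the regex.
    for col in list(row.keys()):
        if COLUMN_PATTERN.match(col) and row[col] in REPLACEMENTS:
            row[col] = REPLACEMENTS[row[col]]
    return row


if __name__ == "__main__":
    sample = {"id": "1", "flag_a": "1", "Flag_B": "30", "other": "30"}
    print(process(sample))
```

    Keeping the replacements in a dict avoids a growing if/elif chain when more value pairs are added.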

    I first ran a Find and replace processor in the Prepare recipe (using the multiple columns option). This highlights only the cells that are modified, as you can see in the second image below, and the note also shows 2 rows modified.

    However, when I run my custom processor (the third image below), it shows all 5 rows as modified,
    though the value has only changed in 2 cells.

    While this works almost perfectly for my needs, my question is: can I take it a step further and make it behave like the Find and replace processor, highlighting only the specific cells, rows, or even just the columns that are modified instead of the entire data?

    processor.JPG

  • NN
    NN Neuron, Registered, Neuron 2022, Neuron 2023 Posts: 145 Neuron

    Thanks @MehdiH

    That makes sense.
