parsing string column and one hot encode

ortrsa
ortrsa Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1

Hi,

I have a column with strings where each cell has several categories separated by a comma but in no particular order.

I would like to separate them and give each categorie a new column.

for example:

this cells -

1. "A, B, C"

2. "C, B"

will convert to 3 diffrent columns named A, B and C and row 1 will get the values 1,1,1 and row 2 will get 0,1,1

Is it possible?

Thanks!

Answers

  • Catalina
    Catalina Dataiker, Dataiku DSS Core Designer, Registered Posts: 135 Dataiker
    edited July 17

    Hi @ortrsa
    ,

    You can use a Python recipe to separate the strings from the column and then create new columns. The code below shows how to achieve this:

    import dataiku
    import pandas as pd, numpy as np
    
    # Read recipe inputs
    input = dataiku.Dataset("input")
    df = input.get_dataframe()
    
    for i in range(len(df)) :  #iterate over all rows
        try:
            cols = df.loc[i, "input_column"].split(",")  # split value by , to get columns
            for col in cols:                           # for each detected columns
                if not col in df:                      # create new column and set 0 to all rows
                    df[col]=0
                df.loc[i,col]=1                        # for current column of current row set 1
        except Exception as e: print(e)
            
    # Write recipe outputs
    output = dataiku.Dataset("output")
    output.write_with_schema(df)
    

    After running this code the output dataset contains the new columns (A, B, C) with the corresponding values.

    Screenshot 2023-01-13 at 19.30.43.png

Setup Info
    Tags
      Help me…