parsing string column and one hot encode

ortrsa
Level 1
parsing string column and one hot encode

Hi,

I have a column with strings where each cell has several categories separated by a comma but in no particular order.

I would like to separate them and give each categorie a new column.

for example:

this cells -

1. "A, B, C" 

2. "C, B"

will convert to 3 diffrent columns named A, B and C and row 1 will get the values 1,1,1 and row  2 will get 0,1,1

 

Is it possible?

Thanks!

0 Kudos
1 Reply
CatalinaS
Dataiker

Hi @ortrsa

 

You can use a Python recipe to separate the strings from the column and then create new columns. The code below shows how to achieve this:

import dataiku
import pandas as pd, numpy as np

# Read recipe inputs
input = dataiku.Dataset("input")
df = input.get_dataframe()

for i in range(len(df)) :  #iterate over all rows
    try:
        cols = df.loc[i, "input_column"].split(",")  # split value by , to get columns
        for col in cols:                           # for each detected columns
            if not col in df:                      # create new column and set 0 to all rows
                df[col]=0
            df.loc[i,col]=1                        # for current column of current row set 1
    except Exception as e: print(e)
        
# Write recipe outputs
output = dataiku.Dataset("output")
output.write_with_schema(df)

After running this code the output dataset contains the new columns (A, B, C) with the corresponding values.

Screenshot 2023-01-13 at 19.30.43.png

โ€ƒ

0 Kudos