Parsing a string column and one-hot encoding it
ortrsa
Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1 ✭
Hi,
I have a string column where each cell holds several categories separated by commas, in no particular order.
I would like to split them and give each category its own column.
For example, these cells:
1. "A, B, C"
2. "C, B"
would be converted into three different columns named A, B and C, where row 1 gets the values 1, 1, 1 and row 2 gets 0, 1, 1.
Is it possible?
Thanks!
Answers
Hi @ortrsa,
You can use a Python recipe to split the strings in the column and then create the new columns. The code below shows how to achieve this:
import dataiku
import pandas as pd
import numpy as np

# Read recipe inputs
input_dataset = dataiku.Dataset("input")
df = input_dataset.get_dataframe()

for i in df.index:  # iterate over all rows
    value = df.loc[i, "input_column"]
    if not isinstance(value, str):  # skip empty (NaN) cells
        continue
    for col in value.split(","):  # split the value on "," to get the categories
        col = col.strip()  # drop the space after each comma, so " B" becomes "B"
        if not col:
            continue
        if col not in df.columns:  # create the new column and set 0 for all rows
            df[col] = 0
        df.loc[i, col] = 1  # set 1 for the current category in the current row

# Write recipe outputs
output = dataiku.Dataset("output")
output.write_with_schema(df)
After running this code the output dataset contains the new columns (A, B, C) with the corresponding values.
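If the dataset is large, the row-by-row loop can get slow. As a possible alternative, here is a minimal sketch that uses pandas' built-in `Series.str.get_dummies` to do the split and the encoding in one step. The helper name `one_hot_from_delimited` and the column name `input_column` are just placeholders for illustration, and the sketch assumes none of the category names clash with an existing column (in that case `join` would raise an error).

```python
import re

import pandas as pd


def one_hot_from_delimited(df: pd.DataFrame, column: str, sep: str = ",") -> pd.DataFrame:
    """Return a copy of df with one 0/1 column per category found in `column`.

    Hypothetical helper for illustration; not part of the Dataiku API.
    """
    # Normalize whitespace around the separator so " B" and "B" count as the same category.
    cleaned = (
        df[column]
        .str.strip()
        .str.replace(rf"\s*{re.escape(sep)}\s*", sep, regex=True)
    )
    # get_dummies splits each cell on `sep` and builds one indicator column per category.
    dummies = cleaned.str.get_dummies(sep=sep)
    # Assumes the category names do not collide with existing column names.
    return df.join(dummies)


# Example with the two cells from the question.
df = pd.DataFrame({"input_column": ["A, B, C", "C, B"]})
result = one_hot_from_delimited(df, "input_column")
print(result)
```

In a Python recipe you would presumably keep the same read and write steps as above and only replace the loop with a call to this helper. Empty cells end up with 0 in every category column.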