Unfold all categorical columns

ABarneche
Level 1
Unfold all categorical columns

Hi all,

I would like to unfold/dummify all of my categorical columns. I'm able to do that one column at a time using the Unfold recipe but I'm searching for automated way to do all of them without adding a new step for each column.

I want to dummify it for visualization and analytics purposes.

Do you have an idea ?

Thanks !


Operating system used: Windows

0 Kudos
2 Replies
SarinaS
Dataiker

Hi @ABarneche,

Creating a Python recipe that performs the unfold iteratively is probably a good solution for automating adding many unfolds to a dataset. 

Here's a brief example. I use the following sample input dataset:
Screen Shot 2022-04-11 at 6.28.44 PM.png

If I use a prepare recipe, these are my results if I add the "unfold" step to my recipe twice, for the type and subtype columns:

Screen Shot 2022-04-11 at 6.28.01 PM.png

An example of doing this in a Python recipe in order to get the same results is:

 

import dataiku
from dataiku import pandasutils as pdu
import pandas as pd

mydataset = dataiku.Dataset("YOUR_DATASET_NAME")
mydataset_df = mydataset.get_dataframe()

# REPLACE with listing out each column that requires an unfold, so that the loop can go through each and perform the unfold. In my example these are 'type' and 'subtype'.  
columns_to_unfold = ['type', 'subtype']
# this current example works with a unique id for each row. If you have one, replace with your unique ID column name
unique_id_column = 'id'
first_row = True
for unfold_col in columns_to_unfold:
    if first_row:
        subtable = pd.crosstab(mydataset_df[unique_id_column], mydataset_df[unfold_col])
        first_row = False
    else:
        additional_table = pd.crosstab(mydataset_df[unique_id_column], mydataset_df[unfold_col])
        subtable = subtable.join(additional_table, 'unique_id_column')

 

โ€ƒ

Screen Shot 2022-04-11 at 6.33.53 PM.png

If you are interested in the combination of your category columns, using a Pivot recipe would also work.

I hope that information is helpful. If this doesn't work for your use case, please feel free to include a sample of your data and desired output, and we can take a look.

Thanks,
Sarinaโ€ƒ

0 Kudos
AlexGo
Dataiker

Sarina's way is definitely the recommended option, but if you want to try with the visual recipes you can also try something like 'concatenate' and then 'Create dummy columns by splitting'. You would have to ensure none of your columns had the same data though.

The column view in the Prepare recipe can help you easily filter and select the 'Text' fields and then concatenate with a unique delimiter such as '||'

Screen Shot 2022-04-12 at 10.00.26 AM.png

 

Then you can run the Create Dummy Columns by Splitting:

Screen Shot 2022-04-12 at 10.00.47 AM.png

 

0 Kudos