Unfold all categorical columns
Hi all,
I would like to unfold/dummify all of my categorical columns. I'm able to do that one column at a time using the Unfold recipe but I'm searching for automated way to do all of them without adding a new step for each column.
I want to dummify it for visualization and analytics purposes.
Do you have an idea ?
Thanks !
Operating system used: Windows
Answers
-
Sarina Dataiker, Dataiku DSS Core Designer, Dataiku DSS Adv Designer, Registered Posts: 317 Dataiker
Hi @ABarneche
,
Creating a Python recipe that performs the unfold iteratively is probably a good solution for automating adding many unfolds to a dataset.
Here's a brief example. I use the following sample input dataset:If I use a prepare recipe, these are my results if I add the "unfold" step to my recipe twice, for the type and subtype columns:
An example of doing this in a Python recipe in order to get the same results is:
import dataiku from dataiku import pandasutils as pdu import pandas as pd mydataset = dataiku.Dataset("YOUR_DATASET_NAME") mydataset_df = mydataset.get_dataframe() # REPLACE with listing out each column that requires an unfold, so that the loop can go through each and perform the unfold. In my example these are 'type' and 'subtype'. columns_to_unfold = ['type', 'subtype'] # this current example works with a unique id for each row. If you have one, replace with your unique ID column name unique_id_column = 'id' first_row = True for unfold_col in columns_to_unfold: if first_row: subtable = pd.crosstab(mydataset_df[unique_id_column], mydataset_df[unfold_col]) first_row = False else: additional_table = pd.crosstab(mydataset_df[unique_id_column], mydataset_df[unfold_col]) subtable = subtable.join(additional_table, 'unique_id_column')
If you are interested in the combination of your category columns, using a Pivot recipe would also work.
I hope that information is helpful. If this doesn't work for your use case, please feel free to include a sample of your data and desired output, and we can take a look.
Thanks,
Sarina -
AlexGo Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 18 Dataiker
Sarina's way is definitely the recommended option, but if you want to try with the visual recipes you can also try something like 'concatenate' and then 'Create dummy columns by splitting'. You would have to ensure none of your columns had the same data though.
The column view in the Prepare recipe can help you easily filter and select the 'Text' fields and then concatenate with a unique delimiter such as '||'
Then you can run the Create Dummy Columns by Splitting: