Unfold all categorical columns

ABarneche
ABarneche Partner, Registered Posts: 1 Partner

Hi all,

I would like to unfold/dummify all of my categorical columns. I'm able to do that one column at a time using the Unfold recipe but I'm searching for automated way to do all of them without adding a new step for each column.

I want to dummify it for visualization and analytics purposes.

Do you have an idea ?

Thanks !


Operating system used: Windows

Tagged:

Answers

  • Sarina
    Sarina Dataiker, Dataiku DSS Core Designer, Dataiku DSS Adv Designer, Registered Posts: 317 Dataiker
    edited July 17

    Hi @ABarneche
    ,

    Creating a Python recipe that performs the unfold iteratively is probably a good solution for automating adding many unfolds to a dataset.

    Here's a brief example. I use the following sample input dataset:
    Screen Shot 2022-04-11 at 6.28.44 PM.png

    If I use a prepare recipe, these are my results if I add the "unfold" step to my recipe twice, for the type and subtype columns:

    Screen Shot 2022-04-11 at 6.28.01 PM.png

    An example of doing this in a Python recipe in order to get the same results is:

    import dataiku
    from dataiku import pandasutils as pdu
    import pandas as pd
    
    mydataset = dataiku.Dataset("YOUR_DATASET_NAME")
    mydataset_df = mydataset.get_dataframe()
    
    # REPLACE with listing out each column that requires an unfold, so that the loop can go through each and perform the unfold. In my example these are 'type' and 'subtype'.  
    columns_to_unfold = ['type', 'subtype']
    # this current example works with a unique id for each row. If you have one, replace with your unique ID column name
    unique_id_column = 'id'
    first_row = True
    for unfold_col in columns_to_unfold:
        if first_row:
            subtable = pd.crosstab(mydataset_df[unique_id_column], mydataset_df[unfold_col])
            first_row = False
        else:
            additional_table = pd.crosstab(mydataset_df[unique_id_column], mydataset_df[unfold_col])
            subtable = subtable.join(additional_table, 'unique_id_column')

    Screen Shot 2022-04-11 at 6.33.53 PM.png

    If you are interested in the combination of your category columns, using a Pivot recipe would also work.

    I hope that information is helpful. If this doesn't work for your use case, please feel free to include a sample of your data and desired output, and we can take a look.

    Thanks,
    Sarina 

  • AlexGo
    AlexGo Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 18 Dataiker

    Sarina's way is definitely the recommended option, but if you want to try with the visual recipes you can also try something like 'concatenate' and then 'Create dummy columns by splitting'. You would have to ensure none of your columns had the same data though.

    The column view in the Prepare recipe can help you easily filter and select the 'Text' fields and then concatenate with a unique delimiter such as '||'

    Screen Shot 2022-04-12 at 10.00.26 AM.png

    Then you can run the Create Dummy Columns by Splitting:

    Screen Shot 2022-04-12 at 10.00.47 AM.png

Setup Info
    Tags
      Help me…