Clustering datasets having both numerical and categorical variables

Ouma
Ouma Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Core Concepts, Registered Posts: 12 ✭✭✭✭

Hello World!

I would like to know whether there is a way in Dataiku DSS to perform clustering directly on a dataset that has both numerical and categorical variables.

Thank you!

Answers

  • JeremieP
    JeremieP Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner Posts: 7 Dataiker

    Hi Ouma and welcome to Dataiku Community.

    You can perform clustering in DSS, whatever the types of your variables, as follows:

    • Go to the Flow for your project

    • Click on the dataset you want to use

    • Select the Lab

    • Create a new visual analysis

    • Click on the Models tab

    • Select Create first model

    • Select Clustering

    You will find more information on Clustering in the documentation.

    Hope this helps.

    Jérémie

  • Ouma

    Hey @JeremieP!

    Thanks for your answer, but the default clustering models in this section do not support clustering on datasets with mixed variable types.

  • JeremieP

    Which of the default clustering models are you trying to use? If you select KMeans, for instance, it supports clustering on mixed datasets. There is an example of this in the Dataiku Gallery.

  • Ouma

    Yes, in this example encoders are used to transform the categorical variables into numeric ones, and KMeans is then applied so that distances can be computed. In my case, however, there is no natural ordinal relationship between the categories within a variable, so assigning a number to each categorical level would be meaningless.

    That is why I wanted to know whether Dataiku has an implementation of the k-prototypes algorithm, or of another algorithm that handles this. Otherwise I will have to add a custom Python model.

  • JeremieP

    Dummy encoding is not about assigning a number to each category within a variable; that is label encoding.
    Dummy encoding creates one new variable per category of the original variable, then populates each new variable with 1 or 0 depending on the value of the original variable. It is the most common way to encode categorical variables before using them in machine learning models.

    There is no implementation of k-prototypes in DSS, so if you need this model you would have to add a custom Python model.

    If you need to encode your variables differently, you can do so from the Design tab of your model, in the Features handling section. When you click on a categorical variable, you will see a Category handling field where you can choose between different encodings or perform custom preprocessing with Python code.
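
The difference between label encoding and dummy encoding described above can be illustrated with a small sketch. It is dependency-free Python rather than anything DSS runs internally, and the colour categories and helper names are made up for the example. It shows why label encoding imposes an artificial order on distances while dummy encoding keeps all categories equally far apart.

```python
import math


def label_encode(values):
    """Label encoding: map each distinct category to one integer."""
    categories = sorted(set(values))
    mapping = {cat: i for i, cat in enumerate(categories)}
    return [[float(mapping[v])] for v in values]


def dummy_encode(values):
    """Dummy encoding: one 0/1 column per distinct category."""
    categories = sorted(set(values))
    return [[1.0 if v == cat else 0.0 for cat in categories] for v in values]


def euclidean(a, b):
    """Euclidean distance between two equally long numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical categories with no natural order
colors = ["blue", "green", "red"]

labels = label_encode(colors)
dummies = dummy_encode(colors)

# Distances blue-green and blue-red under each encoding
label_dists = [euclidean(labels[0], labels[1]), euclidean(labels[0], labels[2])]
dummy_dists = [euclidean(dummies[0], dummies[1]), euclidean(dummies[0], dummies[2])]

print("label encoding:", label_dists)  # the two distances differ
print("dummy encoding:", dummy_dists)  # the two distances are equal
```

With label encoding, "red" ends up twice as far from "blue" as "green" is, purely because of the integer assignment. With dummy encoding every pair of distinct categories is at the same distance, at the cost of one extra column per category.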

  • Ouma

    Yes, I'll use a custom Python model. Since I have so many categorical variables, each with a very large number of values, encoding won't help me. Thank you @JeremieP!

  • gjoseph
    gjoseph Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 15
    edited July 17

    @JeremieP, we just need the categorical features to be passed untouched to a custom clustering algorithm such as KPrototypes.

    Something like this should help:

    from sklearn.preprocessing import FunctionTransformer

    # Pass-through: return the input unchanged so the categorical values reach the model as-is
    processor = FunctionTransformer(func=lambda x: x)

  • Reznov
    Reznov Registered Posts: 1

    k-prototypes requires a list identifying the categorical variables as one of the arguments of its fit method. However, in a custom Python model in the Dataiku Lab, we can only initialize the k-prototypes model object. The call to fit is written by Dataiku, and the user has no control over it. This causes an error, since the fit method of k-prototypes doesn't work without the list of categorical variables. Is there any way to solve this problem?

  • gjoseph

    Hi, I don't like pointing you to generic documentation, but this page answers your question: https://doc.dataiku.com/dss/latest/machine-learning/algorithms/in-memory-python.html#custom-models

    You can override the existing methods and make them behave as you expect.
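
One possible reading of the suggestion above is a thin wrapper that stores the extra fit arguments at construction time and forwards them when fit is called with the data alone. The sketch below is an illustration under assumptions, not Dataiku's or any library's API: `FixedFitArgsClusterer` and `StubPrototypes` are hypothetical names, and the stub only stands in for a real k-prototypes estimator so the example runs without extra packages. It also assumes DSS calls `fit(X)` or `fit_predict(X)` with only the feature matrix, as described in the question.

```python
class StubPrototypes:
    """Hypothetical stand-in for a k-prototypes estimator.

    Like the real algorithm described in the thread, its fit() refuses
    to run without the list of categorical columns.
    """

    def fit(self, X, categorical=None):
        if categorical is None:
            raise ValueError("categorical column indices are required")
        self.seen_categorical = list(categorical)
        # Toy rule, NOT real clustering: rows that share the value of the
        # first categorical column get the same cluster label.
        groups = {}
        self.labels_ = [
            groups.setdefault(row[categorical[0]], len(groups)) for row in X
        ]
        return self


class FixedFitArgsClusterer:
    """Hypothetical wrapper that supplies extra fit() arguments itself."""

    def __init__(self, model, **fit_kwargs):
        self.model = model
        self.fit_kwargs = fit_kwargs

    def fit(self, X, y=None):
        # The caller passes only X; the stored arguments are added here.
        self.model.fit(X, **self.fit_kwargs)
        self.labels_ = self.model.labels_
        return self

    def fit_predict(self, X, y=None):
        return self.fit(X).labels_


# Made-up data: column 0 is numerical, column 1 is categorical
X = [[1.0, "a"], [2.0, "b"], [1.5, "a"]]

clusterer = FixedFitArgsClusterer(StubPrototypes(), categorical=[1])
labels = clusterer.fit_predict(X)  # called without any extra argument
print(labels)
```

In a real project the inner model would be the actual k-prototypes estimator, and the wrapper would probably also need to satisfy whatever scikit-learn compatibility the documentation page linked above requires (for example parameter introspection or a predict method); those details are not covered here.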
