Clustering datasets having both numerical and categorical variables

Ouma · May 2021

Hello World!

I would like to know whether there is a way to run clustering directly on a dataset that has both numerical and categorical variables in Dataiku DSS.

Thank you!

JeremieP · May 2021

Hi Ouma and welcome to Dataiku Community.

You can perform clustering in DSS, whatever the types of your variables, as follows:

Go to the Flow for your project
Click on the dataset you want to use
Select the Lab
Create a new visual analysis
Click on the Models tab
Select Create first model
Select Clustering

You will find more information on Clustering in the documentation.

Hope this helps.

Jérémie

Ouma · May 2021

Hey @JeremieP!

Thanks for your answer, but the default clustering models in this section do not support clustering of mixed datasets.

JeremieP · May 2021

Which of the default clustering models are you trying to use? If you select KMeans, for instance, it supports clustering of mixed datasets. You have an example of this in the Dataiku Gallery here

Ouma · May 2021

Yes, in this example encoders are used to transform the categorical variables to numeric ones, and KMeans is then applied so that a distance can be computed. In my case, however, there is no natural ordinal relationship between the categories within the same variable, so assigning a number to each categorical level would be meaningless.

That is why I wanted to know whether there is a Dataiku implementation of the k-prototypes algorithm, or of another algorithm that handles this. Otherwise I will have to add a custom Python model.

JeremieP · May 2021

Dummy encoding does not assign a number to each category within a variable; that is label encoding.
Dummy encoding creates one new variable per category of the original variable, then fills each new variable with 1 or 0 depending on the value of the original variable. This is the most common way to encode categorical variables before using them in Machine Learning models.
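
To make the difference concrete, here is a minimal sketch in plain Python. The `colors` column and all variable names are made up for illustration, and this is not how DSS implements its encoders internally. It only shows that label codes impose an order on the categories, while dummy columns keep every pair of distinct categories at the same distance.

```python
import math

# Hypothetical categorical column with no natural order.
colors = ["red", "green", "blue", "green"]
categories = sorted(set(colors))  # ['blue', 'green', 'red']

# Label encoding: one integer per category (implies an order).
label_of = {cat: i for i, cat in enumerate(categories)}
label_encoded = [label_of[c] for c in colors]


def one_hot(value):
    """Dummy encoding: one 0/1 entry per known category."""
    return [1 if value == cat else 0 for cat in categories]


dummy_encoded = [one_hot(c) for c in colors]


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# With label codes, "blue" looks closer to "green" than to "red".
label_blue_green = abs(label_of["blue"] - label_of["green"])
label_blue_red = abs(label_of["blue"] - label_of["red"])

# With dummy columns, both pairs are equally far apart.
dummy_blue_green = euclidean(one_hot("blue"), one_hot("green"))
dummy_blue_red = euclidean(one_hot("blue"), one_hot("red"))
```

The trade-off is that dummy encoding adds one column per category, which grows quickly for variables with many distinct values.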

There is no implementation of k-prototypes in DSS, so if you need this model you would have to add a custom Python model.

If you need to encode your variables differently, you can do it from the Design tab of your model, in the Features handling section. When you click on a categorical variable, you will see a Category handling field where you can choose between different encodings or perform custom preprocessing with Python code.

Ouma · May 2021

Yes, I'll use a custom Python model. I have many categorical variables with a very large number of values, so encoding won't help me. Thank you @JeremieP!

Georghios · October 2023

@JeremieP, we just need the categorical features to be passed untouched to the custom clustering algorithm, such as KPrototypes.

Something like this should help:

from sklearn.preprocessing import FunctionTransformer

# Identity transformer: passes the column through to the model unchanged
processor = FunctionTransformer(func=lambda x: x)

Reznov · May 4

K-prototypes requires a list identifying the categorical columns as one of the arguments of its fit method. However, in a custom Python model in the Dataiku Lab we can only initialize the k-prototypes model object. The call to fit is written by Dataiku, and the user has no control over it. This causes an error, because the fit method of k-prototypes does not work without the list of categorical columns. Is there any way to solve this problem?

Georghios · May 4

Hi, I don't like referring you to generic documentation, but this page answers your question: https://doc.dataiku.com/dss/latest/machine-learning/algorithms/in-memory-python.html#custom-models

You can override the existing methods so that they behave the way you expect.
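
A minimal sketch of that override idea, under some assumptions: the underlying estimator's `fit` and `predict` accept a `categorical` keyword holding the positions of the categorical columns, and the custom model code only has to expose an estimator in a variable named `clf`. `FixedCategoricalMixin`, `categorical_columns`, `StandInClusterer` and `MixedClusterer` are names invented for this example; the stand-in class exists only so the sketch runs without a clustering library installed.

```python
class FixedCategoricalMixin:
    """Inject a stored `categorical` list into fit/predict calls.

    Assumption: the estimator this is mixed into accepts a `categorical`
    keyword (list of categorical column positions) in fit() and predict().
    """

    categorical_columns = ()  # hypothetical attribute; set it in the subclass

    def fit(self, X, y=None, **kwargs):
        # Only fill in the list when the caller did not pass one.
        kwargs.setdefault("categorical", list(self.categorical_columns))
        return super().fit(X, y, **kwargs)

    def predict(self, X, **kwargs):
        kwargs.setdefault("categorical", list(self.categorical_columns))
        return super().predict(X, **kwargs)


class StandInClusterer:
    """Toy stand-in for a real estimator that needs `categorical`."""

    def fit(self, X, y=None, categorical=None):
        if not categorical:
            raise ValueError("categorical columns are required")
        self.categorical_seen_ = categorical
        return self

    def predict(self, X, categorical=None):
        if not categorical:
            raise ValueError("categorical columns are required")
        # Toy rule: one cluster per distinct combination of categorical values.
        ids = {}
        keys = [tuple(row[i] for i in categorical) for row in X]
        return [ids.setdefault(key, len(ids)) for key in keys]

    def fit_predict(self, X, y=None, **kwargs):
        return self.fit(X, **kwargs).predict(X, **kwargs)


class MixedClusterer(FixedCategoricalMixin, StandInClusterer):
    categorical_columns = [1]  # position of the categorical column


# The caller never passes `categorical`; the mixin supplies it.
clf = MixedClusterer()
rows = [[1.0, "a"], [2.5, "b"], [0.3, "a"]]
labels = clf.fit_predict(rows)
```

With the kmodes package installed, the same pattern should work by replacing `StandInClusterer` with `KPrototypes` from `kmodes.kprototypes` and creating the model with something like `clf = MixedClusterer(n_clusters=3)`. Check that the column positions match the matrix the model actually receives after the feature handling step, since that depends on your preprocessing settings.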
