Clustering datasets having both numerical and categorical variables

Ouma
Level 3
Hello World!

Is there a way to directly cluster a dataset that has both numerical and categorical variables in Dataiku DSS?

Thank you!

9 Replies
JeremieP
Dataiker

Hi Ouma and welcome to Dataiku Community.

You can perform clustering in DSS, whatever the types of your variables, as follows:

  • Go to the Flow for your project

  • Click on the dataset you want to use

  • Select the Lab

  • Create a new visual analysis

  • Click on the Models tab

  • Select Create first model

  • Select Clustering

You will find more information on Clustering in the documentation.

Hope this helps.

 

Jérémie

Ouma
Level 3
Author

Hey @JeremieP !

Thanks for your answer, but the default clustering models in this section do not support clustering of mixed datasets.

JeremieP
Dataiker

Which of the default clustering models are you trying to use? KMeans, for instance, supports clustering of mixed datasets. There is an example of this in the Dataiku Gallery.

Ouma
Level 3
Author

Yes, that example uses encoders to transform the categorical variables into numeric ones, then applies KMeans so that a distance can be computed. In my case, however, there is no natural ordinal relationship between the categories within a variable, so assigning a number to each categorical level would be meaningless.

That is why I wanted to know whether Dataiku has an implementation of the k-prototypes algorithm, or another algorithm that handles this. Otherwise I will have to add a custom Python model.
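For readers unfamiliar with k-prototypes, here is a minimal sketch of the mixed dissimilarity it is usually described as using: squared Euclidean distance on the numeric columns plus a weighted count of mismatches on the categorical columns. The function name `mixed_dissimilarity`, the `categorical` index set, and the default weight `gamma=1.0` are illustrative choices for this sketch, not part of DSS or of any particular library.

```python
def mixed_dissimilarity(point, prototype, categorical, gamma=1.0):
    """Dissimilarity between a row and a cluster prototype.

    `categorical` holds the positions of the categorical columns;
    `gamma` weights the categorical part against the numeric part.
    Both names are illustrative assumptions.
    """
    numeric_part = 0.0
    mismatches = 0
    for i, (a, b) in enumerate(zip(point, prototype)):
        if i in categorical:
            # Categorical column: simple matching (0 if equal, 1 if not).
            mismatches += int(a != b)
        else:
            # Numeric column: squared difference.
            numeric_part += (a - b) ** 2
    return numeric_part + gamma * mismatches


row = [1.0, "red", 3.0]
prototype = [2.0, "blue", 3.0]
print(mixed_dissimilarity(row, prototype, categorical={1}, gamma=0.5))
```

No number is ever assigned to a category here, so no order between categories is implied.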

JeremieP
Dataiker

Dummy encoding is not about assigning a number to each category within the same variable. That is label encoding.
Dummy encoding creates one new variable per category of the original variable, then populates each new variable with 1 or 0 depending on the value of the original variable. This is the most common way to encode categorical variables before using them in machine learning models.
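To make the difference concrete, here is a small sketch using pandas on a made-up `color` column (the data and column name are purely illustrative, and this is not how DSS implements its encoders internally):

```python
import pandas as pd

# Toy data, invented for illustration only.
df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# Label encoding: one integer per category. The integers follow the
# alphabetical order of the categories, which implies an order and
# unequal gaps between categories that do not really exist.
label_encoded = df["color"].astype("category").cat.codes

# Dummy (one-hot) encoding: one 0/1 column per category, so every
# pair of distinct categories is equally far apart.
dummy_encoded = pd.get_dummies(df["color"], prefix="color", dtype=int)

print(label_encoded.tolist())
print(dummy_encoded)
```

With dummy encoding, a distance-based algorithm such as KMeans sees "red" as equally different from "green" and from "blue", which is not the case with label encoding.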

There is no implementation of k-prototypes in DSS, so if you need to use this model, you would have to add a custom Python model.

If you need to encode your variables differently, you can do it from the Design tab of your model, in the Features handling section. When you click on a categorical variable, you will see a Category handling field where you can choose between different encodings or perform custom preprocessing with Python code.

Ouma
Level 3
Author

Yes, I'll use a custom Python model. I have many categorical variables, each with a very large number of distinct values, so encoding won't help me. Thank you @JeremieP!

gjoseph
Level 2

@JeremieP, we just need the categorical features to be passed untouched to a custom clustering algorithm such as k-prototypes.

Something like this should help:

from sklearn.preprocessing import FunctionTransformer

# Identity transform: the feature is passed through unchanged
processor = FunctionTransformer(func=lambda x: x)

Reznov
Level 1

K-prototypes requires the list of categorical variables to be passed as an argument when calling its fit method. However, in a custom Python model in the Dataiku Lab, we can only initialize the k-prototypes model object. The call to fit is made by Dataiku's own code, and the user has no control over it. This causes an error, since the fit method of k-prototypes doesn't work without the list of categorical variables. Is there any way to solve this problem?

gjoseph
Level 2

Hi, I don't like pointing you to generic documentation, but this page answers your question: https://doc.dataiku.com/dss/latest/machine-learning/algorithms/in-memory-python.html#custom-models

You can override the existing methods so that they behave as you expect.
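One way to apply this, sketched under assumptions: wrap the clusterer in a small scikit-learn compatible class that stores the categorical columns at construction time and forwards them whenever fit or predict is called. The class name `FixedCategoricalClusterer` is hypothetical, and the sketch assumes the wrapped estimator accepts a `categorical` keyword in `fit` and `predict` and exposes `labels_` after fitting. Which column positions are categorical once DSS has preprocessed the data is something you would have to check yourself.

```python
from sklearn.base import BaseEstimator, ClusterMixin, clone


class FixedCategoricalClusterer(ClusterMixin, BaseEstimator):
    """Hypothetical wrapper: supplies `categorical` to the inner estimator."""

    def __init__(self, estimator, categorical):
        self.estimator = estimator      # unfitted clusterer to wrap
        self.categorical = categorical  # positions of categorical columns

    def fit(self, X, y=None):
        # The caller only passes X; the stored column list is added here.
        self.estimator_ = clone(self.estimator)
        self.estimator_.fit(X, categorical=self.categorical)
        self.labels_ = self.estimator_.labels_
        return self

    def predict(self, X):
        return self.estimator_.predict(X, categorical=self.categorical)


# Possible usage in a custom model (assumptions: the third-party kmodes
# package is installed, and the model object is assigned to `clf`):
# from kmodes.kprototypes import KPrototypes
# clf = FixedCategoricalClusterer(KPrototypes(n_clusters=3), categorical=[1])
```

Because `fit` takes only the data, code that calls `fit(X)` or `fit_predict(X)` without extra arguments no longer needs to know about the categorical columns.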
