Clustering datasets having both numerical and categorical variables
Hello World!
I would like to know if there is a way to directly perform Clustering while having both numerical and categorical variables in Dataiku DSS?
Thank you!
Answers
-
Hi Ouma and welcome to Dataiku Community.
You can perform clustering in DSS, whatever the types of your variables, as follows:
Go to the Flow for your project
Click on the dataset you want to use
Select the Lab
Create a new visual analysis
Click on the Models tab
Select Create first model
Select Clustering
You will find more information on Clustering in the documentation.
Hope this helps.
Jérémie
-
Which clustering model are you trying to use? If you select KMeans, for instance, it supports clustering on mixed datasets. You can find an example of this in the Dataiku Gallery.
-
Ouma Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Core Concepts, Registered Posts: 12 ✭✭✭✭
Yes, in this example encoders are used to transform the categorical variables into numeric ones, and KMeans is then applied so that distances can be computed. In my case, however, there is no natural ordinal relationship between the categories within a variable, so assigning a number to each categorical level would be meaningless.
That is why I wanted to know whether Dataiku has an implementation of k-prototypes, or of another algorithm that handles this. Otherwise I will have to add a custom Python model.
-
Dummy encoding is not about assigning a number to each category within a variable; that is label encoding.
Dummy encoding creates one new variable per category of the original variable, then fills each new variable with 1 or 0 depending on the value of the original variable. This is the most common way to encode categorical variables before using them in machine learning models.
There is no implementation of k-prototypes in DSS, so if you need this model you will have to add a custom Python model.
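To illustrate the difference between label encoding and dummy encoding described above, here is a minimal sketch in plain Python. The colour values are made up for the example, and this is not how DSS implements its encoders. It only shows why dummy encoding does not impose an order on the categories.

```python
import math

colors = ["red", "green", "blue"]  # made-up category values with no natural order

# Label encoding: one number per category, which implies an order.
label = {c: float(i) for i, c in enumerate(colors)}

# Dummy (one-hot) encoding: one 0/1 column per category.
dummy = {c: [1.0 if c == other else 0.0 for other in colors] for c in colors}


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Label encoding makes "red" look twice as far from "blue" as from "green".
label_red_green = abs(label["red"] - label["green"])
label_red_blue = abs(label["red"] - label["blue"])

# Dummy encoding keeps every pair of distinct categories equally far apart.
dummy_red_green = euclidean(dummy["red"], dummy["green"])
dummy_red_blue = euclidean(dummy["red"], dummy["blue"])
```

With label encoding the two distances differ (1.0 and 2.0), while with dummy encoding both are the square root of 2, so a distance-based algorithm such as KMeans does not see any artificial ordering.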
If you need to encode your variables differently, you can do it from the Design tab of your model, in the Features handling section. When you click on a categorical variable, you will see a Category handling field where you can choose between different ways to encode it, or perform custom preprocessing with Python code.
-
Georghios Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 15 ✭✭✭
@JeremieP, we just need the categorical features to be passed untouched to the custom clustering algorithm, such as KPrototype. Something like this should help:
from sklearn.preprocessing import FunctionTransformer
processor = FunctionTransformer(func=lambda x: x)
-
K-prototypes requires the list of categorical variables to be passed as an argument when calling the fit method. However, in a Dataiku Lab custom Python model we can only initialize the model object. The call to fit is written by Dataiku, and the user has no control over it. This causes an error, since the fit method of k-prototypes does not work without the list of categorical variables. Is there any way to solve this problem?
-
Georghios Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 15 ✭✭✭
Hi, I don't like to refer you to generic documentation, but this page answers your question: https://doc.dataiku.com/dss/latest/machine-learning/algorithms/in-memory-python.html#custom-models
You can override the existing methods and make them behave as you expect.
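One way to read this advice is to wrap the model in a small class that stores the categorical columns when it is created and forwards them every time fit is called. The sketch below is only an illustration of that idea. FixedCategoricalClusterer and StubPrototypes are hypothetical names, and the stub stands in for a real k-prototypes model so that the example runs on its own. The `categorical` keyword follows my understanding of the kmodes package, where it lists column positions. Which methods DSS actually calls on a custom model is an assumption to check against the page linked above.

```python
class FixedCategoricalClusterer:
    """Wrap a clusterer whose fit() needs a `categorical` argument.

    The column positions are stored at construction time, so code that only
    ever calls fit(X) or fit_predict(X) still forwards them.
    """

    def __init__(self, estimator, categorical):
        self.estimator = estimator            # e.g. a k-prototypes model (assumption)
        self.categorical = list(categorical)  # positions of categorical columns in X

    def fit(self, X, y=None):
        self.estimator.fit(X, categorical=self.categorical)
        self.labels_ = self.estimator.labels_
        return self

    def fit_predict(self, X, y=None):
        return self.fit(X).labels_

    def predict(self, X):
        return self.estimator.predict(X, categorical=self.categorical)


class StubPrototypes:
    """Hypothetical stand-in that refuses to fit without `categorical`."""

    def fit(self, X, categorical=None):
        if not categorical:
            raise ValueError("categorical column positions are required")
        self.seen_categorical_ = categorical
        self.labels_ = [0] * len(X)  # dummy labels, no real clustering is done
        return self

    def predict(self, X, categorical=None):
        return [0] * len(X)


X = [[1.0, "a"], [2.0, "b"], [3.0, "a"]]
clf = FixedCategoricalClusterer(StubPrototypes(), categorical=[1])
labels = clf.fit_predict(X)  # no `categorical` argument needed at call time
```

If the platform also needs scikit-learn style parameter handling (get_params and set_params), inheriting from sklearn.base.BaseEstimator is one possible addition. That is not shown here, to keep the sketch free of dependencies.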