I would like to know whether there is a way to perform clustering directly on a mix of numerical and categorical variables in Dataiku DSS.
Hi Ouma and welcome to Dataiku Community.
You can perform clustering in DSS, whatever the types of your variables, as follows:
Go to the Flow for your project
Click on the dataset you want to use
Select the Lab
Create a new visual analysis
Click on the Models tab
Select Create first model
You will find more information on Clustering in the documentation.
Hope this helps.
Yes, but that example uses encoders to transform the categorical variables into numbers and then applies KMeans so that distances can be computed. In my case, there is no natural ordinal relationship between the categories within a variable, so assigning a number to each categorical level would be meaningless.
That is why I wanted to know whether Dataiku has an implementation of the k-prototypes algorithm, or another algorithm that handles this. Otherwise I will have to add a custom Python model.
Dummy encoding does not assign a number to each category within a variable. That is label encoding.
Dummy encoding creates one new variable per category of the original variable, then fills each new variable with 1 or 0 depending on the value of the original variable. This is the most common way to encode categorical variables before using them in machine learning models.
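To make the difference concrete, here is a minimal sketch in pandas, run outside DSS. The `color` column and its values are made up for illustration. Label encoding yields integers that imply an order, while dummy encoding yields one 0/1 column per category.

```python
import pandas as pd

# Hypothetical toy data: "color" has no natural order between its categories.
df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# Label encoding: one integer per category (categories are sorted alphabetically).
label_encoded = df["color"].astype("category").cat.codes

# Dummy (one-hot) encoding: one 0/1 column per category.
dummies = pd.get_dummies(df["color"], prefix="color", dtype=int)

print(label_encoded.tolist())
print(dummies)
```

With dummy encoding, any two different categories are equally far apart, so no artificial order is introduced.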
There is no implementation of k-prototypes in DSS, so if you need this algorithm, you will have to add a custom Python model.
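For reference, the idea behind k-prototypes is a mixed dissimilarity: squared Euclidean distance on the numerical features plus a weighted count of mismatches on the categorical ones. The sketch below shows only that distance, not a full clustering model and not DSS code. The function name `mixed_dissimilarity`, the sample values, and the weight `gamma=0.5` are assumptions chosen for illustration.

```python
def mixed_dissimilarity(a_num, a_cat, b_num, b_cat, gamma):
    """k-prototypes style dissimilarity between two records.

    a_num, b_num: numerical feature values.
    a_cat, b_cat: categorical feature values.
    gamma: weight of the categorical part (a tuning choice, assumed here).
    """
    # Numerical part: squared Euclidean distance.
    numeric_part = sum((x - y) ** 2 for x, y in zip(a_num, b_num))
    # Categorical part: number of features whose categories differ.
    mismatches = sum(1 for x, y in zip(a_cat, b_cat) if x != y)
    return numeric_part + gamma * mismatches


# Hypothetical records: two numerical features and one categorical feature.
d = mixed_dissimilarity([1.0, 2.0], ["red"], [1.0, 4.0], ["blue"], gamma=0.5)
print(d)
```

In practice the numerical features are usually rescaled first, since otherwise the numerical part can dominate the categorical one.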
If you need to encode your variables differently, you can do it from the Design tab of your model, in the Features handling section. When you click on a categorical variable, you will see a Category handling field where you can choose between different encodings or apply custom preprocessing with Python code.