Can you fix the K-means train results?

kamegai_satoshi · 11-18-2020

We fixed the seed in K-means, but each training run still produced different clusters. Specifically, the variable importance and the silhouette score differed between runs.

Are there any other settings besides the seed that are needed to fix the results? For business use, results that change on every run under the same conditions are a big problem.

Steps to reproduce:

1. Create a project by importing the sample project "Predicting Churn".
2. Change the settings of the Visual Analysis "Clustering customers into segments":
- Algorithms > KMeans > Seed = 1000
3. Run TRAIN on "Clustering customers into segments".

CoreyS · 01-04-2021

Hi, @kamegai_satoshi! Can you provide further details on the thread to help other users find a solution (for example, your DSS version)? Also, can you let us know if you've tried any fixes already? This should lead to a quicker response from the community.


EmmaH · 03-17-2021

Hi @kamegai_satoshi ,

Dataiku's Visual ML uses the scikit-learn KMeans clustering algorithm, so the seed that you're defining is passed as the random_state parameter. With a fixed random_state, the centroids are initialized the same way each time you rerun the algorithm on the same data.
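To see what a fixed random_state does on its own, here is a minimal sketch in plain scikit-learn, outside DSS. It fits KMeans twice with the same seed and compares the labels and silhouette score. The data is synthetic (make_blobs) and the helper name fit_once is made up for illustration; this is not the Predicting Churn dataset or Dataiku's internal code.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in data (an assumption, not the Predicting Churn dataset).
X, _ = make_blobs(n_samples=300, centers=5, random_state=0)

def fit_once(data, seed):
    """Fit KMeans with a fixed seed and return (labels, silhouette score)."""
    km = KMeans(n_clusters=5, n_init=1, random_state=seed).fit(data)
    return km.labels_, silhouette_score(data, km.labels_)

# Two independent fits with the same seed on the same data.
labels_a, sil_a = fit_once(X, 1000)
labels_b, sil_b = fit_once(X, 1000)

same_labels = bool(np.array_equal(labels_a, labels_b))
same_silhouette = bool(np.isclose(sil_a, sil_b))
print("same labels:", same_labels, "| same silhouette:", same_silhouette)
```

If the two fits agree here but not in your project, it may be worth checking whether the data reaching the algorithm (sampling, row order, preprocessing) is identical between trainings.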

From the KMeans doc:

If the algorithm stops before fully converging (because of ``tol`` or
``max_iter``), ``labels_`` and ``cluster_centers_`` will not be consistent,
i.e. the ``cluster_centers_`` will not be the means of the points in each
cluster. Also, the estimator will reassign ``labels_`` after the last
iteration to make ``labels_`` consistent with ``predict`` on the training
set.

Essentially, if the algorithm hits the maximum number of iterations before converging, the cluster centers and labels it reports may not be consistent with each other.
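One way to check whether a fit stopped at the iteration cap is to compare the fitted n_iter_ attribute against max_iter. The sketch below does this in plain scikit-learn on synthetic data; the helper name hit_iteration_cap is hypothetical. Note that n_iter_ equal to max_iter only suggests the cap was reached, since a run can also converge exactly on its last allowed iteration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in data (an assumption, not the Predicting Churn dataset).
X, _ = make_blobs(n_samples=300, centers=5, random_state=0)

def hit_iteration_cap(data, max_iter):
    """Return (capped, n_iter): capped is True if the fit used all max_iter iterations."""
    km = KMeans(n_clusters=5, n_init=1, max_iter=max_iter,
                random_state=1000).fit(data)
    return km.n_iter_ >= km.max_iter, km.n_iter_

# A deliberately tiny cap forces an early stop; the default cap (300) leaves room to converge.
capped_low, iters_low = hit_iteration_cap(X, 1)
capped_high, iters_high = hit_iteration_cap(X, 300)
print("max_iter=1   -> capped:", capped_low, "iterations:", iters_low)
print("max_iter=300 -> capped:", capped_high, "iterations:", iters_high)
```

If a real training run reports n_iter_ equal to max_iter, raising max_iter is a reasonable first thing to try.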

A possible solution in Dataiku is to add a custom Python model with an increased number of iterations:

from sklearn.cluster import KMeans

# max_iter defaults to 300 in scikit-learn; 1000 here is only an illustrative value
clf = KMeans(n_clusters=5, n_init=1, max_iter=1000, random_state=1000)

Best,

Emma
