Can you fix the K-means train results?

kamegai_satoshi · November 2020

We fixed the seed in K-means, but each training run produced different clusters. Specifically, the variable importance and silhouette values differed between runs.

Are any settings other than the seed needed to make the results reproducible? Results that change on every run under the same conditions are a big problem for business use.

Steps to reproduce:

Create a project by importing the sample project (name=Predicting Churn).
Change the settings of the Visual Analysis (name=Clustering customers into segments).
- Algorithms > KMeans > Seed = 1000
Run TRAIN on "Clustering customers into segments".

CoreyS · January 2021

Hi @kamegai_satoshi! Can you provide further details in this thread (for example, your DSS version) to help users find a solution? Also, can you let us know whether you've already tried any fixes? This should lead to a quicker response from the community.

EmmaH · March 2021

Hi @kamegai_satoshi,

Dataiku uses the scikit-learn KMeans clustering algorithm in Visual ML, so the seed that you're defining is the random_state parameter. This parameter makes the random centroid initialization deterministic, so the same initial centroids are used each time you rerun the algorithm.
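As a quick sanity check outside DSS, here is a minimal sketch that fits scikit-learn's KMeans twice with the same random_state and compares the results. The data is a synthetic stand-in generated with make_blobs (an assumption, not the Predicting Churn dataset), and fit_once is a hypothetical helper name.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in data (an assumption, not the Predicting Churn dataset).
X, _ = make_blobs(n_samples=300, centers=5, n_features=4, random_state=0)

def fit_once(seed):
    """Fit KMeans with a fixed seed; return centers, labels and silhouette."""
    model = KMeans(n_clusters=5, n_init=10, random_state=seed).fit(X)
    return model.cluster_centers_, model.labels_, silhouette_score(X, model.labels_)

centers_a, labels_a, sil_a = fit_once(1000)
centers_b, labels_b, sil_b = fit_once(1000)

# With identical data and seed, both runs should agree.
same_centers = np.allclose(centers_a, centers_b)
same_labels = np.array_equal(labels_a, labels_b)
same_silhouette = np.isclose(sil_a, sil_b)
print(same_centers, same_labels, same_silhouette)
```

If a comparison like this agrees in plain scikit-learn but the DSS results still differ, the difference may come from something other than the KMeans seed, such as the data passed to each training run.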

From the KMeans doc:

If the algorithm stops before fully converging (because of ``tol`` or
``max_iter``), ``labels_`` and ``cluster_centers_`` will not be consistent,
i.e. the ``cluster_centers_`` will not be the means of the points in each
cluster. Also, the estimator will reassign ``labels_`` after the last
iteration to make ``labels_`` consistent with ``predict`` on the training
set.

Essentially, the algorithm hits the maximum number of iterations (max_iter) before it converges, so it cannot assign consistent labels.

A solution for this in Dataiku could be to add a custom Python model with an increased number of iterations:

from sklearn.cluster import KMeans

# max_iter defaults to 300 in scikit-learn; the value below is only illustrative.
clf = KMeans(n_clusters=5, n_init=1, max_iter=1000, random_state=1000)
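To see whether a fit actually stopped at the iteration cap, one option is to compare the fitted model's n_iter_ attribute with max_iter. The sketch below does this on synthetic make_blobs data (an assumption, not the sample project's dataset); hit_iteration_cap is a hypothetical helper name. Note that n_iter_ equal to max_iter is only a hint, since a run can also converge on its last allowed iteration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in data (an assumption, not the sample project's dataset).
X, _ = make_blobs(n_samples=500, centers=5, n_features=4, random_state=0)

def hit_iteration_cap(max_iter):
    """Return (iterations used, whether every allowed iteration was used)."""
    model = KMeans(n_clusters=5, n_init=1, max_iter=max_iter,
                   random_state=1000).fit(X)
    return model.n_iter_, bool(model.n_iter_ >= max_iter)

# A deliberately tiny cap forces an early stop; the default cap of 300 leaves room to converge.
iters_low, capped_low = hit_iteration_cap(1)
iters_high, capped_high = hit_iteration_cap(300)
print(iters_low, capped_low)
print(iters_high, capped_high)
```

If the run with the larger cap reports fewer iterations than max_iter, the iteration limit is probably not what is cutting the fit short.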

Best,

Emma
