How can we fix the K-means train results so they are reproducible?
We fixed the seed in K-means, but the clusters we got for each training run were different. Specifically, the Variables importance and the silhouette score were different.
Are there any other settings besides the seed needed to fix the results? Results that change every time under the same conditions are a big problem for business use.
Steps to reproduce are as follows.
- Create a project by importing the sample project (name=Predicting Churn).
- Change the settings of the Visual Analysis (name=Clustering customers into segments).
- Algorithms > KMeans > Seed = 1000
- Run the TRAIN of "Clustering customers into segments".
Answers
CoreyS (Dataiker Alumni)
Hi @kamegai_satoshi! Can you provide any further details on the thread to help other users find a solution (for example, your DSS version)? Also, can you let us know if you've tried any fixes already? This should lead to a quicker response from the community.
Hi @kamegai_satoshi,
Dataiku uses the Scikit-learn KMeans clustering algorithm in the Visual ML, so the seed that you're defining is the random_state parameter. This parameter makes KMeans pick the same randomly initialized points as initial centroids each time you rerun the algorithm.
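As a quick illustration (outside Dataiku, with made-up data and illustrative parameter values), fixing random_state pins the initialization, so two runs on the same data are expected to produce identical clusters when nothing else changes:

import numpy as np
from sklearn.cluster import KMeans

# Made-up stand-in data; in Dataiku this would be the prepared training set.
X = np.random.RandomState(0).rand(200, 3)

# Seed = 1000 in the Visual ML UI corresponds to random_state=1000 here.
labels_a = KMeans(n_clusters=5, random_state=1000).fit(X).labels_
labels_b = KMeans(n_clusters=5, random_state=1000).fit(X).labels_

print((labels_a == labels_b).all())  # expected to print True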
From the KMeans doc:
If the algorithm stops before fully converging (because of ``tol`` or
``max_iter``), ``labels_`` and ``cluster_centers_`` will not be consistent,
i.e. the ``cluster_centers_`` will not be the means of the points in each
cluster. Also, the estimator will reassign ``labels_`` after the last
iteration to make ``labels_`` consistent with ``predict`` on the training
set.

Essentially, the algorithm hits the maximum number of iterations before it can assign a consistent label.
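One rough way to check whether that is what's happening (sketched here with made-up data and an illustrative max_iter) is to compare the number of iterations the fitted model actually ran against its cap:

import numpy as np
from sklearn.cluster import KMeans

X = np.random.RandomState(0).rand(200, 3)  # stand-in for the training data

clf = KMeans(n_clusters=5, max_iter=300, random_state=1000).fit(X)
if clf.n_iter_ >= clf.max_iter:
    print("Stopped on max_iter; labels_ and cluster_centers_ may be inconsistent.")
else:
    print("Converged after", clf.n_iter_, "iterations.")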
A solution for this in Dataiku could be to add a custom Python model with an increased number of iterations:
from sklearn.cluster import KMeans

# max_iter raised above the scikit-learn default of 300 (1000 is an example value)
clf = KMeans(n_clusters=5, n_init=1, max_iter=1000, random_state=1000)
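The snippet assigns the model to a variable named clf, which is what the custom model code panel expects. If training still stops short of convergence, max_iter (default 300) and tol (default 1e-4) are the scikit-learn parameters to adjust.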
Best,
Emma