Can you fix the KMeans training results?
We fixed the seed in KMeans, but each training run produced different clusters. Specifically, the variable importance and the silhouette score differed between runs.
Are there any settings other than the seed that are needed to make the results reproducible? Results that change every time under the same conditions are a big problem for business use.
Steps to reproduce:
 Create a project by importing the sample project (name=Predicting Churn).
 Change the settings of the Visual Analysis (name=Clustering customers into segments):
 Algorithms > KMeans > Seed = 1000
 Run TRAIN on "Clustering customers into segments".
Answers

CoreyS (Dataiker Alumni)
Hi @kamegai_satoshi! Can you provide any further details on the thread to help users find a solution (for example, your DSS version)? Also, can you let us know if you've tried any fixes already? This should lead to a quicker response from the community.
Hi @kamegai_satoshi,
Dataiku uses the scikit-learn KMeans clustering algorithm in the Visual ML, so the seed that you're defining is the random_state parameter. With a fixed value, the algorithm starts from the same randomly initialized centroids each time you rerun it.
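To illustrate what a fixed random_state does, here is a minimal sketch in plain scikit-learn, run outside DSS. The synthetic data from make_blobs and the helper name fit_centers are assumptions for illustration only; this is not how DSS calls the algorithm internally.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data standing in for the real dataset (assumption for illustration).
X, _ = make_blobs(n_samples=300, centers=5, random_state=0)

def fit_centers(seed):
    """Fit KMeans with a single initialization and return the final centroids."""
    model = KMeans(n_clusters=5, n_init=1, random_state=seed).fit(X)
    return model.cluster_centers_

# Two fits on the same data with the same seed start from the same
# initial centroids, so they should end at the same final centroids.
same_seed_match = np.allclose(fit_centers(1000), fit_centers(1000))
print("Same seed gives same centers:", same_seed_match)
```

If this check passes on your data but results in DSS still differ, the variation likely comes from something other than the centroid initialization, such as the input data or preprocessing changing between runs.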
From the KMeans doc:
If the algorithm stops before fully converging (because of ``tol`` or
``max_iter``), ``labels_`` and ``cluster_centers_`` will not be consistent,
i.e. the ``cluster_centers_`` will not be the means of the points in each
cluster. Also, the estimator will reassign ``labels_`` after the last
iteration to make ``labels_`` consistent with ``predict`` on the training
set.
Essentially, the algorithm hits the maximum number of iterations before it can assign consistent labels.
A solution for this in Dataiku could be to add a custom Python model with an increased number of iterations:
from sklearn.cluster import KMeans
clf = KMeans(n_clusters=5, n_init=1, random_state=1000)
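Note that the snippet above does not set max_iter, so scikit-learn's default of 300 applies. A minimal sketch of raising the cap and checking whether the fit stopped before reaching it follows. The value 1000 and the synthetic data are assumptions for illustration; in a DSS custom model only the clf definition would be needed.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data standing in for the real dataset (assumption for illustration).
X, _ = make_blobs(n_samples=500, centers=5, random_state=0)

max_iter = 1000  # illustrative value; scikit-learn's default is 300
clf = KMeans(n_clusters=5, n_init=1, max_iter=max_iter, random_state=1000)
clf.fit(X)

# n_iter_ is the number of iterations actually run. If it is below max_iter,
# the algorithm stopped on its own rather than being cut off by the cap.
converged = clf.n_iter_ < max_iter
print(f"iterations run: {clf.n_iter_}, stopped before cap: {converged}")
```

If n_iter_ equals max_iter, the run may have been cut off, and increasing max_iter further would be the next thing to try.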
Best,
Emma