Can you fix the K-means train results?

kamegai_satoshi
kamegai_satoshi Partner, Dataiku DSS Core Designer, Dataiku DSS Core Concepts, Registered Posts: 1 Partner

We fixed the seed in K-means, but each training run still produced different clusters. Specifically, the variable importance and the silhouette score differed between runs.

Are there any other settings besides the seed that are needed to make the results reproducible? Results that change on every run under the same conditions are a big problem for business use.

Steps to reproduce:

  • Create a project by importing the sample project (name=Predicting Churn).
  • Change the settings of the Visual Analysis (name=Clustering customers into segments).
    • Algorithms > KMeans > Seed = 1000
  • Run TRAIN on "Clustering customers into segments".

Answers

  • CoreyS
    CoreyS Dataiker Alumni, Dataiku DSS Core Designer, Dataiku DSS Core Concepts, Registered Posts: 1,150 ✭✭✭✭✭✭✭✭✭

    Hi, @kamegai_satoshi! Can you provide any further details on the thread to help users find a solution (for example, your DSS version)? Also, can you let us know if you've tried any fixes already? This should lead to a quicker response from the community.

  • EmmaH
    EmmaH Dataiker Posts: 4 Dataiker
    edited July 17

    Hi @kamegai_satoshi,

    Dataiku uses the scikit-learn KMeans clustering algorithm in Visual ML, so the seed that you're defining is the random_state parameter. This parameter seeds the random number generator used for centroid initialization, so the same initial centroids are chosen each time you rerun the algorithm.
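
As a rough illustration of what random_state does in plain scikit-learn (outside DSS), the sketch below fits KMeans twice with the same seed and compares the results. The synthetic data from make_blobs and the helper name fit_kmeans are assumptions made for the example; they are not part of the Predicting Churn project, and the sketch says nothing about other sources of randomness in a DSS training run.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in data (an assumption; not the Predicting Churn dataset).
X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

def fit_kmeans(seed):
    # n_init=1 runs a single initialization, so the seed alone fixes the starting centroids.
    return KMeans(n_clusters=4, n_init=1, random_state=seed).fit(X)

first = fit_kmeans(1000)
second = fit_kmeans(1000)

# Compare the two fits: identical labels and (numerically) identical centroids
# indicate that the seed made the run repeatable on this data.
same_labels = np.array_equal(first.labels_, second.labels_)
same_centers = np.allclose(first.cluster_centers_, second.cluster_centers_)
print(same_labels, same_centers)
```

If the two fits agree here but differ inside DSS, the difference is likely coming from somewhere other than the centroid initialization itself.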

    From the KMeans doc:

    If the algorithm stops before fully converging (because of ``tol`` or
    ``max_iter``), ``labels_`` and ``cluster_centers_`` will not be consistent,
    i.e. the ``cluster_centers_`` will not be the means of the points in each
    cluster. Also, the estimator will reassign ``labels_`` after the last
    iteration to make ``labels_`` consistent with ``predict`` on the training
    set.

    Essentially, the algorithm hits the maximum number of iterations (max_iter) before it converges, so the labels it assigns are not consistent.

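    A solution for this in Dataiku could be to add a custom Python model with an increased number of iterations:

    from sklearn.cluster import KMeans

    # max_iter is raised above scikit-learn's default of 300; 1000 is an illustrative value
    clf = KMeans(n_clusters=5, n_init=1, max_iter=1000, random_state=1000)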
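
To see whether a fit actually stopped at the iteration cap, one option is to compare the fitted model's n_iter_ attribute with max_iter. This is only a sketch: the synthetic data and the max_iter value of 1000 are assumptions for illustration, and the flag name hit_cap is hypothetical.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in data (an assumption; replace with the real feature matrix).
X, _ = make_blobs(n_samples=300, centers=5, random_state=0)

max_iter = 1000  # illustrative cap, higher than scikit-learn's default of 300
clf = KMeans(n_clusters=5, n_init=1, max_iter=max_iter, random_state=1000).fit(X)

# n_iter_ is the number of iterations the fit actually ran.
# If it equals max_iter, the run was cut off before converging.
hit_cap = clf.n_iter_ >= max_iter
print(clf.n_iter_, hit_cap)
```

If hit_cap is True on the real data, raising max_iter further (or loosening tol) would be the next thing to try.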

    Best,

    Emma
