Getting KMeans Cluster Labels
UserBird
Dataiker, Alpha Tester Posts: 535 Dataiker
I am trying to implement a kmeans clustering algorithm and find the associated cluster labels.
For instance, say I have 3 clusters built from telephone call data, and I know that 10 of the numbers are fraudulent. I want to see which cluster the majority of them fall into so that I can name that cluster "Fraud". Additionally, if one or two of the numbers show up in a second cluster, I would name that one "Maybe Fraud", and the last cluster "Not Fraud".
Where would I find the output data that would have these classifications?
Moreover, if a record is mislabeled, how would I change that label, and would the algorithm "readjust" itself and the prior data, or only future data?
Answers

Hi Dave,
To find the output data, you need to deploy your clustering model in the flow.
DSS offers two options when you deploy clustering models:
* Either do a full retrain/recluster each time you run the "clustering recipe". In that situation, the centroids, and therefore the definitions of the clusters, are not "stable", so you can't keep names. You can use a preparation recipe afterwards to name your clusters, but the definitions of the clusters may change from one run to the next.
* Or deploy a "model" to the Flow (the green diamond) and use separate "training" and "scoring" recipes. In that situation, the same centroids are kept between runs, so the names that you set in the Model summary screen are propagated to the output dataset.
Note that this second option is only valid for clustering algorithms that have a notion of centroid (like KMeans).
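To illustrate why fixed centroids let the names stay stable, here is a minimal plain-Python sketch. It is not DSS code and does not use any DSS API. The centroids, phone numbers, feature values, and the helper names `assign_cluster` and `name_clusters` are all made up for the example. It scores each record against frozen centroids, then names the clusters by counting where the known fraudulent numbers land.

```python
from collections import Counter
from math import dist

# Hypothetical centroids, standing in for the ones a deployed KMeans model keeps.
CENTROIDS = [(1.0, 1.0), (5.0, 5.0), (9.0, 1.0)]


def assign_cluster(point, centroids=CENTROIDS):
    """Return the index of the nearest centroid (the KMeans scoring step)."""
    return min(range(len(centroids)), key=lambda i: dist(point, centroids[i]))


def name_clusters(records, known_fraud, centroids=CENTROIDS):
    """Name each cluster by how many known-fraud numbers fall into it."""
    hits = Counter(
        assign_cluster(features, centroids)
        for number, features in records.items()
        if number in known_fraud
    )
    top = hits.most_common(1)[0][0] if hits else None
    names = {}
    for i in range(len(centroids)):
        if i == top:
            names[i] = "Fraud"        # cluster holding most known-fraud numbers
        elif hits[i] > 0:
            names[i] = "Maybe Fraud"  # some known-fraud numbers, but not the most
        else:
            names[i] = "Not Fraud"    # no known-fraud numbers at all
    return names


# Made-up feature vectors (assumed already scaled), keyed by phone number.
records = {
    "555-0001": (0.8, 1.2),
    "555-0002": (1.1, 0.9),
    "555-0003": (5.2, 4.8),
    "555-0004": (9.1, 1.3),
    "555-0005": (1.3, 1.1),
}
known_fraud = {"555-0001", "555-0002", "555-0003"}

names = name_clusters(records, known_fraud)
labeled = {
    number: names[assign_cluster(features)]
    for number, features in records.items()
}
```

Because the centroids do not change between runs, a new record scored later gets the same cluster index, and therefore the same name, as similar records scored earlier. With a full recluster on every run, the indices themselves can change, so the `names` mapping would have to be rebuilt each time.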
Hope this helps!