Getting KMeans Cluster Labels

I am trying to implement a k-means clustering algorithm and find the associated cluster labels.

For instance, suppose I have 3 clusters over telephone-call data and I know that 10 of the numbers are fraudulent. I want to see which cluster the majority of them fall into so that I can name that cluster "Fraud". If one or two of the numbers show up in a second cluster, I would name that one "Maybe Fraud", and I would name the last cluster "Not Fraud".

Where would I find the output data that would have these classifications?

Moreover, if a record is mislabeled, how would I change that label, and would the algorithm "readjust" itself and the prior data, or only future data?
1 Reply
Hi Dave,

To find the output data, you need to deploy your clustering model to the Flow.

DSS offers two options when you deploy clustering models:

* Either do a full retrain/recluster each time you run the "clustering recipe". In that situation, the centroids, and therefore the definitions of the clusters, are not "stable" between runs, so you can't keep names. You can use a preparation recipe afterwards to name your clusters, but the definition of each cluster may change from one run to the next.

* Or deploy a "model" to the Flow (the green diamond) and use separate "training" and "scoring" recipes. In that situation, the same centroids are kept between runs, so the names that you set in the Model summary screen are propagated to the output dataset.

Note that this second option is only valid for clustering algorithms that have a notion of centroid (like KMeans).
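To illustrate why keeping the same centroids lets the names stick, here is a minimal sketch outside DSS, using scikit-learn's KMeans as a stand-in. It is not the DSS API or what DSS runs internally; the data, the two features, and the variable names are all made up for the example. The model is fitted once ("training"), then only `predict` is called ("scoring"), and the clusters are named by counting where the known fraudulent numbers land.

```python
from collections import Counter

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Hypothetical data: two numeric features per phone number,
# drawn around three well-separated group centers.
centers = np.array([[0.0, 0.0], [10.0, 10.0], [20.0, 0.0]])
X_train = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])

# "Training": fit once and keep the fitted model, i.e. its centroids.
model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_train)

# "Scoring": predict() assigns rows to the nearest stored centroid
# without refitting, so cluster ids mean the same thing on every run.
known_fraud = centers[1] + rng.normal(scale=0.5, size=(10, 2))
fraud_clusters = model.predict(known_fraud)

# Name the clusters from where the known fraudulent numbers fall.
counts = Counter(fraud_clusters.tolist())
fraud_cluster = counts.most_common(1)[0][0]

names = {cluster_id: "Not Fraud" for cluster_id in range(3)}
for cluster_id in counts:
    if cluster_id != fraud_cluster:
        names[cluster_id] = "Maybe Fraud"  # a few known frauds, not the majority
names[fraud_cluster] = "Fraud"

# New rows scored later receive the same names, since the centroids are unchanged.
new_rows = np.array([[10.2, 9.8], [0.1, -0.3]])
new_labels = [names[c] for c in model.predict(new_rows).tolist()]
print(new_labels)
```

If the model were refitted on every run instead, the cluster ids and centroids could come out differently each time, and a name mapping like `names` would have to be rebuilt after each fit.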

Hope this helps!