Hi all,
I am testing our clustering on a data table of customer records. As a starting point I tried Interactive Clustering on a dataset of 6.5 million records, with approx. 10 columns containing stats on each record's behaviour.
I am slightly puzzled by the results, shown below:
I have almost all my customers in a single cluster, with only a handful falling into other clusters - why would clustering yield such results and how might I go about making the clusters more useful?
Ben
Hi Ben,
In interactive clustering, we first run a k-means algorithm.
K-means is sensitive to outliers and noise, so in your case you end up with almost all the observations in a single cluster and 4 small clusters of outliers.
To get better results, you can try the Outliers Detection setting in the Design part: Create a cluster with outliers.
That way you'll have only one cluster containing the outliers.
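To make the outlier effect concrete, here is a minimal, standard-library-only sketch. It is not Dataiku's implementation: the synthetic data, the farthest-first seeding (a deterministic stand-in for distance-weighted seeding such as k-means++), and the `split_outliers` helper with its `threshold` parameter are all assumptions made for illustration. It shows a few extreme rows each capturing a centroid, and then what changes once those rows are set aside in their own group before clustering.

```python
import random
from statistics import median

def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def farthest_first_init(points, k):
    """Deterministic seeding: start at points[0], then repeatedly add the
    point that is farthest from every centre chosen so far."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centers)))
    return centers

def kmeans(points, k, n_iter=20):
    """Plain Lloyd iterations; returns one cluster label per point."""
    centers = farthest_first_init(points, k)
    labels = []
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centre.
        labels = [min(range(k), key=lambda j: sq_dist(p, centers[j])) for p in points]
        # Update step: each centre moves to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels

def cluster_sizes(labels, k):
    """Cluster sizes, largest first."""
    return sorted((labels.count(j) for j in range(k)), reverse=True)

def split_outliers(points, threshold=10.0):
    """Hypothetical helper: flag points whose distance to the column-wise
    median exceeds `threshold` times the median of those distances."""
    center = tuple(median(col) for col in zip(*points))
    dists = [sq_dist(p, center) ** 0.5 for p in points]
    scale = median(dists)
    inliers = [p for p, d in zip(points, dists) if d <= threshold * scale]
    outliers = [p for p, d in zip(points, dists) if d > threshold * scale]
    return inliers, outliers

# Synthetic data (an assumption): two genuine segments plus four extreme rows.
rng = random.Random(0)
bulk = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(500)]
bulk += [(rng.gauss(10, 1), rng.gauss(10, 1)) for _ in range(500)]
extremes = [(1000.0, 0.0), (0.0, 2500.0), (-4000.0, 0.0), (0.0, -7000.0)]
data = bulk + extremes

# k-means on everything: the extreme rows each take a centroid for themselves.
sizes_raw = cluster_sizes(kmeans(data, 5), 5)

# Put the outliers in their own group first, then cluster only the rest.
inliers, outliers = split_outliers(data)
sizes_clean = cluster_sizes(kmeans(inliers, 2), 2)

print("with outliers:   ", sizes_raw)
print("outliers removed:", sizes_clean, "+", len(outliers), "outliers")
```

With the extreme rows included, four of the five clusters hold a single row each and everything else lands in one cluster, which mirrors the result described in the question. Once the extreme rows are isolated, the remaining clusters can follow the actual structure of the bulk of the data.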
Thanks Matt,
I ran a simpler k-means on the data and got much more balanced segments - I still don't understand how this two-step clustering provides extra insight into the clusters and allows them to be explored after clustering, can you explain this?
Ben
Interactive clustering is a two-step process.
First you train a k-means model; then you can modify the clustering yourself, for example by merging two clusters together.
If you are interested in building your own grouping of the data, you can also check out the interactive decision tree builder:
https://www.dataiku.com/product/plugins/interactive-decision-tree-builder/
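As a rough illustration of the manual merging step described above, merging two clusters amounts to relabelling the rows of one cluster as the other. The `merge_clusters` function and the toy labels below are hypothetical and only sketch the idea; in the tool itself this is done through the interface rather than in code.

```python
from collections import Counter

def merge_clusters(labels, keep, absorb):
    """Relabel every row in cluster `absorb` as belonging to cluster `keep`."""
    return [keep if lab == absorb else lab for lab in labels]

# Toy cluster labels for seven rows (an assumption for illustration).
labels = [0, 0, 1, 2, 2, 2, 3]

# Manually decide that cluster 3 really belongs with cluster 2.
merged = merge_clusters(labels, keep=2, absorb=3)
sizes = Counter(merged)
print(merged, dict(sizes))
```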
Thanks again Matt, when you say "modify the clustering", this has to be done manually, right?
Apologies if this is a dumb question!
Regarding using interactive clustering: after reviewing several studies similar to my project, I did this using the number of rows as the pre-cluster number and used outlier detection as its own cluster. I have 6 clusters and a 7th that is outliers; however, it is very hard to understand what is going on in the summary page or the heat map. Do you have a dumbed-down version for me?