Interpreting cluster results

ben_p Neuron 2020, Registered, Neuron 2021, Neuron 2022, Dataiku Frontrunner Awards 2021 Participant Posts: 143 ✭✭✭✭✭✭✭

Hi all,

I am testing our clustering on a data table of customer records. As a starting point I tried Interactive Clustering on a dataset of 6.5 million records, with approx. 10 columns containing stats on each record's behaviour.

I am slightly puzzled by the results, shown below:

[Image: ben_p_0-1587111807372.png]

I have almost all my customers in a single cluster, with only a handful falling into other clusters - why would clustering yield such results and how might I go about making the clusters more useful?

Ben

Answers

  • Mattsco
    Mattsco Dataiker, Registered Posts: 125 Dataiker

    Hi Ben,

    In Interactive Clustering, we first run a k-means algorithm.

    K-means is sensitive to outliers and noise, so in your case you end up with almost all the observations in the same cluster and 4 clusters of outliers.

    To get better results, you can try the Outliers Detection option in the Design part: "Create a cluster with outliers".
    You'll then have only one cluster containing the outliers.
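
    To illustrate the effect described above, here is a minimal pure-Python sketch. It is not DSS code and makes several assumptions that are not from this thread: the data is synthetic (four groups of customers plus four extreme records), k-means is a plain Lloyd's loop with a deterministic farthest-first initialisation, and outliers are flagged with a simple median/MAD distance rule with an arbitrary threshold. How DSS actually detects outliers may well differ.

```python
import math
import random
import statistics


def dist(a, b):
    """Euclidean distance between two 2-D points."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


def kmeans(points, k, iters=50):
    """Plain Lloyd's k-means; farthest-first init is an illustrative assumption."""
    centers = [points[0]]
    while len(centers) < k:
        # Pick the point farthest from all centers chosen so far.
        centers.append(max(points, key=lambda p: min(dist(p, c) for c in centers)))
    labels = None
    for _ in range(iters):
        new_labels = [min(range(k), key=lambda j: dist(p, centers[j])) for p in points]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = (
                    sum(m[0] for m in members) / len(members),
                    sum(m[1] for m in members) / len(members),
                )
    return labels


def cluster_sizes(labels, k):
    """Cluster sizes, largest first."""
    return sorted((labels.count(j) for j in range(k)), reverse=True)


def flag_outliers(points, threshold=10.0):
    """Hypothetical rule: flag points whose distance from the coordinate-wise
    median is more than `threshold` MADs above the median distance."""
    centre = (
        statistics.median(p[0] for p in points),
        statistics.median(p[1] for p in points),
    )
    d = [dist(p, centre) for p in points]
    med = statistics.median(d)
    mad = statistics.median(abs(x - med) for x in d) or 1e-12
    return [(x - med) / mad > threshold for x in d]


# Synthetic data: 4 groups of 50 records plus 4 extreme records.
rng = random.Random(0)
blob_centers = [(0, 0), (20, 0), (0, 20), (20, 20)]
bulk = [(rng.gauss(cx, 1), rng.gauss(cy, 1)) for cx, cy in blob_centers for _ in range(50)]
outliers = [(100, 100), (-120, 90), (150, -130), (-90, -110)]
data = bulk + outliers

# k-means on the raw data: the extreme records capture most of the clusters.
sizes_raw = cluster_sizes(kmeans(data, 5), 5)

# Put the flagged records in their own "outliers" cluster, then cluster the rest.
is_outlier = flag_outliers(data)
inliers = [p for p, flag in zip(data, is_outlier) if not flag]
sizes_clean = cluster_sizes(kmeans(inliers, 4), 4)

print("raw k-means sizes:       ", sizes_raw)
print("outliers set aside:      ", sum(is_outlier))
print("k-means on inliers sizes:", sizes_clean)
```

    On this toy data, the first run should leave the bulk of the records in one cluster with each extreme record in a cluster of its own, which mirrors the result in the screenshot. Once the flagged records are set aside, the same algorithm has room to separate the real groups.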

  • ben_p
    ben_p Neuron 2020, Registered, Neuron 2021, Neuron 2022, Dataiku Frontrunner Awards 2021 Participant Posts: 143 ✭✭✭✭✭✭✭

    Thanks Matt,

    I ran a simpler k-means on the data and got much more balanced segments. I still don't understand how this two-step clustering provides extra insight into the clusters and allows them to be explored after clustering. Can you explain this?

    Ben

  • ben_p
    ben_p Neuron 2020, Registered, Neuron 2021, Neuron 2022, Dataiku Frontrunner Awards 2021 Participant Posts: 143 ✭✭✭✭✭✭✭

    Thanks again Matt, when you say "modify the clustering", this has to be done manually, right?

    Apologies if this is a dumb question!

  • cwentz
    cwentz Dataiku DSS Core Concepts, Registered Posts: 33 ✭✭✭✭

    Hello @ben_p and @Mattsco,

    Regarding interactive clustering: after reviewing several studies similar to my project, I did this using the number of rows as the pre-cluster number and used outlier detection as its own cluster. I have 6 clusters and a 7th that is outliers; however, it is very hard to understand what is going on in the summary page or the heat map. Do you have a dumbed-down version for me?
