BUG: Clustering with outlier detection enabled

tjh Registered Posts: 20 ✭✭✭✭

Under Models > Settings > Outlier detection, selecting the options "drop outliers" and "create a cluster with all outliers" with the default Mini-cluster size threshold = 100 leads to the following error when predicting with (applying) the model on a small dataset of 23 rows x 10 columns (only the relevant part of the rather lengthy error log is shown):

[17:53:53] [INFO] [dku.utils] - 2017-10-05 17:53:53,569 INFO ********* Pipieline state (After outliers)

[17:53:53] [INFO] [dku.utils] - 2017-10-05 17:53:53,569 INFO input_df= (23, 18)

[17:53:53] [INFO] [dku.utils] - 2017-10-05 17:53:53,569 INFO current_mf=(0, 17)
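
The log shows 23 input rows and 0 rows left after the outlier step. Below is a minimal sketch of how that can happen. It is not Dataiku's actual code: it assumes that every cluster with fewer rows than the Mini-cluster size threshold is treated as outliers and dropped. The function name `keep_mask` and the cluster labels are made up for illustration.

```python
from collections import Counter


def keep_mask(labels, min_cluster_size=100):
    """Return one boolean per row: True if the row is NOT treated as an outlier.

    Assumption (hypothetical, not taken from Dataiku): a row is an outlier
    when its cluster has fewer than `min_cluster_size` members.
    """
    sizes = Counter(labels)
    return [sizes[label] >= min_cluster_size for label in labels]


# 23 rows spread over three illustrative clusters (sizes 10, 8 and 5).
labels = [0] * 10 + [1] * 8 + [2] * 5

mask = keep_mask(labels)   # default threshold of 100
kept_rows = sum(mask)      # no cluster can reach 100 members with only 23 rows
print(kept_rows)
```

With only 23 rows no cluster can reach 100 members, so under this assumption every row is dropped, which would match the empty `current_mf=(0, 17)` in the log.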

Proposed solution:

1) Do not apply the outlier detection step during prediction. The data to predict on can be small, for example mini-batches, so detecting outliers makes little sense here. The model should simply compute the cluster for each instance and nothing more.

2) Set the default Mini-cluster size threshold to 0, since this case is already covered by the 1% threshold.

3) Make the error message clearer. It took a colleague and me more than 3 hours to pinpoint the issue. Even Dataiku support could not properly interpret the error log.
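
Proposals 1 and 3 could look roughly like the sketch below. It assumes a centroid-based model such as k-means; all names (`nearest_cluster`, `predict`, `check_rows_left`) and the sample data are hypothetical and are not part of the Dataiku API.

```python
import math


def nearest_cluster(row, centroids):
    """Index of the centroid closest to `row` (Euclidean distance)."""
    distances = [math.dist(row, c) for c in centroids]
    return distances.index(min(distances))


def predict(rows, centroids):
    # Proposal 1: no outlier filtering at prediction time,
    # every row simply gets the cluster of its nearest centroid.
    return [nearest_cluster(row, centroids) for row in rows]


def check_rows_left(n_rows_before, n_rows_after, threshold):
    # Proposal 3: fail early with a message that names the setting involved.
    if n_rows_before > 0 and n_rows_after == 0:
        raise ValueError(
            f"Outlier detection removed all {n_rows_before} rows "
            f"(Mini-cluster size threshold = {threshold}). "
            "Lower the threshold or disable outlier detection."
        )


# Illustrative data: two centroids and three rows to score.
centroids = [(0.0, 0.0), (10.0, 10.0)]
rows = [(1.0, 0.5), (9.0, 11.0), (0.2, 0.1)]
assignments = predict(rows, centroids)
print(assignments)
```

A check like `check_rows_left(23, 0, 100)` would then raise an error that points directly at the threshold, instead of leaving an empty matrix to fail further down the pipeline.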
