How to cluster geopoints according to their pairwise distances
Ekaterina
Registered Posts: 1 ✭✭✭✭
I have a table with 5 columns: point_id (Integer); postal_code (Integer); latitude (Decimal); longitude (Decimal); geoLatLon (GeoPoint)
1. I would like to calculate the pairwise distances between these points according to my formula (par exemple, haversine distance)
2. Group these points into given number of clusters (for exemple, 2 clusters) with minimum 2 points at each cluster so that the total distance between the points of one cluster is less as possible.
How could I implement these steps? I was planning to use quick clustering model with Kmeans algorithm , but I have not found any tutorial on how to calculate pairwise distances between all points and pass these distances to the model.
Tagged:
Answers

Hi,
Computing all pairwise distances between points require to compute a crossjoin product. It can be very expensive as it involves N_rows ^ 2 computations.
The Kmeans algorithm uses L2distance, which can be seen as a local approximation of Haversine. Unless your points are very far from each other (different time zones for instance) you can safely use this approximation. Hence, I would advise going for a simple Kmeans first, as you did.
If you want to go further and your data has a reasonable number of rows, you could compute a distance matrix using a Join recipe to CrossJoin the dataset with itself, then use the "Postjoin computed columns" section with the geoDistance formula. Then use a custom clustering function which takes a distance matrix argument, such as DBSCAN (https://scikitlearn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html). Beware that the computational cost will be much higher than the simple approach.
Hope it helps,
Alex 
Joanna Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1 ✭✭✭✭Hi,
I cannot use the geoDistance function in the postjoin computed columns. My DSS formula is:
geoDistance(GeoPoint1, GeoPoint2)
I get the following error:
ERROR Invalid computed column: Operator cannot be translated to SQL: geoDistance
I also tried selecting SQL expression but it does not work. Where can I find information about this function? What is the problem with my formula?
Thank you very much in advance,
Joanna 
Hi, You need to switch the recipe engine from SQL (indatabase) to DSS.