Survey banner
Switching to Dataiku - a new area to help users who are transitioning from other tools and diving into Dataiku!

How to cluster geo-points according to their pairwise distances

Level 1
How to cluster geo-points according to their pairwise distances

I have a table with 5 columns: point_id (Integer); postal_code (Integer); latitude (Decimal); longitude (Decimal); geoLatLon (GeoPoint)

1. I would like to calculate the pairwise distances between these points according to my formula (par exemple, haversine distance)

2. Group these points into given number of clusters (for exemple, 2 clusters) with minimum 2 points at each cluster  so that the total distance between the points of one cluster is less as possible.

How could I implement these steps? I was planning to use quick clustering model with K-means algorithm , but I have not found any tutorial on how to calculate pairwise distances between all points and pass these distances to the model.

3 Replies
Dataiker Alumni
Hi,

Computing all pairwise distances between points require to compute a cross-join product. It can be very expensive as it involves N_rows ^ 2 computations.

The K-means algorithm uses L2-distance, which can be seen as a local approximation of Haversine. Unless your points are very far from each other (different time zones for instance) you can safely use this approximation. Hence, I would advise going for a simple K-means first, as you did.

If you want to go further and your data has a reasonable number of rows, you could compute a distance matrix using a Join recipe to Cross-Join the dataset with itself, then use the "Post-join computed columns" section with the geoDistance formula. Then use a custom clustering function which takes a distance matrix argument, such as DBSCAN (https://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html). Beware that the computational cost will be much higher than the simple approach.

Hope it helps,

Alex
Level 1
Hi,

I cannot use the geoDistance function in the post-join computed columns. My DSS formula is:
geoDistance(GeoPoint1, GeoPoint2)
I get the following error:
ERROR Invalid computed column: Operator cannot be translated to SQL: geoDistance

I also tried selecting SQL expression but it does not work. Where can I find information about this function? What is the problem with my formula?

Thank you very much in advance,

Joanna
Dataiker Alumni
Hi, You need to switch the recipe engine from SQL (in-database) to DSS.

Labels (2)