## Sign up to take part

Registered users can ask their own questions, contribute to discussions, and be part of the Community!

This website uses cookies. By clicking OK, you consent to the use of cookies. Read our cookie policy.

Turn on suggestions

Auto-suggest helps you quickly narrow down your search results by suggesting possible matches as you type.

Showing results for

Registered users can ask their own questions, contribute to discussions, and be part of the Community!

- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Mute
- Printer Friendly Page

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

How to cluster geo-points according to their pairwise distances

I have a table with 5 columns: point_id (Integer); postal_code (Integer); latitude (Decimal); longitude (Decimal); geoLatLon (GeoPoint)

1. I would like to calculate the pairwise distances between these points according to my formula (par exemple, haversine distance)

2. Group these points into given number of clusters (for exemple, 2 clusters) with minimum 2 points at each cluster so that the total distance between the points of one cluster is less as possible.

How could I implement these steps? I was planning to use quick clustering model with K-means algorithm , but I have not found any tutorial on how to calculate pairwise distances between all points and pass these distances to the model.

Solutions shown first - Read whole discussion

3 Replies

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

Computing all pairwise distances between points require to compute a cross-join product. It can be very expensive as it involves N_rows ^ 2 computations.

The K-means algorithm uses L2-distance, which can be seen as a local approximation of Haversine. Unless your points are very far from each other (different time zones for instance) you can safely use this approximation. Hence, I would advise going for a simple K-means first, as you did.

If you want to go further and your data has a reasonable number of rows, you could compute a distance matrix using a Join recipe to Cross-Join the dataset with itself, then use the "Post-join computed columns" section with the geoDistance formula. Then use a custom clustering function which takes a distance matrix argument, such as DBSCAN (https://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html). Beware that the computational cost will be much higher than the simple approach.

Hope it helps,

Alex

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

I cannot use the geoDistance function in the post-join computed columns. My DSS formula is:

geoDistance(GeoPoint1, GeoPoint2)

I get the following error:

ERROR Invalid computed column: Operator cannot be translated to SQL: geoDistance

I also tried selecting SQL expression but it does not work. Where can I find information about this function? What is the problem with my formula?

Thank you very much in advance,

Joanna

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

Hi, You need to switch the recipe engine from SQL (in-database) to DSS.