This website uses cookies. By browsing this website, you consent to the use of cookies. Learn more.

Turn on suggestions

Auto-suggest helps you quickly narrow down your search results by suggesting possible matches as you type.

Showing results forย

- Community
- ยป
- Discussions
- ยป
- Using Dataiku DSS
- ยป

- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Mute
- Printer Friendly Page

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

How to cluster geo-points according to their pairwise distances

I have a table with 5 columns: point_id (Integer); postal_code (Integer); latitude (Decimal); longitude (Decimal); geoLatLon (GeoPoint)

1. I would like to calculate the pairwise distances between these points according to my formula (par exemple, haversine distance)

2. Group these points into given number of clusters (for exemple, 2 clusters) with minimum 2 points at each cluster so that the total distance between the points of one cluster is less as possible.

How could I implement these steps? I was planning to use quick clustering model with K-means algorithm , but I have not found any tutorial on how to calculate pairwise distances between all points and pass these distances to the model.

3 Replies

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

Computing all pairwise distances between points require to compute a cross-join product. It can be very expensive as it involves N_rows ^ 2 computations.

The K-means algorithm uses L2-distance, which can be seen as a local approximation of Haversine. Unless your points are very far from each other (different time zones for instance) you can safely use this approximation. Hence, I would advise going for a simple K-means first, as you did.

If you want to go further and your data has a reasonable number of rows, you could compute a distance matrix using a Join recipe to Cross-Join the dataset with itself, then use the "Post-join computed columns" section with the geoDistance formula. Then use a custom clustering function which takes a distance matrix argument, such as DBSCAN (https://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html). Beware that the computational cost will be much higher than the simple approach.

Hope it helps,

Alex

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

I cannot use the geoDistance function in the post-join computed columns. My DSS formula is:

geoDistance(GeoPoint1, GeoPoint2)

I get the following error:

ERROR Invalid computed column: Operator cannot be translated to SQL: geoDistance

I also tried selecting SQL expression but it does not work. Where can I find information about this function? What is the problem with my formula?

Thank you very much in advance,

Joanna

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

Hi, You need to switch the recipe engine from SQL (in-database) to DSS.