Community Conundrum 25: Feature Visualization is now live! Read More

How to standardize text fields using fuzzy values clustering

How to standardize text fields using fuzzy values clustering

When working with large amounts of disparate, user-entered text data, we often need to standardize or collapse entries into a resolved form. 

For example, how can we get a computer to recognize that strings like "Abraham Lincoln", "Abe Lincoln", and "Abrahm Lincoln" are actually the same category? We want to map these close variations into a smaller number of clean labels.

In the image below, the "emp_title" column has thousands of distinct values and a long tail of unique values (those that appear only once).

 

sshot1.png

 

Text cleaning strategies like normalization that standardize on case and remove punctuation may help in this regard, but they don't get us all of the way there. Similarly, if it is user-entered data, you might not be able to introduce a rule-based strategy using regular expressions.

Fuzzy Matching

The answer lies in a technique called fuzzy matching.

When viewing a dataset in the Explore tab, select Analyze from the context menu of the column header to review the number of distinct and unique values and their distribution.

sshot2.png

 

If you switch over from the Categorial tab to the Values Clustering tab, you have the option to search for clusters.

  • You can choose a clustering strategy of "Fuzzy" or "Highly fuzzy" depending on how close you require values in the cluster to be. In this case, we'll choose "Fuzzy".
  • You can also compare words or characters. For this use case, we'll compare words.

If you repeat these steps in a Prepare recipe, you can take action on these clusters.

You can adjust the final, standardized label if needed. Then, you can click "Merge Selected" to perform a merge on all or a selection of clusters so that the original values are mapped to the new cluster label.

sshot3.png

To take the first example, all of these 5 variations will now have the same label-- "Dept of Homeland Security".

This merge action creates a fully-transparent and editable Find and Replace step in the Prepare recipe with hundreds - or even thousands - of individual replacements.

Super simple and a huge time saver! 

What's next?

For more tips on the Prepare recipe, be sure to check out free Academy courses like Basics 102 or the Visual Recipes Overview.

Labels (3)
Version history
Revision #:
9 of 9
Last update:
‎08-27-2020 07:03 PM
Updated by: