Fuzzy Join: When to use Relative to the Left vs Right Tables.

tgb417
I'm starting to work with Fuzzy Joins and having good luck so far.

However, I'm trying to figure out when I would want a Relative Threshold to be relative to the Right table versus the Left table when doing an overall Left Join to find duplicate records.

I understand that the proportion of characters that need to match will differ depending on the difference in length between the left and right table values.

My question is: why might one be better than the other when I don't necessarily know the length of the strings in my left and right tables?

My use case is a self join (the table joined to itself as both the left and right table). I've got text strings that vary from just a few characters to a few thousand characters, so each of these strings will appear in both the left and right tables at some point.

I think I understand why relative thresholds are good for me: if the left and right values are both short, only a few substitutions are allowed, while for longer values more characters can differ before the items stop being considered a match.

But suppose, for example, I have a short string and a long string:

This is a short string.
And this is a short string made longer.

Let's say the relative threshold is 50%.
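To make this example concrete, here is a minimal Python sketch of how a relative threshold could play out for this pair. It is an illustration under stated assumptions, not Dataiku's actual implementation: it assumes a plain, case-sensitive Levenshtein distance, and it assumes that "relative to left/right" means the maximum allowed distance is the threshold multiplied by the length of the left or right value. The helper names `levenshtein` and `matches` are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Plain Levenshtein edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def matches(left: str, right: str, threshold: float, relative_to: str) -> bool:
    """Assumed rule: allowed distance = threshold * len(reference value)."""
    reference = left if relative_to == "left" else right
    return levenshtein(left, right) <= threshold * len(reference)


short = "This is a short string."
long_ = "And this is a short string made longer."

distance = levenshtein(short, long_)
print("lengths:", len(short), len(long_), "distance:", distance)

# Short string as the left value, long string as the right value.
print("relative to left: ", matches(short, long_, 0.5, "left"))
print("relative to right:", matches(short, long_, 0.5, "right"))
```

Under these assumptions, the same pair can fall on either side of a 50% threshold depending on which value's length is used as the reference: the longer string allows a larger absolute number of edits than the shorter one.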

Why would I use relative to left vs. relative to right in a deduplication use case?
Operating system used: macOS Sonoma 14.4.1
--Tom