Fuzzy Join: When to Use a Threshold Relative to the Left vs. the Right Table
tgb417
I'm starting to work with Fuzzy Joins and having good luck so far.
However, I'm trying to figure out when I might want a relative threshold to be relative to the right table or to the left table when doing an overall left join to find duplicate records.
I understand that the number of characters that need to match will differ depending on the difference in length between the left and right data elements.
But my question is: why might one be better than the other when I don't necessarily know the length of the strings in my left and right tables?
My use case is a self join (the table joined to itself as both the left and right table). I've got text strings that can vary from just a few characters to a few thousand characters, so any given string will appear in both the left and right tables at some point.
I think I understand why relative thresholds are good for me: if the left and right values are both short, only a few substitutions are allowed, while for longer data elements more characters can differ before the items are no longer considered a match.
But suppose, for example, that I have a short string and a long string, say:
"This is a short string." and "And this is a short string made longer."
Let's say that the relative threshold is 50%.
Why would I use relative to left vs. relative to right in a deduplication use case?
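To make the example concrete, here is a minimal Python sketch of how the two settings could differ for this pair of strings. It rests on assumptions that are not confirmed by the Dataiku documentation quoted here: that the distance is a plain, case-sensitive Levenshtein edit distance, and that a relative threshold means "maximum allowed distance = threshold × length of the chosen side's string". The names `levenshtein` and `matches` are hypothetical helpers for illustration, not Dataiku APIs.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def matches(left: str, right: str, ratio: float, relative_to: str) -> bool:
    """Assumed rule: allowed distance = ratio * length of the chosen side."""
    reference = left if relative_to == "left" else right
    return levenshtein(left, right) <= ratio * len(reference)


short = "This is a short string."                   # 23 characters
long_ = "And this is a short string made longer."   # 39 characters

distance = levenshtein(short, long_)                # 17 edits
short_left_rel_left = matches(short, long_, 0.5, "left")    # limit 11.5
short_left_rel_right = matches(short, long_, 0.5, "right")  # limit 19.5
long_left_rel_left = matches(long_, short, 0.5, "left")     # limit 19.5
long_left_rel_right = matches(long_, short, 0.5, "right")   # limit 11.5
```

Under these assumptions the pair matches only when the threshold is measured against the longer string. In a self join each pair is compared in both orders, so the two settings would find the same pairs of records but report them on different left-side rows; whether the actual recipe behaves this way would need to be checked against the product.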
Operating system used: macOS Sonoma 14.4.1