Entity Resolution / Record Linkage Plug-In

tgb417
tgb417 Dataiku DSS Core Designer, Dataiku DSS & SQL, Dataiku DSS ML Practitioner, Dataiku DSS Core Concepts, Neuron 2020, Neuron, Registered, Dataiku Frontrunner Awards 2021 Finalist, Neuron 2021, Neuron 2022, Frontrunner 2022 Finalist, Frontrunner 2022 Winner, Dataiku Frontrunner Awards 2021 Participant, Frontrunner 2022 Participant, Neuron 2023 Posts: 1,598 Neuron

User Story:

As an analyst who deals with messy CRM data containing a significant population of duplicates (record clusters) spread across multiple incomplete records, I would love to have a Record Linkage / Dedupe plugin for DSS that makes this process accessible to a broader set of analysts. A number of Python packages already do this kind of work. When the process is easier and more complete, and we can find more of the records that belong to the same cluster, we will get more accurate analyses and models.

COS (Conditions of Satisfaction)

  • Coding/algorithm
    • The plugin should be based on a generally available package or packages that are under active maintenance.
  • Results
    • The process returns one record for each originally supplied record.
    • All original record columns are returned with unchanged values.
      • Any normalization performed for the record linkage should NOT be applied to the returned results by default.
    • The results add a new key for each cluster of records.
    • The results add a probability score indicating how likely the record is to belong to the identified cluster.
  • Configuration
    • Ability to choose a subset of columns on which to do the record linkage
    • For each column to be used as part of the record linkage the analyst needs to be able to choose the type of data in the column.
      • Particularly when a column may
  • Models would be stored for re-use
  • There are needs for 2 plugin Elements that can share a common m
    • There would be a training element
    • There would be a processing element
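
To make the Results and Configuration points above concrete, here is a minimal sketch of the output contract in plain Python. It is only an illustration, not the proposed plugin: it swaps in a toy string-similarity comparison from the standard library (`difflib`) in place of a trained record linkage model, and the names `link_records`, `cluster_id`, `cluster_score`, and the `threshold` value are all hypothetical. The score here is a similarity ratio, not a true probability.

```python
from difflib import SequenceMatcher


def normalize(value):
    """Lowercase and collapse whitespace. Used for matching only."""
    return " ".join(str(value or "").lower().split())


def similarity(a, b, columns):
    """Mean string similarity (0.0 to 1.0) over the chosen columns."""
    scores = [
        SequenceMatcher(None, normalize(a.get(c)), normalize(b.get(c))).ratio()
        for c in columns
    ]
    return sum(scores) / len(scores)


def link_records(records, columns, threshold=0.85):
    """Return one row per input row, adding cluster_id and cluster_score.

    Toy approach (an assumption, not a real linkage model): every pair of
    records is compared, and pairs scoring at or above the threshold are
    merged into one cluster with a union-find structure.
    """
    parent = list(range(len(records)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    best = [None] * len(records)  # best matching score seen per record
    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            score = similarity(records[i], records[j], columns)
            if score >= threshold:
                parent[find(j)] = find(i)
                best[i] = score if best[i] is None else max(best[i], score)
                best[j] = score if best[j] is None else max(best[j], score)

    output = []
    for i, record in enumerate(records):
        row = dict(record)  # original values are copied, never normalized
        row["cluster_id"] = find(i)
        # A record with no match trivially belongs to its own cluster.
        row["cluster_score"] = 1.0 if best[i] is None else round(best[i], 3)
        output.append(row)
    return output


# Example usage with made-up CRM rows; only "name" and "email" are matched on.
crm_rows = [
    {"name": "Tom  Brown", "email": "tom.brown@example.com", "city": "Boston"},
    {"name": "tom brown", "email": "Tom.Brown@example.com", "city": ""},
    {"name": "Alice Smith", "email": "alice@example.org", "city": "Denver"},
]
rows = link_records(crm_rows, columns=["name", "email"])
```

The all-pairs comparison is quadratic in the number of records, so a real plugin would need blocking or indexing, which the maintained Python packages provide.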

Nice to Have:

  • The tool can use external data stores like PostgreSQL, or even big data tools like Spark or Snowflake.
  • Configuration:
    • An option to keep the record normalization applied by the plugin.
    • An option to choose which of the original data columns to retain in the output set.
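
The two configuration options above could be handled when each output row is assembled. The sketch below is one possible shape, assuming hypothetical names throughout (`build_output_row`, `keep_normalized`, `retain_columns`, and the `cluster_id` / `cluster_score` fields); none of these come from an existing plugin or library.

```python
def build_output_row(original, normalized, linkage,
                     keep_normalized=False, retain_columns=None):
    """Assemble one output row from an input record.

    original       -- the record as supplied, values untouched
    normalized     -- normalized values for the columns used in matching
    linkage        -- fields added by the linkage step (hypothetical names)
    keep_normalized -- if True, normalized values replace the originals
    retain_columns -- original columns to keep; None keeps all of them
    """
    # By default the original values win; normalized values are opt-in.
    source = {**original, **normalized} if keep_normalized else original
    columns = list(original) if retain_columns is None else retain_columns
    row = {column: source[column] for column in columns}
    row.update(linkage)
    return row


# Example usage with made-up values.
original = {"name": "Tom  Brown", "city": "Boston", "phone": "555-0100"}
normalized = {"name": "tom brown"}
linkage = {"cluster_id": 7, "cluster_score": 0.93}

default_row = build_output_row(original, normalized, linkage)
trimmed_row = build_output_row(
    original, normalized, linkage,
    keep_normalized=True, retain_columns=["name", "city"],
)
```

Keeping the defaults as `keep_normalized=False` and `retain_columns=None` means the required behavior (all original columns, unchanged values) is what an analyst gets without configuring anything.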

Notes

2 votes

In the Backlog

Comments

  • tgb417

    If Dataiku is not interested in taking this on as a company plug-in, I'd be interested in working with a team to create it on a community basis. I'd likely use pandas-dedupe as the core library for the plugin.

    Please reach out if you are interested in collaborating.

    --Tom

  • JCR
    JCR Dataiker, Dataiku DSS Core Designer, Product Ideas Manager Posts: 18 Dataiker

    @tgb417 that's a great idea. It would be awesome to have more community-based extensions. We just started a program to enable and streamline this, and @StephenWagner can help.

  • StephenWagner
    StephenWagner Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered, Product Ideas Manager Posts: 6 Dataiker

    Great idea, @tgb417.

    We've created the Dataiku Integration Development Guide to provide the community guidance in creating integrations and plugins - please take a look.

    I'd recommend creating a post in the Plugins & Extending Dataiku section of the Community to try and pull a development team together.

    - Steve, TPM - Ecosystem

  • tgb417

    @StephenWagner,

    OK, I've put a post up on the Plugins & Extending Dataiku board.

    https://community.dataiku.com/t5/Plugins-Extending-Dataiku/Entity-Resolution-Record-Linkage-Plug-In/td-p/19880

    Are you aware of anyone else interested in this specific topic, whether customers, partners, or staff?

    I'd love to reach out to folks to gauge interest.

    --Tom

  • StephenWagner

    @tgb417 Unfortunately, I'm not aware of anyone.

  • tgb417

    @simonmd

    Welcome to the Dataiku Community. I'd love to hear more about your interest in record linkage.

    --Tom

  • tgb417

    There is a bit more discussion over here about the need for record linkage.

    https://community.dataiku.com/t5/Using-Dataiku/Name-Normalisation/m-p/33191#M12258
