As an analyst who deals with messy CRM data, where a significant share of the population exists as duplicates (record clusters) spread across multiple incomplete records, it would be lovely if there were a Record Linkage / Dedupe plugin available for DSS that made this process accessible to a broader set of analysts. There are a number of packages in the Python ecosystem that do this kind of work. When this process is easier and more complete, and we can find more of the records that belong to the same cluster, we will get more accurate analyses and models.
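For anyone unfamiliar with record linkage, here is a minimal sketch of the idea, using only the Python standard library. It is not how pandas-dedupe or any other package works internally; the field names, the sample records, the `similarity` and `cluster_records` helpers, and the 0.85 threshold are all assumptions made up for illustration. Records are compared pairwise on the fields they both have filled in, and any pair that scores above the threshold is merged into the same cluster.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a, b):
    """Average string similarity over the fields both records have filled in."""
    scores = []
    for field in a.keys() & b.keys():
        if a[field] and b[field]:
            ratio = SequenceMatcher(None, a[field].lower(), b[field].lower()).ratio()
            scores.append(ratio)
    return sum(scores) / len(scores) if scores else 0.0


def cluster_records(records, threshold=0.85):
    """Return one cluster label per record, linking pairs scoring >= threshold."""
    parent = list(range(len(records)))  # union-find forest, one node per record

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Naive all-pairs comparison: fine for a demo, too slow for a real CRM table.
    for i, j in combinations(range(len(records)), 2):
        if similarity(records[i], records[j]) >= threshold:
            ri, rj = find(i), find(j)
            if ri != rj:
                parent[max(ri, rj)] = min(ri, rj)

    return [find(i) for i in range(len(records))]


# Hypothetical, hand-made CRM rows with missing values.
records = [
    {"name": "Jon Smith", "email": "jon.smith@example.com", "phone": ""},
    {"name": "John Smith", "email": "", "phone": "555-0100"},
    {"name": "John Smith", "email": "jon.smith@example.com", "phone": "555-0100"},
    {"name": "Maria Garcia", "email": "mgarcia@example.com", "phone": ""},
]

labels = cluster_records(records)
print(labels)
```

Records that share a label are treated as the same person. A real plugin would need blocking to avoid comparing every pair, plus a learned or tuned matching model, which is the part the existing Python packages provide.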
Nice to Have:
If Dataiku is not interested in taking this on as a company plugin, I'd be interested in working with a team to create it on a community basis. I'd likely use pandas-dedupe as the core library for the plugin.
Please reach out if you are interested in collaborating.
@tgb417 that's a great idea. Would be awesome to have more community-based extensions. We just started a program to enable and streamline this, @StephenWagner can help.
Great idea, @tgb417!
We've created the Dataiku Integration Development Guide to give the community guidance on creating integrations and plugins. Please take a look.
I'd recommend creating a post in the Plugins & Extending Dataiku section of the Community to try and pull a development team together.
- Steve, TPM - Ecosystem
OK, I've got a post out on the Plugins & Extending Dataiku board.
Are you aware of anyone else interested in this specific topic, whether customers, partners, or staff?
I'd love to reach out to folks to gauge interest.
There is a bit more discussion over here about the need for record linkage.