Is anyone interested in working on a team to create an Entity Resolution / Record Linkage plugin for use inside Dataiku DSS? The vision would be to make this open source. I'm particularly interested because I'm working with the Cascade Bicycle Club, which could use a plugin like this to better manage its constituents. I know of several other non-profits that could also benefit from such a tool.
If you are interested in chipping in on this open-source project, please reach out.
As a non-profit analyst, I deal with messy CRM data in which a significant share of constituents are duplicated, each one spread across a cluster of incomplete records. It would be lovely to have a Record Linkage / Dedupe plugin for DSS that makes this process accessible to a broader set of analysts. A number of Python packages already do this kind of work. If the process were easier and more complete, we could find more of the records that belong to the same cluster and get more accurate analyses and models.
- The plugin should be based on a generally available package or packages that are under active maintenance.
- The process returns one record for each originally supplied record.
- All original record columns are returned with unchanged values.
- Any normalization performed for the record linkage should NOT be applied to the returned results by default.
- The results add a new key identifying each cluster of records.
- The results add a probability score indicating how likely the record is to belong to the identified cluster.
- Ability to choose a subset of columns on which to do the record linkage
- For each column to be used as part of the record linkage the analyst needs to be able to choose the type of data in the column.
- Particularly when a column may
- Models would be stored for re-use
- Two plugin elements are needed that can share a common model.
- There would be a training element.
- There would be a processing element.
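As a rough illustration of the output contract described in the list above, here is a minimal sketch in plain Python. It is not a real record-linkage model: the standard library's `difflib.SequenceMatcher` stands in for a trained model, the score is a plain string similarity rather than a true probability, and the names `link_records`, `cluster_id`, `cluster_score` and the `threshold` value are all assumptions made up for this example.

```python
from difflib import SequenceMatcher
from itertools import combinations


def normalize(value):
    """Lowercase and collapse whitespace. Used for comparison only."""
    return " ".join(str(value or "").lower().split())


def similarity(rec_a, rec_b, columns):
    """Mean string similarity (0.0 to 1.0) over the chosen columns."""
    scores = [
        SequenceMatcher(None, normalize(rec_a.get(c)), normalize(rec_b.get(c))).ratio()
        for c in columns
    ]
    return sum(scores) / len(scores)


def link_records(records, columns, threshold=0.85):
    """Return one output row per input row, adding cluster_id and cluster_score."""
    parent = list(range(len(records)))  # union-find forest, one node per record

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Best similarity each record has to any record it was linked with.
    best = [0.0] * len(records)
    for i, j in combinations(range(len(records)), 2):
        score = similarity(records[i], records[j], columns)
        if score >= threshold:
            parent[find(j)] = find(i)
            best[i] = max(best[i], score)
            best[j] = max(best[j], score)

    output = []
    for i, rec in enumerate(records):
        row = dict(rec)  # copy: original columns keep their original values
        row["cluster_id"] = find(i)
        # Assumption: a record linked to nothing is scored 1.0 for its own cluster.
        row["cluster_score"] = round(best[i], 3) if best[i] > 0 else 1.0
        output.append(row)
    return output


# Hypothetical sample data, not taken from any real CRM.
records = [
    {"id": 1, "name": "Jane  Doe", "email": "JANE@EXAMPLE.ORG"},
    {"id": 2, "name": "jane doe", "email": "jane@example.org"},
    {"id": 3, "name": "Robert Smith", "email": "rsmith@example.com"},
]
linked = link_records(records, columns=["name", "email"])
for row in linked:
    print(row)
```

The all-pairs comparison is quadratic in the number of records, so a real plugin would need blocking or indexing from one of the existing packages. The point of the sketch is only the shape of the result: same row count, untouched original values, and two added columns.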
Nice to Have:
- The tool can use external data stores like PostgreSQL, or even big-data tools like Spark or Snowflake.
- An option to keep the record normalization applied by the plugin.
- An option to choose which of the original data columns to retain in the output set.
There are several Python packages out there that may be promising candidates as the basis of this plugin. I've found the following: