Core Designer or Core ML Learning Paths data model: duplicate keys (customerID) in customers.csv
Looking at customers_orders_joined in the Machine Learning Basics or Basics 103 flows, it appears that 63 customerIDs are duplicated, i.e. each appears twice (see attached: customers_orders_joined_dup.xlsx), with different customer profile columns but the same purchasing behavior columns.
Looking upstream in the flow, customers.csv has 75 customerIDs that are duplicated, i.e. each appears twice (see attached: customers_duplicate_id.xlsx), with different customer data.
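For anyone who wants to reproduce the check outside Dataiku, here is a minimal pandas sketch (assuming a local export of customers.csv with a customerID column; the path is illustrative):

```python
import pandas as pd

# Assumed local export of the dataset; path and column name are illustrative.
customers = pd.read_csv("customers.csv")

# keep=False flags every row whose customerID appears more than once.
dup_mask = customers["customerID"].duplicated(keep=False)
dup_customers = customers[dup_mask].sort_values("customerID")

print(dup_customers["customerID"].nunique(), "customerIDs appear more than once")
```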
In the Join recipe, when orders_by_customer and customers are joined on customerID, each duplicate customerID generates two records in customers_orders_joined with different customer profile information.
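To illustrate the mechanics with made-up values (not the actual project data), a duplicated key on the customers side fans out each matching order row:

```python
import pandas as pd

orders_by_customer = pd.DataFrame({"customerID": ["C1"], "total_orders": [5]})
customers = pd.DataFrame({
    "customerID": ["C1", "C1"],   # duplicated key
    "city": ["Paris", "Lyon"],    # conflicting profile data
})

# One order row joined to two customer rows yields two output rows:
# identical purchasing columns, different profile columns.
joined = orders_by_customer.merge(customers, on="customerID", how="left")
print(joined)
```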
That impacts 63 of the 10,700 records of customers_orders_joined, i.e. 0.6% of records, and 0.8% of revenue. It might have minor effects on downstream models.
But it is confusing in terms of the data model, and it generates duplicate key errors when loading the customer data into a relational database with a PRIMARY KEY on customerID.
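For example, here is a minimal sqlite3 sketch of the failure mode (the schema is illustrative, not our actual target database):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customerID TEXT PRIMARY KEY, city TEXT)")
conn.execute("INSERT INTO customers VALUES ('C1', 'Paris')")

try:
    # The second row with the same key is rejected, as it would be
    # during any bulk load of the duplicated data.
    conn.execute("INSERT INTO customers VALUES ('C1', 'Lyon')")
except sqlite3.IntegrityError as exc:
    print("duplicate key rejected:", exc)  # UNIQUE constraint failed
```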
Is there anything I am missing or misunderstanding here?
Thanks for your feedback
François
Best Answer
taraku (Dataiker)
Hi @Francois_lort! Welcome to the Dataiku Community!
I can confirm your findings. This fictitious dataset contains some duplicate customerIDs, yet the rows are not true duplicates. This can be taken into account when performing computations. Using your project, you could experiment with managing these duplicates, including removing them.
One way to do this is to use the post-filter step of the Join recipe. This is where you can tell Dataiku how to handle duplicates (based on the column used to join the datasets).
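Outside the recipe UI, the equivalent post-join deduplication in pandas would look roughly like this (hypothetical data; keep="first" is an arbitrary rule, so pick whichever profile row you trust):

```python
import pandas as pd

orders_by_customer = pd.DataFrame({"customerID": ["C1", "C2"],
                                   "total_orders": [5, 3]})
customers = pd.DataFrame({"customerID": ["C1", "C1", "C2"],
                          "city": ["Paris", "Lyon", "Nice"]})

joined = orders_by_customer.merge(customers, on="customerID", how="left")

# Post-join dedup: keep exactly one row per customerID.
deduped = joined.drop_duplicates(subset="customerID", keep="first")
print(deduped)
```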
Removing duplicates can also be performed using the Distinct recipe. The Distinct recipe identifies and removes duplicate rows within a dataset and tracks which rows had duplicates, and how many, in the original dataset.
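A rough pandas analogue of that behavior (illustrative data; the actual recipe compares whichever columns you select):

```python
import pandas as pd

customers = pd.DataFrame({"customerID": ["C1", "C1", "C2"],
                          "city": ["Paris", "Lyon", "Nice"]})

# Count occurrences per key before deduplicating, mirroring how the
# Distinct recipe reports which rows had duplicates and how many.
counts = customers["customerID"].value_counts()
distinct = customers.drop_duplicates(subset="customerID").copy()
distinct["occurrences"] = distinct["customerID"].map(counts)
print(distinct)
```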
You might also want to check out this community post about duplicate management, duplicate matching, and record clustering.