Can I delete some duplicates with my iPhone?
Hello,
I have some duplicated rows on some key (in this case a phone number). At the very least, I wanted to flag all but the most recent. In an ideal situation, I wanted to flag the rest based on several conditions.
Is this possible in DSS out of the box? If not, what would be the appropriate steps to take?
Thank you for your support.
PS: I'm fairly new to DSS hence the question.
Answers
-
Alexandru Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,226 Dataiker
Hi,
You can count duplicate phone numbers with a visual Group By recipe, then join the original dataset with the output of that recipe.
You can also use a Python recipe with pandas duplicated(), for example:
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.duplicated.html
If you want to drop rows based on duplicates in a single column, you can use drop_duplicates():
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop_duplicates.html
df.drop_duplicates(subset=['phone_number'], keep='last')
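If the goal is to flag rather than drop, duplicated() can be combined with a sort on a timestamp. The following is only a minimal sketch: the column names phone_number and updated_at, the sample rows, and the helper flag_older_duplicates are made up for illustration, and it assumes the DataFrame index is unique. In a DSS Python recipe the DataFrame would come from the input dataset rather than being built inline.

```python
import pandas as pd

# Hypothetical sample data; the column names are assumptions, not from the thread.
df = pd.DataFrame({
    "phone_number": ["555-0100", "555-0100", "555-0199", "555-0100"],
    "updated_at": pd.to_datetime(
        ["2023-01-05", "2023-03-01", "2023-02-10", "2023-02-15"]
    ),
})


def flag_older_duplicates(frame, key="phone_number", ts="updated_at"):
    """Return a copy with is_duplicate=True on every row except the most recent per key."""
    # Sort oldest to newest so that the last occurrence of each key is the most recent row.
    ordered = frame.sort_values(ts)
    out = frame.copy()
    # duplicated(keep="last") marks every occurrence except the last one as True.
    # The assignment aligns on the index, so the original row order is preserved.
    out["is_duplicate"] = ordered.duplicated(subset=[key], keep="last")
    return out


flagged = flag_older_duplicates(df)
print(flagged)
```

Keeping the flag as a column leaves every row in place, so further conditions can be applied to the flagged rows in a later step.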
Not sure what you mean by flagging the rest by several conditions; can you elaborate a bit? There are several visual processors to flag rows:
-
tgb417 Dataiku DSS Core Designer, Dataiku DSS & SQL, Dataiku DSS ML Practitioner, Dataiku DSS Core Concepts, Neuron 2020, Neuron, Registered, Dataiku Frontrunner Awards 2021 Finalist, Neuron 2021, Neuron 2022, Frontrunner 2022 Finalist, Frontrunner 2022 Winner, Dataiku Frontrunner Awards 2021 Participant, Frontrunner 2022 Participant, Neuron 2023 Posts: 1,598 Neuron
As your questions get more advanced, one thing you will likely discover about duplicates is that they can be tricky. Although you can definitely reduce the number of duplicates in your dataset, it is often not possible to find and remove all of them. So having reasonable expectations about the completeness of the results can be helpful.
In the case of phone numbers, there are often several valid ways to write the same number: with and without extensions, with and without international dialing codes, with and without long-distance prefixes.
One way to improve your match rate for duplicates is to standardize the fields before looking for duplicates. Dataiku does not have a built-in tool that does this directly. However, Dataiku DSS does allow the use of libraries from other languages like Python and R. A more advanced approach to the problem I suspect you are trying to solve is to standardize the phone numbers before looking for dupes. There are lots of ways to do this; the approach I often use is a library that someone else has written for the purpose. For phone numbers, something like the phonenumbers library in Python can help: https://pypi.org/project/phonenumbers/
-
tgb417 Dataiku DSS Core Designer, Dataiku DSS & SQL, Dataiku DSS ML Practitioner, Dataiku DSS Core Concepts, Neuron 2020, Neuron, Registered, Dataiku Frontrunner Awards 2021 Finalist, Neuron 2021, Neuron 2022, Frontrunner 2022 Finalist, Frontrunner 2022 Winner, Dataiku Frontrunner Awards 2021 Participant, Frontrunner 2022 Participant, Neuron 2023 Posts: 1,598 Neuron
Your subject line references “my iPhone”. I’m wondering how the iPhone is involved in your question about duplicates. Can you share a bit more about that if it is important?