Removing duplicate columns

frotograf · December 2020

Dear community,

Here is my case:

- I have a large database that needs cleaning.

- While performing the typical cleaning steps (parsing, etc.), I discovered that numerous columns are just duplicates of one another under different names (hundreds of them, judging by a basic analysis).

Example: one column is named "things_bought_on_2021_03_07", and its duplicates have names like "things_bought_on_2021_03_07_01" and "things_bought_on_2021_03_07_02".

I don't know of any way to deal with this in Dataiku. Working on duplicate rows would be easier, but I do not have duplicate rows in this dataset.

Thank you!

CoreyS · December 2020

@frotograf Welcome to the community! You can achieve this through a Prepare Recipe. Since you are new to DSS, I would recommend utilizing the Dataiku Academy Core Designer Learning Path. The 102 course in the path goes over Prepare Recipes.

I hope this helps!

frotograf · December 2020

I did not think of the Prepare Recipe. Thanks for the tip. I've been using various resources from Dataiku and I totally love it. As a non-coder (though I understand Python basics), this tool has been a godsend for me!

CoreyS · December 2020

Thank you for the feedback! Yes, I am very much a clicker myself, and the Prepare Recipe is a really powerful tool in DSS.

frotograf · December 2020

I'll get back to you if I don't find what I'm looking for there, so no worries. Maybe somebody else will have the same problem!

frotograf · December 2020

Thanks for the link, @CoreyS. I've gone through the Academy's 101 and part of 102, but I still have to manually click on the data set. The more advanced options are for Python users, and as a no-code user I cannot work out the answer to my question.

Still, the tool is fantastic, and I got to refresh my memory of some of the functions of DSS. I see some new additions to the software since I started using it this year!

Mateusz · December 2020

Hi @frotograf

Could you please show me a 'fake' dataset with 3-4 cases where I can see what the duplicates look like, and explain in more detail which values you want to remove? It can be a screenshot or a small portion of the data in an xlsx file.

frotograf · December 2020

@emate Thanks for the post!

It looks like the screenshot. The columns are the same and the rows have the same values; the only difference between them is the number at the end of the column name (the ones I circled).
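For readers comfortable with a little Python, the situation described above (columns with identical values but different names) can be detected by comparing column contents rather than names. The following is only a minimal sketch, not a Dataiku feature: the function name `find_duplicate_columns` and the sample values are made up for illustration, and the table is assumed to be a plain mapping of column name to a list of values (a pandas DataFrame can be converted to that shape with `df.to_dict("list")`).

```python
def find_duplicate_columns(columns):
    """Map each duplicated column name to the first column holding the same values.

    `columns` is assumed to map a column name to its list of values in row order.
    """
    first_seen = {}   # tuple of values -> name of the first column with those values
    duplicates = {}   # duplicate column name -> name of the column it copies
    for name, values in columns.items():
        key = tuple(values)
        if key in first_seen:
            duplicates[name] = first_seen[key]
        else:
            first_seen[key] = name
    return duplicates


# Made-up sample data; only the column names come from the thread.
table = {
    "things_bought_on_2021_03_07": [1, 2, 3],
    "things_bought_on_2021_03_07_01": [1, 2, 3],
    "things_bought_on_2021_03_07_02": [1, 2, 3],
    "things_bought_on_2021_03_08": [4, 5, 6],
}

duplicates = find_duplicate_columns(table)
cleaned = {name: values for name, values in table.items() if name not in duplicates}

print(duplicates)
print(list(cleaned))
```

Comparing contents means a column is only dropped when every row matches an earlier column, whatever it is called. Missing values may need separate handling, since two NaN values do not generally compare as equal.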

rachelli24 · August 2022

Did you ever figure this out?

CoreyS · August 2022

Hi @rachelli24 and welcome to the Dataiku Community. Have you tried using the Delete/Keep columns by name processor in the Prepare Recipe?

Here are some other resources you may find helpful with the Prepare Recipe:

Data Preparation (Knowledge Base)

I hope this helps!
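If the duplicates always follow the naming shown in the original question (a date followed by an extra two-digit suffix such as "_01"), a name-based pattern could select them for deletion. The sketch below only tests such a pattern in Python. The regular expression is an assumption based on the three example names in this thread, and whether it can be pasted into the processor as-is depends on the processor being set to match columns by pattern. Note that matching on names does not confirm that the column contents are actually identical.

```python
import re

# Assumed pattern: any prefix, a YYYY_MM_DD date, then one extra "_NN" suffix.
DUPLICATE_NAME = re.compile(r".*_\d{4}_\d{2}_\d{2}_\d{2}")

names = [
    "things_bought_on_2021_03_07",
    "things_bought_on_2021_03_07_01",
    "things_bought_on_2021_03_07_02",
]

# fullmatch requires the whole column name to fit the pattern.
to_delete = [name for name in names if DUPLICATE_NAME.fullmatch(name)]
to_keep = [name for name in names if not DUPLICATE_NAME.fullmatch(name)]

print(to_delete)
print(to_keep)
```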
