Removing duplicate columns

frotograf
Level 2

Dear community,

I have such a case:

- I have a large database that needs cleaning.

- while performing the typical cleaning activities (parsing etc.) I discovered that I have numerous columns that are just duplicates of one another (judging by basic analysis it's hundreds) but with different names. 

Example: one column is named "things_bought_on_2021_03_07", and its duplicates have names like "things_bought_on_2021_03_07_01" and "things_bought_on_2021_03_07_02".

I don't know of any way to deal with this in Dataiku. Working on duplicate rows would be easier 😉 (I do not have duplicate rows in this one).

Thank you! 

9 Replies
CoreyS
Dataiker Alumni

@frotograf welcome to the community! You can achieve this through a Prepare Recipe. Since you are new to DSS, I would recommend utilizing the Dataiku Academy Core Designer Learning Path. The 102 course in the path goes over Prepare Recipes.

I hope this helps!

 

Looking for more resources to help you use Dataiku effectively and upskill your knowledge? Check out these great resources: Dataiku Academy | Documentation | Knowledge Base

A reply answered your question? Mark as ‘Accepted Solution’ to help others like you!
frotograf
Level 2
Author

I did not think of the Prepare recipe. Thanks for the tip. I've been using various resources from Dataiku and I totally love it. As a non-coder (but one who understands Python basics), this tool has been a godsend for me!

CoreyS
Dataiker Alumni

Thank you for the feedback! Yes, I am very much a clicker myself, and the Prepare Recipe is a really powerful tool in DSS.

frotograf
Level 2
Author

I'll get back if I don't find what I'm looking for there, so no worries. Maybe somebody else will have the same problem! 🙂

frotograf
Level 2
Author

Thanks for the link @CoreyS. I've gone through the Academy's 101 and part of 102, but I still have to manually click on the dataset. The more advanced options are for Python users, and as a no-code user I cannot work out the answer to my question.

Still, the tool is fantastic and I got to refresh my memory of some of the functions of DSS. I see some new additions to the software since I started using it this year!

emate
Level 5

Hi @frotograf 

Could you please show me a 'fake' dataset with 3-4 cases where I can see what the duplicates look like, and explain in more detail which values you want to remove? It can be a screenshot or a small portion of the data in an xlsx file.

 

frotograf
Level 2
Author

@emate thanks for the post! 

It looks like what you see in the screenshot. The columns are the same and the rows have the same values; the only difference between them is the number at the end of the column name (the ones I marked with circles).
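For readers who are comfortable with a little Python, the situation described above (columns with identical values that differ only in name) can be detected by comparing column contents. The following is only a minimal sketch using plain Python: the `find_duplicate_columns` helper, the dict-of-lists layout, and the sample values are hypothetical and are not part of any Dataiku API. In practice something like this could be adapted inside a Python recipe.

```python
from collections import defaultdict


def find_duplicate_columns(columns):
    """Group column names whose values are identical, row for row.

    `columns` maps a column name to its list of values (hypothetical layout).
    Returns {kept_name: [duplicate_names, ...]}, keeping the first name seen.
    """
    groups = defaultdict(list)
    for name, values in columns.items():
        # Columns with exactly the same values end up under the same key.
        groups[tuple(values)].append(name)
    return {names[0]: names[1:] for names in groups.values() if len(names) > 1}


# Made-up sample data mirroring the column names from the question.
data = {
    "things_bought_on_2021_03_07": [3, 0, 5],
    "things_bought_on_2021_03_07_01": [3, 0, 5],
    "things_bought_on_2021_03_07_02": [3, 0, 5],
    "things_bought_on_2021_03_08": [1, 2, 2],
}

duplicates = find_duplicate_columns(data)
# Flat list of the column names that could be dropped.
to_drop = sorted(name for names in duplicates.values() for name in names)
print(to_drop)
```

Comparing values rather than names avoids deleting a suffixed column that only looks like a duplicate but actually holds different data.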

rachelli24
Level 1

Did you ever figure this out?

CoreyS
Dataiker Alumni

Hi @rachelli24 and welcome to the Dataiku Community. Have you tried using the Delete/Keep columns by name processor in the Prepare Recipe?
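If the duplicates always follow the naming scheme from the original question (a date-stamped name plus an extra `_01`, `_02`, ... suffix), deleting columns by name could be driven by a regular expression. The sketch below only illustrates such a pattern in Python; the pattern itself is an assumption based on the example names in this thread, and whether it can be pasted unchanged into the processor's pattern option should be checked against the Dataiku documentation. It also assumes the suffixed columns have already been confirmed to hold the same values as the base column.

```python
import re

# Assumed naming scheme: "<anything>_YYYY_MM_DD" followed by an extra "_NN" suffix.
SUFFIXED = re.compile(r"^(.*_\d{4}_\d{2}_\d{2})_\d{2}$")

# Made-up column names; only the first three come from the thread's example.
names = [
    "things_bought_on_2021_03_07",
    "things_bought_on_2021_03_07_01",
    "things_bought_on_2021_03_07_02",
    "other_2021_03_09_01",  # no un-suffixed base column, so it is kept
]

name_set = set(names)
to_delete = []
for name in names:
    match = SUFFIXED.match(name)
    # Only flag a suffixed column when its un-suffixed base column exists.
    if match and match.group(1) in name_set:
        to_delete.append(name)

print(to_delete)
```

Requiring the base column to exist is a deliberate safety check, so that a column is never removed when it is the only copy of its data.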

Here are some other resources you may find helpful with the Prepare Recipe:

I hope this helps!
