Here is my situation:
- I have a large database that needs cleaning.
- While performing the typical cleaning activities (parsing etc.) I discovered that numerous columns are just duplicates of one another (hundreds, judging by a basic analysis) but with different names.
Example: one column is named "things_bought_on_2021_03_07", and its duplicates have names like "things_bought_on_2021_03_07_01" and "things_bought_on_2021_03_07_02".
I don't know of any way to deal with this in Dataiku. Working on duplicate rows would be easier 😉 (I do not have duplicate rows in this one.)
@frotograf welcome to the community! You can achieve this through a Prepare Recipe. Since you are new to DSS, I would recommend utilizing the Dataiku Academy Core Designer Learning Path. The 102 course in the path goes over Prepare Recipes.
I hope this helps!
I did not think of the Prepare Recipe. Thanks for the tip. I've been using various resources from Dataiku and I totally love it. As a non-coder (who understands Python basics), I've found this tool to be a godsend!
Thank you for the feedback! Yes, I am very much a clicker myself, and the Prepare Recipe is a really powerful tool in DSS.
Thanks for the link @CoreyS. I've got through the Academy's 101 and part of 102, but I still have to click through the dataset manually. The more advanced options are for Python users, and as a no-code user I cannot work out the answer to my question.
Still, the tool is fantastic, and I got to refresh my memory of some of the functions of DSS. I see some new additions to the software since I started using it this year!
Could you please show me a 'fake' dataset with 3-4 cases where I can see what the duplicates look like, and explain in more detail which values you want to remove? It can be a screenshot or a small portion of the data in an xlsx file.
@emate thanks for the post!
It looks like what is shown in the screenshot. The columns are the same and the rows have the same values; the only difference between them is the number at the end of the column name (the ones I marked with circles).
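For anyone who is comfortable with a little Python, here is a minimal pandas sketch of one way to find and drop columns like these. It is not a built-in Dataiku feature, just an illustration: the helper name `drop_duplicate_columns` and the toy data are made up, and it assumes column names are unique and that "duplicate" means every row holds the same value with the same data type. The first column in each group of identical columns is kept.

```python
import pandas as pd


def drop_duplicate_columns(df):
    """Return (frame without duplicate columns, {dropped name: kept name}).

    Hypothetical helper: two columns count as duplicates when
    Series.equals() is True, i.e. same values in every row and same dtype.
    """
    kept = []      # names of the columns we keep, in original order
    dropped = {}   # dropped column name -> identical column that was kept
    for name in df.columns:
        for kept_name in kept:
            if df[name].equals(df[kept_name]):
                dropped[name] = kept_name
                break
        else:
            # no identical column seen so far, so keep this one
            kept.append(name)
    return df[kept], dropped


# Toy data made up for illustration, following the naming in the question.
df = pd.DataFrame({
    "things_bought_on_2021_03_07": [1, 0, 3],
    "things_bought_on_2021_03_07_01": [1, 0, 3],
    "things_bought_on_2021_03_07_02": [1, 0, 3],
    "things_bought_on_2021_03_08": [2, 2, 5],
})

clean, dropped = drop_duplicate_columns(df)
print(list(clean.columns))
print(dropped)
```

The `dropped` mapping is worth reviewing before deleting anything, since it shows which original column each removed column matched. Each column is compared against every kept column, which should be fine for a few hundred columns but may be slow on very wide data.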