Here is my requirement.
I want to remove duplicate rows based on one column. Is there any way to do this with DSS recipes?
Thanks in advance 🙂
@Renga3037 in the "Distinct" visual recipe you can remove duplicates based either on all columns or on a subset of columns, which can be a single column. If you choose a single column, the output contains only that column with its distinct values. If you need some logic to decide which row to keep (for example, the first row according to a sort order), look at the Window recipe, which lets you compute First, Last, Lag, etc.
Can you be more specific?
For example, assume ID is the column you want to have uniques
ID  First_Name  Last_Name  Year_Entered
1   Lebron      James      2004
2   Michael     Jordan     1985
2   Larry       Bird       1980
What would the dataset you returned look like based on that list?
I think it's because it keeps the first distinct ID that it sees. @GCase
Unfortunately, the distinct recipe as it is in DSS won't allow this.
Two solutions come to mind:
Write the request in SQL, something like this (note that DISTINCT ON is PostgreSQL syntax):
SELECT DISTINCT ON (your_column) your_table.*
FROM your_table
ORDER BY your_column;
Add a Window recipe beforehand to compute a rank column, then keep only the rows with rank 1, for example with a pre-filter in the Distinct recipe.
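To make the rank-then-filter idea concrete, here is a minimal pandas sketch of the same logic. It is not what the Window recipe runs internally. It assumes the sample table from earlier in the thread and, as my own assumption, orders each ID partition by Year_Entered; the column name row_rank is made up for the illustration.

```python
import pandas as pd

# Sample rows from the example earlier in the thread.
df = pd.DataFrame({
    "ID": [1, 2, 2],
    "First_Name": ["Lebron", "Michael", "Larry"],
    "Last_Name": ["James", "Jordan", "Bird"],
    "Year_Entered": [2004, 1985, 1980],
})

# "Window" step: order the rows within each ID (here by Year_Entered,
# an assumed ordering) and number them 1, 2, 3, ... per ID.
ordered = df.sort_values(["ID", "Year_Entered"])
ordered["row_rank"] = ordered.groupby("ID").cumcount() + 1

# "Filter" step: keep only the first-ranked row of each ID,
# then drop the helper column.
deduped = (
    ordered[ordered["row_rank"] == 1]
    .drop(columns="row_rank")
    .reset_index(drop=True)
)
print(deduped)
```

With this ordering, the row kept for ID 2 is the one with the earliest Year_Entered. Changing the sort order changes which duplicate survives, which is the main reason to use a rank instead of a plain distinct.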
Maybe a Group recipe can help? Group on the ID column and set the aggregation for the other columns to "first". That leaves you with unique values in the ID column, and for each set of duplicates only the first value of every other column is kept.
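For readers who think in code, a rough pandas equivalent of "group on ID, aggregate everything else with first" could look like the sketch below, using the sample table from earlier in the thread. One caveat that applies to pandas (I have not checked how the Group recipe treats empty cells): `first()` takes the first non-null value per column, so with missing values the output row can mix values from different input rows.

```python
import pandas as pd

# Sample rows from the example earlier in the thread.
df = pd.DataFrame({
    "ID": [1, 2, 2],
    "First_Name": ["Lebron", "Michael", "Larry"],
    "Last_Name": ["James", "Jordan", "Bird"],
    "Year_Entered": [2004, 1985, 1980],
})

# Group on ID and take the first value of every other column.
# sort=False keeps the groups in order of first appearance.
grouped = df.groupby("ID", as_index=False, sort=False).first()
print(grouped)
```

Here "first" means first in the current row order, so the result depends on how the input happens to be sorted.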
Yes, I agree with @Jurre. You can remove duplicates based on one column while keeping the data from all other columns in the output dataset by using the "Group" recipe trick described above.
Alternatively, you can use a code recipe: Remove duplicate rows in one column - Dataiku Community
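I have not reproduced the linked post here, but in a Python code recipe the usual pandas way to do this is `drop_duplicates` with a `subset`. The sketch below is only an illustration: the helper name `drop_duplicates_on` is hypothetical, the data is the sample table from earlier in the thread, and reading from and writing to the DSS datasets is left out.

```python
import pandas as pd

def drop_duplicates_on(df, key_columns, keep="first"):
    """Keep one whole row per unique combination of key_columns.

    keep="first" keeps the first row encountered, keep="last" the last one.
    (Hypothetical helper, not part of DSS or pandas.)
    """
    return df.drop_duplicates(subset=key_columns, keep=keep).reset_index(drop=True)

# Sample rows from the example earlier in the thread.
df = pd.DataFrame({
    "ID": [1, 2, 2],
    "First_Name": ["Lebron", "Michael", "Larry"],
    "Last_Name": ["James", "Jordan", "Bird"],
    "Year_Entered": [2004, 1985, 1980],
})

first_rows = drop_duplicates_on(df, ["ID"])
last_rows = drop_duplicates_on(df, ["ID"], keep="last")
print(first_rows)
print(last_rows)
```

Passing several column names in `key_columns` deduplicates on their combination instead of on a single column.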
For the Dataiku developers: yes, I think it would be useful to update the "Distinct" recipe so that you can choose which columns appear in the output, since "Distinct" is the first recipe that comes to mind for this kind of task.
Just for clarification, the task is:
- find unique combinations (based on one or more columns)
- keep one row of each combination WITH ASSOCIATED data from other columns (i.e., keep the whole original row)
Welcome @Muhanned !
If you have a clear idea for improving the Distinct recipe, please share it on the product idea pages. We can't expect our dear Dataikers to scan every conversation for possible improvement proposals.
Personally, I'm a big fan of grouping together with options like concat and its suboption "concat distinct". With a Prepare recipe afterwards, the values in such a concatenated column can be split out again. But there are multiple ways to get solid results.
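As a rough illustration of the "concat distinct" idea outside the visual recipes, a pandas sketch could look like this. The helper `concat_distinct` and the ", " separator are my own choices for the example, not the Group recipe's implementation, and the data is the sample table from earlier in the thread.

```python
import pandas as pd

def concat_distinct(values, sep=", "):
    """Join the distinct values of a group, keeping first-seen order.

    Hypothetical helper written for this example.
    """
    # dict.fromkeys drops repeats while preserving insertion order.
    return sep.join(dict.fromkeys(str(v) for v in values))

# Sample rows from the example earlier in the thread.
df = pd.DataFrame({
    "ID": [1, 2, 2],
    "First_Name": ["Lebron", "Michael", "Larry"],
    "Last_Name": ["James", "Jordan", "Bird"],
    "Year_Entered": [2004, 1985, 1980],
})

# One row per ID; duplicates are collapsed into a concatenated string.
combined = df.groupby("ID", as_index=False, sort=False).agg(
    {"First_Name": concat_distinct, "Last_Name": concat_distinct}
)
print(combined)
```

Unlike the "first" aggregation, nothing is thrown away: every distinct value survives in the concatenated cell and can be split back out later.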