Submit your innovative use case or inspiring success story to the 2023 Dataiku Frontrunner Awards! LET'S GO

Distinct recipe

Solved!
Erlebacher
Level 4
Distinct recipe

I have a dataset with 10 columns, and would like to remove rows that differ only in the first two columns. So I use the "Distinct Recipe" for this task. I get the result I want, except that the output file only contains the two columns I selected. I would like to keep all the columns. How can I accomplish this? Thanks for any insight.


Operating system used: Mac Ventura

0 Kudos
2 Solutions
tgb417

If you have a primary key like a customer number or row number on your dataset you can use a join recipe to re-connect the full dataset to the subset you want to keep. The problem with this is often that there are different values in the two rows or more rows you want to connect together.

Here is another post that might help about record deduplication (or record linkage and clustering). https://community.dataiku.com/t5/Using-Dataiku/How-to-identify-duplicates-in-a-data-set/td-p/25823

However, many of those techniques are more advanced.

I often use a partitioned sorted windowed recipe to capture one of the rows of a group of related rows. Which sounds like the problem you are trying to solve.

Here are some quick images of the setup. (My use case here is finding the most recent copy of a record that I've pulled from a REST API with the API Connect plugging. But there are a bunch of other use cases like this.)

Rather than using the Distinct or Group Recipe, try the Window Recipie.

Window Recipie to find a single row for a key or set of keysWindow Recipie to find a single row for a key or set of keys

The partition columns is the column or columns you want to find a unique record for. In my case this is just one column. In you case you would use your two columns you have been using in distinct recipe.

The Sort columns are used to pick the record you will want to keep. You want the record you are going to "keep" to be the first record in the partition. So sort order matters. In my case I want the most recent record, and the most recent time I've fetched it. Your sorting order might be different. The key point here is that the first row in each partition is the one in list that will be kept.

Window Recipe showing the columns you want to keep.Window Recipe showing the columns you want to keep.

On the Aggregations tab, you mark any of the columns you want to keep in the final dataset. We also want to Have a column added for a Row number in each partition. We do this by selecting the Row number option at the top of this screen. The #1 record in each partition is the record we will want to keep. (The remainder of the records we will filter out on the next screen.)

Now here is where the magic comes in. We are only going to keep the first row of each partition.

Window Recipe filtering out the rows in the partition that we do not wish to keep.Window Recipe filtering out the rows in the partition that we do not wish to keep.

So we are going to do a Post-filter. The new Row number column was added to your dataset by checking the computer rank "Row number" on the Aggregations tab. We are going to filter to just keep the row that ends up at the top of the partition. Rownumber == 1.

Run the recipe and you will end up with just one row from each partition. And you can do some other creative stuff with the aggregations should you want to minimum and max of the values in each of the little sets of records.

The only down side sort of is that you are left with a rownumber column on each record. This can be removed with a simple prepare recipe as the next step in your process.

Hope this helps. I use this trick quite often.

--Tom

View solution in original post

0 Kudos
Erlebacher
Level 4
Author

Excellent trick to be sure! Nonetheless, it is surprising to me that Dataiku would not have added a simple option in the "distinct" recipe that says "keep all columns". Simple omission, or was there a clear reason for the decision? I wonder.

View solution in original post

0 Kudos
7 Replies
tgb417

If you have a primary key like a customer number or row number on your dataset you can use a join recipe to re-connect the full dataset to the subset you want to keep. The problem with this is often that there are different values in the two rows or more rows you want to connect together.

Here is another post that might help about record deduplication (or record linkage and clustering). https://community.dataiku.com/t5/Using-Dataiku/How-to-identify-duplicates-in-a-data-set/td-p/25823

However, many of those techniques are more advanced.

I often use a partitioned sorted windowed recipe to capture one of the rows of a group of related rows. Which sounds like the problem you are trying to solve.

Here are some quick images of the setup. (My use case here is finding the most recent copy of a record that I've pulled from a REST API with the API Connect plugging. But there are a bunch of other use cases like this.)

Rather than using the Distinct or Group Recipe, try the Window Recipie.

Window Recipie to find a single row for a key or set of keysWindow Recipie to find a single row for a key or set of keys

The partition columns is the column or columns you want to find a unique record for. In my case this is just one column. In you case you would use your two columns you have been using in distinct recipe.

The Sort columns are used to pick the record you will want to keep. You want the record you are going to "keep" to be the first record in the partition. So sort order matters. In my case I want the most recent record, and the most recent time I've fetched it. Your sorting order might be different. The key point here is that the first row in each partition is the one in list that will be kept.

Window Recipe showing the columns you want to keep.Window Recipe showing the columns you want to keep.

On the Aggregations tab, you mark any of the columns you want to keep in the final dataset. We also want to Have a column added for a Row number in each partition. We do this by selecting the Row number option at the top of this screen. The #1 record in each partition is the record we will want to keep. (The remainder of the records we will filter out on the next screen.)

Now here is where the magic comes in. We are only going to keep the first row of each partition.

Window Recipe filtering out the rows in the partition that we do not wish to keep.Window Recipe filtering out the rows in the partition that we do not wish to keep.

So we are going to do a Post-filter. The new Row number column was added to your dataset by checking the computer rank "Row number" on the Aggregations tab. We are going to filter to just keep the row that ends up at the top of the partition. Rownumber == 1.

Run the recipe and you will end up with just one row from each partition. And you can do some other creative stuff with the aggregations should you want to minimum and max of the values in each of the little sets of records.

The only down side sort of is that you are left with a rownumber column on each record. This can be removed with a simple prepare recipe as the next step in your process.

Hope this helps. I use this trick quite often.

--Tom
0 Kudos
Erlebacher
Level 4
Author

Excellent trick to be sure! Nonetheless, it is surprising to me that Dataiku would not have added a simple option in the "distinct" recipe that says "keep all columns". Simple omission, or was there a clear reason for the decision? I wonder.

0 Kudos
tgb417

@Erlebacher ,

I may still be unclear about what you are trying to achieve.

From my point of view if you are doing distinct on a sub-set of columns. You are looking for the unique values for those columns. (A smaller subset of rows.) and a smaller set of columns. However, If you want to include the remaining column you would have to decide what to do with those extra values, each row of the group of rows of data might have unique values.

You might find this approach more helpful. Using the Group by Recipe, choose all of the columns you want to be distinct. If you want values from other columns you can either choose not to include them at all as in the error column below or use the concatenate option, as in all of the other columns. The options that are available when you include the concatenated all values of the group. Or just the distinct values of the group.

Dataiku Group by Recipie with the Group Key of multiple columns and other column values concatenated into columnsDataiku Group by Recipie with the Group Key of multiple columns and other column values concatenated into columns

If these basic visual options don't fully meet your needs. You can also try SQL, Python, or R recipes. These options provide almost an unlimited types of results.

If you would like to discuss this further, may I invite you to include a simple example of the kind of data you are starting with? And some examples of the way you would like the results to end up.

Thanks for being a member of our community.

--Tom
0 Kudos
Erlebacher
Level 4
Author

Hi Tom,

Again, excellent suggestion.

Here is a Python example of what I was seeking (which your Windows solution, or a group recipe combined with a join (less desirable), would solve. I have already implemented the solution with the Windows recipe. I seek elegance as well as functionality. I am running from the command line with the prompt `>>>`:

>>> import pandas as pd
>>> df = pd.DataFrame({'A':[3,3,4,6], 'B':[4,4,7,9], 'C':[3,6,2,3], 'D':['a','b','c','d']})
>>> df
A B C D
0 3 4 3 a
1 3 4 6 b
2 4 7 2 c
3 6 9 3 d
>>> df.drop_duplicates(['A','B'], keep='first')
A B C D
0 3 4 3 a
2 4 7 2 c
3 6 9 3 d
I do like the `concat` option in the group recipe. I have done that in pandas using list aggregation, which is a little tricky. I have one question though. Given the simple example:

A, B
3, ['a','b','c']
6, ['a','c']
From the dataframe immediately above, I would like to create a database with one of the two structures (I would like to know how to create both):

First:

A, 'a', 'b', 'c'
3, T, T, T
6, T, F, T
Second,

A, Letter
3, 'a'
3, 'b'
3, 'c'
6, 'a'
6, 'c'
This is possible in Pandas (although I forget how to do it; probably with pivots, stack, unstack, but I am not sure. Thanks!

(I tried to use code blocks, but I get the message that I have invalid HTML in my message, and I cannot get rid of it.)

0 Kudos
Erlebacher
Level 4
Author


I found the answer to my question. The prepare step: "split and fold" will definitely do my second case. (I tried it out 🙂 and it worked!). Thanks for your insights!

0 Kudos
tgb417

@Erlebacher ,

Given that you know some python, you could also look at using a python notebook / python recipe step.

Thanks for sharing.

--Tom
0 Kudos
Erlebacher
Level 4
Author

Thanks. My original code was written in python before I knew of Dataiku. My collaborators suggested I do do as much as possible through Dataiku tools.