Using Group Recipe Without Aggregations

7mxd · May 24

Hello everyone,

I believe that python allows people to use the group_by() method without any aggregations; however, in dataiku, we must aggregate when we use the group recipe. In other words, I would like to group by a specific column and keep all other columns without aggregating, is that possible in any way? Note that I do not want to use the Python code recipe.

Best regards

Ahmed

Turribeach · May 24

This may be possible with a custom aggregation. Please post an Excel with some sample dummy data and another sample on how exactly what you want to aggregate the data. Also please state where is your dataset stored (which technology).

7mxd · May 24

Assuming I have the following data, and I wanted to group by the second column, namely Director.

Movie	Director	Rating
Joker	Todd Phillips	8.4
Parasite	Bong Joon-ho	8.5
12 Angry Men	Sidney Lumet	9.0
Oppenheimer	Christopher Nolan	8.3
Shutter Island	Martin Scorsese	8.2
Interstellar	Christopher Nolan	8.7
Okja	Bong Joon-ho	7.3
Inception	Christopher Nolan	8.8

Then I will get a table with the data sorted by the Director, as follows:

Movie	Director	Rating
Okja	Bong Joon-ho	7.3
Parasite	Bong Joon-ho	8.5
Oppenheimer	Christopher Nolan	8.3
Interstellar	Christopher Nolan	8.7
Inception	Christopher Nolan	8.8
Joker	Todd Phillips	8.4
12 Angry Men	Sidney Lumet	9.0
Shutter Island	Martin Scorsese	8.2

Any clue how I can do something similar?

Also, I would like to mention that the data I am using is imported from my files (desktop).

Turribeach · May 24

In the data world what you did in the second table is not described as "grouping" but just "sorting". Grouping usually leads to a summary of the records. So for instance if you grouped your first table by Director you could see the Average Rating or the Total Number of Movies they done. If you just want to sort your data you should use the Sort recipe which doesn't need to do any grouping.

It just a matter of Python pandas that when you use group_by() without a group key it just sorts the data. This is not possible in SQL and you must always group by something if you specify a Group By clause. Of course every grouped dataset has to be sorted first so it can be properly grouped which is where the sort comes from.

7mxd · May 27

Actually I would like to group my data in the way I mentioned, because I want to use the prepare recipe with "Fill empty cell with previous/next value" step. If I had my data grouped in the way I mentioned previously, then the empty cells will be filled for each group without mixing up with other groups. Is that clear? And is there any way to do that?

Turribeach · May 27

Hi, like I said in your example you are not grouping data but sorting data. And like I said you can easily do that with a Sort recipe the output of which you can use as an input in a Prepare recipe with the "Fill empty cell with previous/next value" processor to achieve your goal. That is the beauty of the Dataiku system, you can achieve complex transformations in a step by step easy to understand way.

tgb417 · May 27

@7mxd
,

In reading this post I’ve also been thinking about the window visual recipe. The window visual recipe returns all lines and adds aggregations and values.

I don’t know how you are storing your data in Dataiku. I tend to use a SQL database. As you may know SQL type databases do not guarantee a specific order of output results unless you use an “order by” type clause. In my experience Dataiku makes in some was a similar assumption. No guarantee on order unless you specifically make a request and that is usually an output or display reality.

In the example you are showing it appears that you want to sort by director name string, rating, and possibly movie to break any ties. However, it is my guess that this is only an example of what you are trying to do, not necessarily the actual data.

if the Sort options suggested by @Turribeach
are not helping you. I’d invite to also look a the window visual recipe. You might be able to leave out the partition column turned off and only sort and apply aggregations. Or pick a column that is guaranteed to have the same value on all rows. (So that the entire dataset is the same “partition”.)

This may help a bit.

https://community.dataiku.com/t5/Switching-to-Dataiku-Forum/How-to-total-a-column/m-p/19644

Just a thought. hope this might help.

Turribeach · May 27

You raise a good point @tgb417
, my solution to use a Sort recipe before would only work where the dataset storage technology can guarantee that data written in a specific order can be read in that order as an input dataset. You are correct to point out that SQL type databases do not guarantee a specific order of output results unless you use an "order by" clause. However that shouldn't be the case for most non-SQL technologies but given that the OP has not specified the dataset storage technology my answer could be incorrect/not work.

It's worth noting that in a SQL database Dataiku dataset it's possible to use read-time ordering to request a specific order on the dataset advanced settings (see first screen shot below). And for input / external datasets the OP could also use a custom SQL Query where you can specify a sort order of the data (see second screen shot below).

I also think your suggestion of using the Window recipe might be the best option in this case. Unfortunately the OP has not shared the actual requirement from the beggining of this post, which appears to be that blanks in some column need to be filled with a previous/next value. It's not clear however which column needs to have empty values filled, whether it needs to be the next value or the previous value nor what the right ordering should be so it's not easy to suggest a solution using the Window recipe.

I think this post is a great example on the importance of posting enough information in technical forums to allow the people trying to help to have all the data needed to propose a solution. In particular it's key that the goal/requirement is clearly specified rather than making it about the step/solution/way the OP is stuck with as there could be many other/better ways of solving that requirement.

Screenshot 2024-05-27 at 19.09.36.png

Screenshot 2024-05-27 at 19.13.58.png

7mxd · May 28

As you mentioned, I may be able to sort the data in the way I want; however, when using the "Fill empty cell with previous/next value" processor, all values will be filled. In my case, I want the values to be fille group-wise; therefore, the processor will use a value in a specific group to fill next (or previous) values in the same group without filling other groups with the same value.

Does that make sense?

Turribeach · May 28

Can you please post a sample dataset with the values for the missing column filled so we can see how you want the aggregation to work?

7mxd · May 28

Sure! So the following table shows a dummy data of how my table may look like with few empty cells:

Movie	Director	Rating
Okja	Bong Joon-ho
Parasite	Bong Joon-ho	8.5
Oppenheimer	Christopher Nolan
Interstellar	Christopher Nolan	8.7
Inception	Christopher Nolan
Joker	Todd Phillips
12 Angry Men	Sidney Lumet	9.0
Shutter Island	Martin Scorsese

Next, let us assume that I filled the empty cells once using the previous values, and another time using next values. Therefore, the cells will be filled in the following way using values from the same group:

Movie	Director	Rating
Okja	Bong Joon-ho	8.5
Parasite	Bong Joon-ho	8.5
Oppenheimer	Christopher Nolan	8.7
Interstellar	Christopher Nolan	8.7
Inception	Christopher Nolan	8.7
Joker	Todd Phillips
12 Angry Men	Sidney Lumet	9.0
Shutter Island	Martin Scorsese

So we can clearly notice that missing values with no other values from the same group weren't filled.

Also, as @tgb417
previously mentioned, I am currently using the Window recipe instead of the Sort recipe, and I am having the PARTINIONING COLUMNS option only toggled on.

I hope everything is clear alright now. Please feel free to ask any question if there is anything unclear.

7mxd · May 28

Hey Tom,

I hope you are doing well.

Actually in my latest comment to @Turribeach
, I showed a clear example of what I am trying to do. I hope things makes sense now. By the way, I am using the window recipe with the partitions option toggled on only, but I am not getting the result I want yet.

Best regards

LouisDHulst · May 28

Hi @7mxd
,

I think what @tgb417
meant was that without a fuller understanding of your data and problem it's hard to recommend a solution.

For example, in your dummy data there is always only a single rating per director, and the rows with no value for Rating are always next to a row with a rating.

Using a Window recipe with "Director" as the Partitioning column and "Movie" as the Order column, I can get the lag and lead of "Rating" to get this output:

Movie	Director	Rating	Rating_lead	Rating_lag
Okja	Bong Joon-ho		8.5
Parasite	Bong Joon-ho	8.5
Oppenheimer	Christopher Nolan		8.7
Interstellar	Christopher Nolan	8.7
Inception	Christopher Nolan			8.7
Joker	Todd Phillips
12 Angry Men	Sidney Lumet	9.0
Shutter Island	Martin Scorsese

I can then use a Formula processor in a Prepare recipe with the formula

coalesce(Rating, Rating_lag, Rating_lead)

to re-create your desired output.

This method would not work if there were several empty rows in a row for the same director. It would probably not work for you if there were several ratings for each director.

How would you want to handle a director having multiple ratings? Do you just want whichever rating is closest when the data is sorted alphabetically by Director and Movie?

--

My suggested solution would be to use a Group By recipe, aggregate on Director and then compute the average of Rating. You can then use a Join recipe to add the average rating, and even create a computed column to coalesce(Rating, Rating_avg) should you want to keep the original rating.

Turribeach · May 28

Well no, it's not really clear yet even after 13 posts on this thread. For one like @LouisDHulst
says it's unclear from your data sample what exactly needs to happen with the rating when it's unavailable, should it be fetched from a previous row, from the next row, from any row? Can there be more than 1 movie with rating for the same Director?

Then your chosen data sample, while I understand it's just dummy data, doesn't really help you since no one will ever assign movie ratings like this. So while I understand you don't want to share your actual dataset using some more realistic sample data and understanding what exactly you are trying to achieve and why will help a long way to have a conclusion to this thread.

Based on the little information you provided I came up with this solution which I have no way of knowing if it meets your requirements. I created a Window recipe as follows:

Screenshot 2024-05-28 at 22.28.03.png

I sorted on Rating descending to have always a Rating value first if available. I then set the aggregation on Rating to be First:

Screenshot 2024-05-28 at 22.28.21.png Which produces this output that matches your table:

Screenshot 2024-05-28 at 22.28.39.png

7mxd · May 29

Hi @LouisDHulst
,

I tried to implement your way; however, in my real-data, I may have two different values within the same group. In this case, I believe that generating the lag and lead columns will cause a bit confusion due to the different values. In other words, if I compared my case with the dummy data I gave as an example, then I would say that I might have different ratings within "Christopher Nolan".

As @Turribeach
mentioned, I am giving this dummy data as an example to explain a general idea of what I am trying to generate. However, my real data is completely different, but I am afraid that I cannot really share the real data.

7mxd · May 29

Hi @Turribeach
,

Sorry for the confusion, and thank you for your reply. I really appreciate your help.

Your way of implementation really makes sense; however, I am not really interested in sorting the values in this specific way. I also found out that if I toggled the "First/Last not null" on, then I will be able to fill the other rows, even if the few first rows were empty. However, the only disadvantage in this case is that those few first rows will be kept empty.

Turribeach · May 29

Anonymising data is trivial. All you have to do is just change values something still meaninful but that can not be traced back. And you wouldn't need too much data to show all the different scenarios you need to cover. In any case I am not convinced you really know what you want as it seems that after every response a new requirement is added to the thread. Without further information about your data and how you want to replace the missing values it will not be possible to progress this thread.

LouisDHulst · May 29

Hi @7mxd
,

The lead/lag solution was to illustrate why you need to provide better dummy/anonymized data in order to get effective help. My solution, or the solution proposed by @Turribeach
, work perfectly well with your dummy data.

Right now it seems like you're not sure what you want to do conceptually. You've said you can have multiple values per group, but only want to have a single value. That usually means some sort of aggregation. To use the Director/Movie example: do want to see the average/median/mode rating for a Director? Do you want to get the rolling average over time? A weighted average based on box office success? Do you just want the first/last rating?

We can help you with using Dataiku, but for that we need to know what you are trying to do, and we don't know that right now

Using Group Recipe Without Aggregations

Answers

Categories

Setup Info

Tags