Survey banner
The Dataiku Community is moving to a new home! We are temporary in read only mode: LEARN MORE

Using Group Recipe Without Aggregations

7mxd
Level 2
Using Group Recipe Without Aggregations

Hello everyone,

 

I believe that python allows people to use the group_by() method without any aggregations; however, in dataiku, we must aggregate when we use the group recipe. In other words, I would like to group by a specific column and keep all other columns without aggregating, is that possible in any way? Note that I do not want to use the Python code recipe.

 

Best regards

Ahmed

 

 

0 Kudos
17 Replies
Turribeach

This may be possible with a custom aggregation. Please post an Excel with some sample dummy data and another sample on how exactly what you want to aggregate the data. Also please state where is your dataset stored (which technology).

0 Kudos
7mxd
Level 2
Author

Assuming I have the following data, and I wanted to group by the second column, namely Director.

MovieDirectorRating
JokerTodd Phillips8.4
ParasiteBong Joon-ho8.5
12 Angry MenSidney Lumet9.0
OppenheimerChristopher Nolan8.3
Shutter IslandMartin Scorsese8.2
InterstellarChristopher Nolan8.7
OkjaBong Joon-ho7.3
InceptionChristopher Nolan8.8

 

Then I will get a table with the data sorted by the Director, as follows:

MovieDirectorRating
OkjaBong Joon-ho7.3
ParasiteBong Joon-ho8.5
OppenheimerChristopher Nolan8.3
InterstellarChristopher Nolan8.7
InceptionChristopher Nolan8.8
JokerTodd Phillips8.4
12 Angry MenSidney Lumet9.0
Shutter IslandMartin Scorsese8.2

 

Any clue how I can do something similar?

Also, I would like to mention that the data I am using is imported from my files (desktop).

 

0 Kudos

In the data world what you did in the second table is not described as "grouping" but just "sorting". Grouping usually leads to a summary of the records. So for instance if you grouped your first table by Director you could see the Average Rating or the Total Number of Movies they done. If you just want to sort your data you should use the Sort recipe which doesn't need to do any grouping.

It just a matter of Python pandas that when you use group_by() without a group key it just sorts the data. This is not possible in SQL and you must always group by something if you specify a Group By clause. Of course every grouped dataset has to be sorted first so it can be properly grouped which is where the sort comes from.

0 Kudos
7mxd
Level 2
Author

Actually I would like to group my data in the way I mentioned, because I want to use the prepare recipe with "Fill empty cell with previous/next value" step. If I had my data grouped in the way I mentioned previously, then the empty cells will be filled for each group without mixing up with other groups. Is that clear? And is there any way to do that? 

 

 
0 Kudos

Hi, like I said in your example you are not grouping data but sorting data. And like I said you can easily do that with a Sort recipe the output of which you can use as an input in a Prepare recipe with the "Fill empty cell with previous/next value" processor to achieve your goal. That is the beauty of the Dataiku system, you can achieve complex transformations in a step by step easy to understand way. 

0 Kudos
7mxd
Level 2
Author

As you mentioned, I may be able to sort the data in the way I want; however, when using the "Fill empty cell with previous/next value" processor, all values will be filled. In my case, I want the values to be fille group-wise; therefore, the processor will use a value in a specific group to fill next (or previous) values in the same group without filling other groups with the same value.

Does that make sense?

0 Kudos

Can you please post a sample dataset with the values for the missing column filled so we can see how you want the aggregation to work? 

7mxd
Level 2
Author

Sure! So the following table shows a dummy data of how my table may look like with few empty cells: 

MovieDirectorRating
OkjaBong Joon-ho 
ParasiteBong Joon-ho8.5
OppenheimerChristopher Nolan 
InterstellarChristopher Nolan8.7
InceptionChristopher Nolan 
JokerTodd Phillips 
12 Angry MenSidney Lumet9.0
Shutter IslandMartin Scorsese 

 

Next, let us assume that I filled the empty cells once using the previous values, and another time using next values. Therefore, the cells will be filled in the following way using values from the same group:

MovieDirectorRating
OkjaBong Joon-ho8.5
ParasiteBong Joon-ho8.5
OppenheimerChristopher Nolan8.7
InterstellarChristopher Nolan8.7
InceptionChristopher Nolan8.7
JokerTodd Phillips 
12 Angry MenSidney Lumet9.0
Shutter IslandMartin Scorsese 

 

So we can clearly notice that missing values with no other values from the same group weren't filled. 

 

Also, as @tgb417 previously mentioned, I am currently using the Window recipe instead of the Sort recipe, and I am having the PARTINIONING COLUMNS option only toggled on.

 

I hope everything is clear alright now. Please feel free to ask any question if there is anything unclear. 

0 Kudos

Well no, it's not really clear yet even after 13 posts on this thread. For one like @LouisDHulst says it's unclear from your data sample what exactly needs to happen with the rating when it's unavailable, should it be fetched from a previous row, from the next row, from any row? Can there be more than 1 movie with rating for the same Director?

Then your chosen data sample, while I understand it's just dummy data, doesn't really help you since no one will ever assign movie ratings like this. So while I understand you don't want to share your actual dataset using some more realistic sample data and understanding what exactly you are trying to achieve and why will help a long way to have a conclusion to this thread.

Based on the little information you provided I came up with this solution which I have no way of knowing if it meets your requirements. I created a Window recipe as follows:

Screenshot 2024-05-28 at 22.28.03.png

I sorted on Rating descending to have always a Rating value first if available. I then set the aggregation on Rating to be First:

Screenshot 2024-05-28 at 22.28.21.pngWhich produces this output that matches your table:

Screenshot 2024-05-28 at 22.28.39.png 

0 Kudos
7mxd
Level 2
Author

Hi @Turribeach,

Sorry for the confusion, and thank you for your reply. I really appreciate your help.

Your way of implementation really makes sense; however, I am not really interested in sorting the values in this specific way. I also found out that if I toggled the "First/Last not null" on, then I will be able to fill the other rows, even if the few first rows were empty. However, the only disadvantage in this case is that those few first rows will be kept empty.

0 Kudos
tgb417

@7mxd  ,

In reading this post I’ve also been thinking about the window visual recipe. The window visual recipe returns all lines and adds aggregations and values.  

I don’t know how you are storing your data in Dataiku. I tend to use a SQL database.  As you may know SQL type databases do not guarantee a specific order of output results unless you use an “order by” type clause.  In my experience Dataiku makes in some was a similar assumption.  No guarantee on order unless you specifically make a request and that is usually an output or display reality.  

 In the example you are showing it appears that you want to sort by director name string, rating, and possibly movie to break any ties.  However, it is my guess that this is only an example of what you are trying to do, not necessarily the actual data.

if the Sort options suggested by @Turribeach are not helping you.  I’d invite to also look a the window visual recipe. You might be able to leave out the partition column turned off and only sort and apply aggregations.  Or pick a column that is guaranteed to have the same value on all rows.  (So that the entire dataset is the same “partition”.)

This may help a bit.

https://community.dataiku.com/t5/Switching-to-Dataiku-Forum/How-to-total-a-column/m-p/19644

Just a thought. hope this might help.  

--Tom

You raise a good point @tgb417, my solution to use a Sort recipe before would only work where the dataset storage technology can guarantee that data written in a specific order can be read in that order as an input dataset. You are correct to point out that SQL type databases do not guarantee a specific order of output results unless you use an "order by" clause. However that shouldn't be the case for most non-SQL technologies but given that the OP has not specified the dataset storage technology my answer could be incorrect/not work. 

It's worth noting that in a SQL database Dataiku dataset it's possible to use read-time ordering to request a specific order on the dataset advanced settings (see first screen shot below). And for input / external datasets the OP could also use a custom SQL Query where you can specify a sort order of the data (see second screen shot below).

I also think your suggestion of using the Window recipe might be the best option in this case. Unfortunately the OP has not shared the actual requirement from the beggining of this post, which appears to be that blanks in some column need to be filled with a previous/next value. It's not clear  however which column needs to have empty values filled, whether it needs to be the next value or the previous value nor what the right ordering should be so it's not easy to suggest a solution using the Window recipe.

I think this post is a great example on the importance of posting enough information in technical forums to allow the people trying to help to have all the data needed to propose a solution. In particular it's key that the goal/requirement is clearly specified rather than making it about the step/solution/way the OP is stuck with as there could be many other/better ways of solving that requirement. 

Screenshot 2024-05-27 at 19.09.36.png

Screenshot 2024-05-27 at 19.13.58.png

0 Kudos
7mxd
Level 2
Author

Hey Tom, 

I hope you are doing well. 

Actually in my latest comment to @Turribeach, I showed a clear example of what I am trying to do. I hope things makes sense now. By the way, I am using the window recipe with the partitions option toggled on only, but I am not getting the result I want yet.

Best regards

0 Kudos

Hi @7mxd ,

I think what @tgb417 meant was that without a fuller understanding of your data and problem it's hard to recommend a solution.

For example, in your dummy data there is always only a single rating per director, and the rows with no value for Rating are always next to a row with a rating.

Using a Window recipe with "Director" as the Partitioning column and "Movie" as the Order column, I can get the lag and lead of "Rating" to get this output:

MovieDirectorRatingRating_leadRating_lag
OkjaBong Joon-ho 8.5 
ParasiteBong Joon-ho8.5  
OppenheimerChristopher Nolan 8.7 
InterstellarChristopher Nolan8.7  
InceptionChristopher Nolan  8.7
JokerTodd Phillips   
12 Angry MenSidney Lumet9.0  
Shutter IslandMartin Scorsese   

 

I can then use a Formula processor in a Prepare recipe with the formula 

coalesce(Rating, Rating_lag, Rating_lead)

to re-create your desired output.

This method would not work if there were several empty rows in a row for the same director. It would probably not work for you if there were several ratings for each director.

How would you want to handle a director having multiple ratings? Do you just want whichever rating is closest when the data is sorted alphabetically by Director and Movie? 

--

My suggested solution would be to use a Group By recipe, aggregate on Director and then compute the average of Rating. You can then use a Join recipe to add the average rating, and even create a computed column to coalesce(Rating, Rating_avg) should you want to keep the original rating.

 

 

 

 

0 Kudos
7mxd
Level 2
Author

Hi @LouisDHulst,

I tried to implement your way; however, in my real-data, I may have two different values within the same group. In this case, I believe that generating the lag and lead columns will cause a bit confusion due to the different values. In other words, if I compared my case with the dummy data I gave as an example, then I would say that I might have different ratings within "Christopher Nolan". 

As @Turribeach mentioned, I am giving this dummy data as an example to explain a general idea of what I am trying to generate. However, my real data is completely different, but I am afraid that I cannot really share the real data.

0 Kudos

Anonymising data is trivial. All you have to do is just change values something still meaninful but that can not be traced back. And you wouldn't need too much data to show all the different scenarios you need to cover. In any case I am not convinced you really know what you want as it seems that after every response a new requirement is added to the thread. Without further information about your data and how you want to replace the missing values it will not be possible to progress this thread. 

0 Kudos

Hi @7mxd ,

The lead/lag solution was to illustrate why you need to provide better dummy/anonymized data in order to get effective help. My solution, or the solution proposed by @Turribeach, work perfectly well with your dummy data.

 

Right now it seems like you're not sure what you want to do conceptually. You've said you can have multiple values per group, but only want to have a single value. That usually means some sort of aggregation. To use the Director/Movie example: do want to see the average/median/mode rating for a Director? Do you want to get the rolling average over time? A weighted average based on box office success?  Do you just want the first/last rating?

We can help you with using Dataiku, but for that we need to know what you are trying to do, and we don't know that right now