Using Group Recipe Without Aggregations
Hello everyone,
I believe that python allows people to use the group_by() method without any aggregations; however, in dataiku, we must aggregate when we use the group recipe. In other words, I would like to group by a specific column and keep all other columns without aggregating, is that possible in any way? Note that I do not want to use the Python code recipe.
Best regards
Ahmed
Answers
-
Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 2,167 Neuron
This may be possible with a custom aggregation. Please post an Excel with some sample dummy data and another sample on how exactly what you want to aggregate the data. Also please state where is your dataset stored (which technology).
-
Assuming I have the following data, and I wanted to group by the second column, namely Director.
Movie Director Rating Joker Todd Phillips 8.4 Parasite Bong Joon-ho 8.5 12 Angry Men Sidney Lumet 9.0 Oppenheimer Christopher Nolan 8.3 Shutter Island Martin Scorsese 8.2 Interstellar Christopher Nolan 8.7 Okja Bong Joon-ho 7.3 Inception Christopher Nolan 8.8 Then I will get a table with the data sorted by the Director, as follows:
Movie Director Rating Okja Bong Joon-ho 7.3 Parasite Bong Joon-ho 8.5 Oppenheimer Christopher Nolan 8.3 Interstellar Christopher Nolan 8.7 Inception Christopher Nolan 8.8 Joker Todd Phillips 8.4 12 Angry Men Sidney Lumet 9.0 Shutter Island Martin Scorsese 8.2 Any clue how I can do something similar?
Also, I would like to mention that the data I am using is imported from my files (desktop).
-
Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 2,167 Neuron
In the data world what you did in the second table is not described as "grouping" but just "sorting". Grouping usually leads to a summary of the records. So for instance if you grouped your first table by Director you could see the Average Rating or the Total Number of Movies they done. If you just want to sort your data you should use the Sort recipe which doesn't need to do any grouping.
It just a matter of Python pandas that when you use group_by() without a group key it just sorts the data. This is not possible in SQL and you must always group by something if you specify a Group By clause. Of course every grouped dataset has to be sorted first so it can be properly grouped which is where the sort comes from.
-
Actually I would like to group my data in the way I mentioned, because I want to use the prepare recipe with "Fill empty cell with previous/next value" step. If I had my data grouped in the way I mentioned previously, then the empty cells will be filled for each group without mixing up with other groups. Is that clear? And is there any way to do that?
-
Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 2,167 Neuron
Hi, like I said in your example you are not grouping data but sorting data. And like I said you can easily do that with a Sort recipe the output of which you can use as an input in a Prepare recipe with the "Fill empty cell with previous/next value" processor to achieve your goal. That is the beauty of the Dataiku system, you can achieve complex transformations in a step by step easy to understand way.
-
tgb417 Dataiku DSS Core Designer, Dataiku DSS & SQL, Dataiku DSS ML Practitioner, Dataiku DSS Core Concepts, Neuron 2020, Neuron, Registered, Dataiku Frontrunner Awards 2021 Finalist, Neuron 2021, Neuron 2022, Frontrunner 2022 Finalist, Frontrunner 2022 Winner, Dataiku Frontrunner Awards 2021 Participant, Frontrunner 2022 Participant, Neuron 2023 Posts: 1,601 Neuron
@7mxd
,In reading this post I’ve also been thinking about the window visual recipe. The window visual recipe returns all lines and adds aggregations and values.
I don’t know how you are storing your data in Dataiku. I tend to use a SQL database. As you may know SQL type databases do not guarantee a specific order of output results unless you use an “order by” type clause. In my experience Dataiku makes in some was a similar assumption. No guarantee on order unless you specifically make a request and that is usually an output or display reality.
In the example you are showing it appears that you want to sort by director name string, rating, and possibly movie to break any ties. However, it is my guess that this is only an example of what you are trying to do, not necessarily the actual data.
if the Sort options suggested by @Turribeach
are not helping you. I’d invite to also look a the window visual recipe. You might be able to leave out the partition column turned off and only sort and apply aggregations. Or pick a column that is guaranteed to have the same value on all rows. (So that the entire dataset is the same “partition”.)This may help a bit.
https://community.dataiku.com/t5/Switching-to-Dataiku-Forum/How-to-total-a-column/m-p/19644
Just a thought. hope this might help.
-
Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 2,167 Neuron
You raise a good point @tgb417
, my solution to use a Sort recipe before would only work where the dataset storage technology can guarantee that data written in a specific order can be read in that order as an input dataset. You are correct to point out that SQL type databases do not guarantee a specific order of output results unless you use an "order by" clause. However that shouldn't be the case for most non-SQL technologies but given that the OP has not specified the dataset storage technology my answer could be incorrect/not work.
It's worth noting that in a SQL database Dataiku dataset it's possible to use read-time ordering to request a specific order on the dataset advanced settings (see first screen shot below). And for input / external datasets the OP could also use a custom SQL Query where you can specify a sort order of the data (see second screen shot below).I also think your suggestion of using the Window recipe might be the best option in this case. Unfortunately the OP has not shared the actual requirement from the beggining of this post, which appears to be that blanks in some column need to be filled with a previous/next value. It's not clear however which column needs to have empty values filled, whether it needs to be the next value or the previous value nor what the right ordering should be so it's not easy to suggest a solution using the Window recipe.
I think this post is a great example on the importance of posting enough information in technical forums to allow the people trying to help to have all the data needed to propose a solution. In particular it's key that the goal/requirement is clearly specified rather than making it about the step/solution/way the OP is stuck with as there could be many other/better ways of solving that requirement.
-
As you mentioned, I may be able to sort the data in the way I want; however, when using the "Fill empty cell with previous/next value" processor, all values will be filled. In my case, I want the values to be fille group-wise; therefore, the processor will use a value in a specific group to fill next (or previous) values in the same group without filling other groups with the same value.
Does that make sense?
-
Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 2,167 Neuron
Can you please post a sample dataset with the values for the missing column filled so we can see how you want the aggregation to work?
-
Sure! So the following table shows a dummy data of how my table may look like with few empty cells:
Movie Director Rating Okja Bong Joon-ho Parasite Bong Joon-ho 8.5 Oppenheimer Christopher Nolan Interstellar Christopher Nolan 8.7 Inception Christopher Nolan Joker Todd Phillips 12 Angry Men Sidney Lumet 9.0 Shutter Island Martin Scorsese Next, let us assume that I filled the empty cells once using the previous values, and another time using next values. Therefore, the cells will be filled in the following way using values from the same group:
Movie Director Rating Okja Bong Joon-ho 8.5 Parasite Bong Joon-ho 8.5 Oppenheimer Christopher Nolan 8.7 Interstellar Christopher Nolan 8.7 Inception Christopher Nolan 8.7 Joker Todd Phillips 12 Angry Men Sidney Lumet 9.0 Shutter Island Martin Scorsese So we can clearly notice that missing values with no other values from the same group weren't filled.
Also, as @tgb417
previously mentioned, I am currently using the Window recipe instead of the Sort recipe, and I am having the PARTINIONING COLUMNS option only toggled on.I hope everything is clear alright now. Please feel free to ask any question if there is anything unclear.
-
Hey Tom,
I hope you are doing well.
Actually in my latest comment to @Turribeach
, I showed a clear example of what I am trying to do. I hope things makes sense now. By the way, I am using the window recipe with the partitions option toggled on only, but I am not getting the result I want yet.Best regards
-
LouisDHulst Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Neuron, Registered, Neuron 2023 Posts: 54 Neuron
Hi @7mxd
,I think what @tgb417
meant was that without a fuller understanding of your data and problem it's hard to recommend a solution.For example, in your dummy data there is always only a single rating per director, and the rows with no value for Rating are always next to a row with a rating.
Using a Window recipe with "Director" as the Partitioning column and "Movie" as the Order column, I can get the lag and lead of "Rating" to get this output:
Movie Director Rating Rating_lead Rating_lag Okja Bong Joon-ho 8.5 Parasite Bong Joon-ho 8.5 Oppenheimer Christopher Nolan 8.7 Interstellar Christopher Nolan 8.7 Inception Christopher Nolan 8.7 Joker Todd Phillips 12 Angry Men Sidney Lumet 9.0 Shutter Island Martin Scorsese I can then use a Formula processor in a Prepare recipe with the formula
coalesce(Rating, Rating_lag, Rating_lead)
to re-create your desired output.
This method would not work if there were several empty rows in a row for the same director. It would probably not work for you if there were several ratings for each director.
How would you want to handle a director having multiple ratings? Do you just want whichever rating is closest when the data is sorted alphabetically by Director and Movie?
--
My suggested solution would be to use a Group By recipe, aggregate on Director and then compute the average of Rating. You can then use a Join recipe to add the average rating, and even create a computed column to coalesce(Rating, Rating_avg) should you want to keep the original rating.
-
Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 2,167 Neuron
Well no, it's not really clear yet even after 13 posts on this thread. For one like @LouisDHulst
says it's unclear from your data sample what exactly needs to happen with the rating when it's unavailable, should it be fetched from a previous row, from the next row, from any row? Can there be more than 1 movie with rating for the same Director?Then your chosen data sample, while I understand it's just dummy data, doesn't really help you since no one will ever assign movie ratings like this. So while I understand you don't want to share your actual dataset using some more realistic sample data and understanding what exactly you are trying to achieve and why will help a long way to have a conclusion to this thread.
Based on the little information you provided I came up with this solution which I have no way of knowing if it meets your requirements. I created a Window recipe as follows:
I sorted on Rating descending to have always a Rating value first if available. I then set the aggregation on Rating to be First:
Which produces this output that matches your table:
-
Hi @LouisDHulst
,I tried to implement your way; however, in my real-data, I may have two different values within the same group. In this case, I believe that generating the lag and lead columns will cause a bit confusion due to the different values. In other words, if I compared my case with the dummy data I gave as an example, then I would say that I might have different ratings within "Christopher Nolan".
As @Turribeach
mentioned, I am giving this dummy data as an example to explain a general idea of what I am trying to generate. However, my real data is completely different, but I am afraid that I cannot really share the real data. -
Hi @Turribeach
,Sorry for the confusion, and thank you for your reply. I really appreciate your help.
Your way of implementation really makes sense; however, I am not really interested in sorting the values in this specific way. I also found out that if I toggled the "First/Last not null" on, then I will be able to fill the other rows, even if the few first rows were empty. However, the only disadvantage in this case is that those few first rows will be kept empty.
-
Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 2,167 Neuron
Anonymising data is trivial. All you have to do is just change values something still meaninful but that can not be traced back. And you wouldn't need too much data to show all the different scenarios you need to cover. In any case I am not convinced you really know what you want as it seems that after every response a new requirement is added to the thread. Without further information about your data and how you want to replace the missing values it will not be possible to progress this thread.
-
LouisDHulst Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Neuron, Registered, Neuron 2023 Posts: 54 Neuron
Hi @7mxd
,The lead/lag solution was to illustrate why you need to provide better dummy/anonymized data in order to get effective help. My solution, or the solution proposed by @Turribeach
, work perfectly well with your dummy data.Right now it seems like you're not sure what you want to do conceptually. You've said you can have multiple values per group, but only want to have a single value. That usually means some sort of aggregation. To use the Director/Movie example: do want to see the average/median/mode rating for a Director? Do you want to get the rolling average over time? A weighted average based on box office success? Do you just want the first/last rating?
We can help you with using Dataiku, but for that we need to know what you are trying to do, and we don't know that right now