Do you know the History of Data Science? READ MORE

Best way to use Group recipe to enable double aggregation

GSung
Level 2
Best way to use Group recipe to enable double aggregation

What is the best way to use group recipe to allow double aggregation?

For example:

I have the following raw data

Col 1Col 2Total
Division 1Product 15
Division 1Product 210
Division 2Product 115
Division 2Product 420

 

I wanted to create the following results

Business CategoryExpected Total
Business 1 (Summation of Division 1 items) 5+10 = 15
Business 2 (Summation of product 1 items)5+15 = 20

 

In the group recipe, I tried to use a CASE/THEN SQL formula to create a new column to define business 1 and business 2, but this does not seem to allow the total number to be used twice and I ended up with the below result

Business CategoryExpected Total
Business 1 (Summation of Division 1 items) 5+10 = 15
Business 2 (Summation of product 1 items)15 

 

0 Kudos
7 Replies
Ignacio_Toledo

Hi @GSung,

I wasn't able to find a solution using only a group recipe... but with some extra steps I was able to produce the result you wanted.

Since it involved several steps, I put everything in a project (attached here) that you can import into your DSS instance to inspect it (Created with DSS 8, ignore any warnings about missing plugins)

Now, hopefully someone will come with a single step solution.

Cheers!

GSung
Level 2
Author

Hi @Ignacio_Toledo  - what is the best way to view your recipes and steps (I downloaded your zip file but was not sure how to view the json files)?

0 Kudos
Ignacio_Toledo

Hi @GSung. The file is a DSS project that you can import directly from Dataiku, as a zip file.

When you create a project in DSS there will be an option to "import". 

Hope this clarifies the intended use of the file! 

0 Kudos
GSung
Level 2
Author

@Ignacio_Toledo thank you, this is very useful! Is there an alternative to folding the columns, since in my dataset, I have multiple columns, and some values are repeated in the different columns.

0 Kudos
Ignacio_Toledo

If I understand correctly your question @GSung , as long as the columns have all the same pattern (Col 1, Col 2, Col 3, etc) the step doing the folding will do all of them:

Selection_015.png

 Now, if for the same row you have a value repeated on one of the columns, a bit more of work will be required to clean the duplicates. Again, if I understand well your question.

If I'm not answering your question, could you provide again a sample of your data with more columns to understand what might be the limitation?

Cheers!

Ignacio

GSung
Level 2
Author

@Ignacio_Toledo 

That is correct. Is there an other way of calculating the total without folding the columns and/or removing the duplicates?

For example, 1) there might be same values for different columns 2) the end results (For example: Business 1 could be the summation when col 1 = XYZ and col 2 = BCD or ABC; Business 2 could be the summation when col 2 = DEF)

Therefore, for the end results, I would like to be able to set conditions, while allowing dataiku to use the same row more than once, depending on the conditions

Col 1Col 2Total
XYZBCD5
XYZABC10
ABCDEF20
Ignacio_Toledo

Hi @GSung 

I'm not sure I understand how do you want to manipulate the data, maybe because I'm not familiar with the kind of data you are handling.

If you have some business defined by the value of one column, but other defined by the value of two columns, most probably you will need different steps to manipulate the dataset to get the results what you want. The example I provided first was following exactly the example you proposed, but it won't work for the new examples you propose.

I don't think there is a visual recipe in DSS that will do what you want in a single step. Only with a python recipe you could create a script to make this happen in one step.

0 Kudos
A banner prompting to get Dataiku DSS