Best way to use Group recipe to enable double aggregation

Options
GSung
GSung Registered Posts: 27 ✭✭✭✭

What is the best way to use group recipe to allow double aggregation?

For example:

I have the following raw data

Col 1Col 2Total
Division 1Product 15
Division 1Product 210
Division 2Product 115
Division 2Product 420

I wanted to create the following results

Business CategoryExpected Total
Business 1 (Summation of Division 1 items) 5+10 = 15
Business 2 (Summation of product 1 items)5+15 = 20

In the group recipe, I tried to use a CASE/THEN SQL formula to create a new column to define business 1 and business 2, but this does not seem to allow the total number to be used twice and I ended up with the below result

Business CategoryExpected Total
Business 1 (Summation of Division 1 items) 5+10 = 15
Business 2 (Summation of product 1 items)15

Answers

  • Ignacio_Toledo
    Ignacio_Toledo Dataiku DSS Core Designer, Dataiku DSS Core Concepts, Neuron 2020, Neuron, Registered, Dataiku Frontrunner Awards 2021 Finalist, Neuron 2021, Neuron 2022, Frontrunner 2022 Finalist, Frontrunner 2022 Winner, Dataiku Frontrunner Awards 2021 Participant, Frontrunner 2022 Participant, Neuron 2023 Posts: 411 Neuron
    Options

    Hi @GSung
    ,

    I wasn't able to find a solution using only a group recipe... but with some extra steps I was able to produce the result you wanted.

    Since it involved several steps, I put everything in a project (attached here) that you can import into your DSS instance to inspect it (Created with DSS 8, ignore any warnings about missing plugins)

    Now, hopefully someone will come with a single step solution.

    Cheers!

  • GSung
    GSung Registered Posts: 27 ✭✭✭✭
    Options

    Hi @Ignacio_Toledo
    - what is the best way to view your recipes and steps (I downloaded your zip file but was not sure how to view the json files)?

  • Ignacio_Toledo
    Ignacio_Toledo Dataiku DSS Core Designer, Dataiku DSS Core Concepts, Neuron 2020, Neuron, Registered, Dataiku Frontrunner Awards 2021 Finalist, Neuron 2021, Neuron 2022, Frontrunner 2022 Finalist, Frontrunner 2022 Winner, Dataiku Frontrunner Awards 2021 Participant, Frontrunner 2022 Participant, Neuron 2023 Posts: 411 Neuron
    Options

    Hi @GSung
    . The file is a DSS project that you can import directly from Dataiku, as a zip file.

    When you create a project in DSS there will be an option to "import".

    Hope this clarifies the intended use of the file!

  • GSung
    GSung Registered Posts: 27 ✭✭✭✭
    Options

    @Ignacio_Toledo
    thank you, this is very useful! Is there an alternative to folding the columns, since in my dataset, I have multiple columns, and some values are repeated in the different columns.

  • Ignacio_Toledo
    Ignacio_Toledo Dataiku DSS Core Designer, Dataiku DSS Core Concepts, Neuron 2020, Neuron, Registered, Dataiku Frontrunner Awards 2021 Finalist, Neuron 2021, Neuron 2022, Frontrunner 2022 Finalist, Frontrunner 2022 Winner, Dataiku Frontrunner Awards 2021 Participant, Frontrunner 2022 Participant, Neuron 2023 Posts: 411 Neuron
    Options

    If I understand correctly your question @GSung
    , as long as the columns have all the same pattern (Col 1, Col 2, Col 3, etc) the step doing the folding will do all of them:

    Selection_015.png

     Now, if for the same row you have a value repeated on one of the columns, a bit more of work will be required to clean the duplicates. Again, if I understand well your question.

    If I'm not answering your question, could you provide again a sample of your data with more columns to understand what might be the limitation?

    Cheers!

    Ignacio

  • GSung
    GSung Registered Posts: 27 ✭✭✭✭
    Options

    @Ignacio_Toledo

    That is correct. Is there an other way of calculating the total without folding the columns and/or removing the duplicates?

    For example, 1) there might be same values for different columns 2) the end results (For example: Business 1 could be the summation when col 1 = XYZ and col 2 = BCD or ABC; Business 2 could be the summation when col 2 = DEF)

    Therefore, for the end results, I would like to be able to set conditions, while allowing dataiku to use the same row more than once, depending on the conditions

    Col 1Col 2Total
    XYZBCD5
    XYZABC10
    ABCDEF20
  • Ignacio_Toledo
    Ignacio_Toledo Dataiku DSS Core Designer, Dataiku DSS Core Concepts, Neuron 2020, Neuron, Registered, Dataiku Frontrunner Awards 2021 Finalist, Neuron 2021, Neuron 2022, Frontrunner 2022 Finalist, Frontrunner 2022 Winner, Dataiku Frontrunner Awards 2021 Participant, Frontrunner 2022 Participant, Neuron 2023 Posts: 411 Neuron
    Options

    Hi @GSung

    I'm not sure I understand how do you want to manipulate the data, maybe because I'm not familiar with the kind of data you are handling.

    If you have some business defined by the value of one column, but other defined by the value of two columns, most probably you will need different steps to manipulate the dataset to get the results what you want. The example I provided first was following exactly the example you proposed, but it won't work for the new examples you propose.

    I don't think there is a visual recipe in DSS that will do what you want in a single step. Only with a python recipe you could create a script to make this happen in one step.

Setup Info
    Tags
      Help me…