How to reproduce a groupBy with selected values without aggregations ?

Setup Info
Registered Posts: 3 ✭✭
edited March 20 in Using Dataiku

SELECT DISTINCT
N1.COLUMN1 as "Column 1",
max(N1.COLUMN2) as "Column 2"

FROM DB

GROUP BY N1.COLUMN1 ORDER BY N1.COLUMN1


Hello, I'm trying to reproduce an sql script with Dataiku recipes, and in that script, there is a groupBy with selected columns with no aggregation, how do I reproduce this with Dataiku recipes and PostgreSQL ? The issue is that I don't know what is the default aggregation SQLite used in the case above.

In that case, how to get "N1.COLUMN1" like SQLite would ?

If someone has the answer it would really help !

Welcome!

It looks like you're new here. Sign in or register to get started.

Answers

  • Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 5 Dataiker

    Hi @cretois!

    If I understand your question correctly, I believe the Distinct or Group Recipes could accomplish what you want.

    For a group by with selected columns and no aggregation, use the Distinct Recipe:
    1. Highlight your dataset in the flow and select the Distinct Recipe from the Action menu
    2. Enter your table name and choose your data connection; here you could select a PostgreSQL connection
    3. Configure the recipe by choosing which columns you want and then run it

    My example output is just a table with the distinct values of "week"


    For a group by with aggregations, use the Group Recipe:
    1. Repeat steps from above just choose the Group Recipe and select a column in the Group By dropdown
    2. Configure the recipe by choosing your desired aggregations and run it


    My example output is just a table with the maximum "home_pts" for each "week"

  • Registered Posts: 3 ✭✭

    Thank you for the quick response ! The issue is that the GroupBy I want to reproduce select a value WITHOUT aggregation, and some other WITH aggregation, and with Dataiku I don't know how to reproduce the "without" aggregation, I have tried with "FIRST", "LAST", "MIN" and "MAX" (it's a text) and I never get the same results as the SQL script :

    SELECT DISTINCT
    N1.COLUMN1 as "Column 1",
    max(N1.COLUMN2) as "Column 2" FROM DB GROUP BY N1.COLUMN1 ORDER BY N1.COLUMN1


    In the example above, there is no aggregation on N1.COLUMN1, and that's what I want to reproduce

  • Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 2,344 Neuron

    I don't really understand what the issue is here. If you use a Group recipe then select to Group By COLUMN1 and select Max of COLUMN2 and it would do what you want. Otherwise post the a screen shot of the Group section of the Group recipe and why you think there is an issue with the output. Here is how my SQL query looks like. The DISTINCT clause in your sample SQL is redundant and not really required. If you group by a column you will get distinct values.

  • Registered Posts: 3 ✭✭

    Sorry you're right, it was a wrong example, I'll show you with a more precise one :

    SELECT DISTINCT  

    N1.COLUMN1 as "Column 1",
    max(N1.COLUMN2) as "Column 2", N1.COLUMN3 as "Column 3",

    FROM DB
    GROUP BY N1.COLUMN1
    ORDER BY N1.COLUMN1


    Forgot that you can select COLUMN1 because it's the Group Key, in that case how do I get COLUMN3 without aggregation ?

  • Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 2,344 Neuron
    edited 4:18PM

    This is not a valid SQL (at least on regular SQL engines like Oracle, MS-SQL, MySQL). You must either add a column with an aggregate function or include it in the Group By clause. So either add COLUMN3 as a second group column or add it with an aggregate function. Those are your choices.

    Will produce this SQL:

  • Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 5 Dataiker

    Hi @cretois,

    I believe @Turribeach is correct in that what you're asking would be result in invalid SQL.

    You can accomplish what you want by using the Group recipe followed by a Join recipe.
    1. Use Group to get your aggregations (from your pseudo code, group by "Column1" and calculate max of "Column2")
    2. Use Join to combine your aggregations with your non-aggregated columns from the original dataset (join on "Column1" and include "Column2_max" & "Column3")

Welcome!

It looks like you're new here. Sign in or register to get started.

Welcome!

It looks like you're new here. Sign in or register to get started.