Median on grouping data

Options
PANKAJ
PANKAJ Partner, L2 Admin, Dataiku DSS Core Designer, Dataiku DSS & SQL, Dataiku DSS ML Practitioner, Dataiku DSS Core Concepts, Registered Posts: 26 Partner

I suggest Dataiku add a feature to calculate median easily, although it is available inside 'Analyze' for numeric columns.

But if someone wants to find the median of each group after grouping, like we can find min, max & avg, etc.

Adding a median feature after grouping data will definitely help the user by giving them a smooth experience.

10
10 votes

In the Backlog · Last Updated

Comments

  • AshleyW
    AshleyW Dataiker, Alpha Tester, Dataiku DSS Core Designer, Registered, Product Ideas Manager Posts: 161 Dataiker
    Options

    Hi @PANKAJ
    ,

    Thanks for submitting this idea. You'll be pleased to hear this idea is in our backlog--adding median to the list of available aggregation--and we are determining the next steps for development. We can't provide a timeline at this point, but be sure to check back for updates.

    For everyone else, kudos the original post to signal that you're interested in Dataiku developing and releasing this feature!

    Take care,

    Ashley

  • tgb417
    tgb417 Dataiku DSS Core Designer, Dataiku DSS & SQL, Dataiku DSS ML Practitioner, Dataiku DSS Core Concepts, Neuron 2020, Neuron, Registered, Dataiku Frontrunner Awards 2021 Finalist, Neuron 2021, Neuron 2022, Frontrunner 2022 Finalist, Frontrunner 2022 Winner, Dataiku Frontrunner Awards 2021 Participant, Frontrunner 2022 Participant, Neuron 2023 Posts: 1,595 Neuron
    Options

    @AshleyW
    & @PANKAJ
    ,

    If we are going to add Median, what about Mode as well?

    Second, while we are waiting for this feature.

    Has anyone worked out the syntax to add these calculations to a window recipe as a custom aggregation? I'm running Postgres as the underlying DBMS.

  • natejgardner
    natejgardner Neuron, Registered, Neuron 2022, Neuron 2023 Posts: 151 Neuron
    edited July 17
    Options

    As long as we're adding averages, geometric and harmonic means and RMS would be nice to have also.

    @tgb417
    for Postgres, try

    percentile_cont(0.5) WITHIN GROUP (ORDER BY num_value) as median_value,
    mode() WITHIN GROUP (ORDER BY num_value) as modal_value

    For geometric and harmonic means in Postgres, see this link

  • tgb417
    tgb417 Dataiku DSS Core Designer, Dataiku DSS & SQL, Dataiku DSS ML Practitioner, Dataiku DSS Core Concepts, Neuron 2020, Neuron, Registered, Dataiku Frontrunner Awards 2021 Finalist, Neuron 2021, Neuron 2022, Frontrunner 2022 Finalist, Frontrunner 2022 Winner, Dataiku Frontrunner Awards 2021 Participant, Frontrunner 2022 Participant, Neuron 2023 Posts: 1,595 Neuron
    Options

    @natejgardner
    ,

    I've figured out how to do an Average as a custom aggregation using the following.

    Custom Average.jpg

    Note that one needs to drop the explicit column alias

    as custome_AVG_1

    If you call out the column name in the custom in the SQL chunk you provide for the custom aggregation you break things.

    However, I've not yet figured out how to do the precentile_con(0.5) and mode(). For example, if I try something like this:

    Custom Mode.jpg

    I get an error message. In a detailed search of the dataiku.com website, I've not found a solution.

  • natejgardner
    natejgardner Neuron, Registered, Neuron 2022, Neuron 2023 Posts: 151 Neuron
    Options

    @tgb417
    it looks like Postgres doesn't directly support these as window functions, but you can work around it with a join: https://stackoverflow.com/questions/39683330/percentile-calculation-with-a-window-function

  • Hayet
    Hayet Registered Posts: 1 ✭✭✭
    Options

    Another workaround :

    1. In the group by you can concatenate the values of interest.

    2. Then in a prepare recipe, with a python step, you can transform this concatenation into a python list (with library ast), then compute median on this list (with library statistics)

  • 9squirrels
    9squirrels Registered Posts: 1
    Options

    I'm guessing that Median is "hard" because it always seems to be overlooked. I'd go further than just Median though and add in things like P5, P10, P25 etc for the various important percentiles.

Setup Info
    Tags
      Help me…