Generating rows and 3 months avg. value

Options
ShrimpMania
ShrimpMania Registered Posts: 15

I used window visual recipes in order to get 3 months avg. value but the output wasn't what I expected. I want to get the exact 3 months avg. value but in the window recipe it calculates recent 3months with only available date so if it doesn't have data in the recent 3 months then it doesn't work.

e.g. for the 2nd row of the Input, as there are no Feb and March data, it still shows 3 months avg value is 440 but I was expecting it to be (0+0+440) / 3 = 146.66.

I have 2000ish rows including date from 2020 and the data will be getting bigger as it's generated every month. What I need to do is

1. all the columns except date and value should be grouped

2. if there's no data then it should be filled with 0

3. Need to get exact 3 months avg. value.

How can I make it? I'd prefer to have it done via SQL code, if it's better to be done via python then it would be good as well.

Input:

specialty codeproductcompanytypedatevalue
Achocolatem&m'sgrocery2020-01-01T00:00:00.000Z3300
Acandym&m'sgrocery2020-04-01T00:00:00.000Z440
Bpastamollygrocery2020-01-01T00:00:00.000Z1100
Bpastamollygrocery2020-03-01T00:00:00.000Z800
Elaptophpelectric2020-01-01T00:00:00.000Z8800
Elaptophpelectric2020-02-01T00:00:00.000Z11000
Elaptophpelectric2020-03-01T00:00:00.000Z8800
Elaptophpelectric2020-04-01T00:00:00.000Z7000

Expected Output:

specialty codeproductcompanytypedatevalue3mth avg.
Achocolatem&m'sgrocery2020-01-01T00:00:00.000Z33001100
Achocolatem&m'sgrocery2020-02-01T00:00:00.000Z01100
Achocolatem&m'sgrocery2020-03-01T00:00:00.000Z01100
Achocolatem&m'sgrocery2020-04-01T00:00:00.000Z00
Acandym&m'sgrocery2020-01-01T00:00:00.000Z00
Acandym&m'sgrocery2020-02-01T00:00:00.000Z00
Acandym&m'sgrocery2020-03-01T00:00:00.000Z00
Acandym&m'sgrocery2020-04-01T00:00:00.000Z440146.666667
Bpastamollygrocery2020-01-01T00:00:00.000Z1100366.666667
Bpastamollygrocery2020-02-01T00:00:00.000Z0366.666667
Bpastamollygrocery2020-03-01T00:00:00.000Z800633.333333
Bpastamollygrocery2020-04-01T00:00:00.000Z0266.666667
Elaptophpelectric2020-01-01T00:00:00.000Z88002933.33333
Elaptophpelectric2020-02-01T00:00:00.000Z110006600
Elaptophpelectric2020-03-01T00:00:00.000Z88009533.33333
Elaptophpelectric2020-04-01T00:00:00.000Z70008933.33333

Best Answer

  • Turribeach
    Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 1,737 Neuron
    Answer ✓
    Options

    I am not sure if the Window Recipe can deal with missing rows. But this would work if you do data densification, in other words fill the gaps that your data has in SQL.

    Build a select statement with a distinct list of all products:

    SELECT DISTINCT specialty, code, product, company, type FROM some_table

    Now build a select statement with all the dates using the min and max from your column date as driver (ie 2020-01-01, 2020-01-02, 2020-01-03, 2020-01-04, etc). I can't tell you how to do this as I don't know which SQL technology you are using but it's easy to do on most SQL technologies.

    Then do a cartisian join from the two select statements above adding also a dummy column called value with 0 as the value. So now you should have a query that returns every single product combination for every single date and there are no missing gaps.

    Finally left join the "densification" query with your original query using all your product attributes (specialty, code, product, company, type and date) doing a group by and sum(value) to remove the 0 value rows where you already have an actual value. Finally use the Window recipe as above and it should work fine now.

Setup Info
    Tags
      Help me…