Generating rows and 3 months avg. value
I used window visual recipes in order to get 3 months avg. value but the output wasn't what I expected. I want to get the exact 3 months avg. value but in the window recipe it calculates recent 3months with only available date so if it doesn't have data in the recent 3 months then it doesn't work.
e.g. for the 2nd row of the Input, as there are no Feb and March data, it still shows 3 months avg value is 440 but I was expecting it to be (0+0+440) / 3 = 146.66.
I have 2000ish rows including date from 2020 and the data will be getting bigger as it's generated every month. What I need to do is
1. all the columns except date and value should be grouped
2. if there's no data then it should be filled with 0
3. Need to get exact 3 months avg. value.
How can I make it? I'd prefer to have it done via SQL code, if it's better to be done via python then it would be good as well.
Input:
specialty code | product | company | type | date | value |
A | chocolate | m&m's | grocery | 2020-01-01T00:00:00.000Z | 3300 |
A | candy | m&m's | grocery | 2020-04-01T00:00:00.000Z | 440 |
B | pasta | molly | grocery | 2020-01-01T00:00:00.000Z | 1100 |
B | pasta | molly | grocery | 2020-03-01T00:00:00.000Z | 800 |
E | laptop | hp | electric | 2020-01-01T00:00:00.000Z | 8800 |
E | laptop | hp | electric | 2020-02-01T00:00:00.000Z | 11000 |
E | laptop | hp | electric | 2020-03-01T00:00:00.000Z | 8800 |
E | laptop | hp | electric | 2020-04-01T00:00:00.000Z | 7000 |
Expected Output:
specialty code | product | company | type | date | value | 3mth avg. |
A | chocolate | m&m's | grocery | 2020-01-01T00:00:00.000Z | 3300 | 1100 |
A | chocolate | m&m's | grocery | 2020-02-01T00:00:00.000Z | 0 | 1100 |
A | chocolate | m&m's | grocery | 2020-03-01T00:00:00.000Z | 0 | 1100 |
A | chocolate | m&m's | grocery | 2020-04-01T00:00:00.000Z | 0 | 0 |
A | candy | m&m's | grocery | 2020-01-01T00:00:00.000Z | 0 | 0 |
A | candy | m&m's | grocery | 2020-02-01T00:00:00.000Z | 0 | 0 |
A | candy | m&m's | grocery | 2020-03-01T00:00:00.000Z | 0 | 0 |
A | candy | m&m's | grocery | 2020-04-01T00:00:00.000Z | 440 | 146.666667 |
B | pasta | molly | grocery | 2020-01-01T00:00:00.000Z | 1100 | 366.666667 |
B | pasta | molly | grocery | 2020-02-01T00:00:00.000Z | 0 | 366.666667 |
B | pasta | molly | grocery | 2020-03-01T00:00:00.000Z | 800 | 633.333333 |
B | pasta | molly | grocery | 2020-04-01T00:00:00.000Z | 0 | 266.666667 |
E | laptop | hp | electric | 2020-01-01T00:00:00.000Z | 8800 | 2933.33333 |
E | laptop | hp | electric | 2020-02-01T00:00:00.000Z | 11000 | 6600 |
E | laptop | hp | electric | 2020-03-01T00:00:00.000Z | 8800 | 9533.33333 |
E | laptop | hp | electric | 2020-04-01T00:00:00.000Z | 7000 | 8933.33333 |
Best Answer
-
Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 2,160 Neuron
I am not sure if the Window Recipe can deal with missing rows. But this would work if you do data densification, in other words fill the gaps that your data has in SQL.
Build a select statement with a distinct list of all products:
SELECT DISTINCT specialty, code, product, company, type FROM some_table
Now build a select statement with all the dates using the min and max from your column date as driver (ie 2020-01-01, 2020-01-02, 2020-01-03, 2020-01-04, etc). I can't tell you how to do this as I don't know which SQL technology you are using but it's easy to do on most SQL technologies.Then do a cartisian join from the two select statements above adding also a dummy column called value with 0 as the value. So now you should have a query that returns every single product combination for every single date and there are no missing gaps.
Finally left join the "densification" query with your original query using all your product attributes (specialty, code, product, company, type and date) doing a group by and sum(value) to remove the 0 value rows where you already have an actual value. Finally use the Window recipe as above and it should work fine now.