How to get the previous year's second half data values in dataiku

Tsurapaneni
Level 3

Hi Team, 

I have a use case where, for a given id, I have to subtract the first half of the year (Jan + 6 months) from the second half of the same year (the 6 months ending in Dec), and subtract the second half of the previous year from the first half of the year. I have created a column to identify the first and second half of the year.

In the example below, for id = 1 there are two rows whose months are immediately six months apart (Jan 2019 and Dec 2018). I added a placeholder to indicate which rows belong to the second half of the year (Dec) and which to the first half (Jan). The computations should only be done between rows that share the same id.

Please do the needful !

input dataset:

id    yr      Mnth.   Value
1     2019    jan     20
1     2018    dec     40
3     2019    dec     10
3     2019    jan     9
3     2018    dec     7

output to be:

id    yr      Mnth.   Value
1     2019    Jan     -20    (20 - 40)
3     2019    dec     1      (10 - 9)
3     2018    dec     2      (9 - 7)

 

Thank you !

RoyE
Dataiker

Hello,

There are a few ways to accomplish your goal of summarizing the provided data into your expected dataset.

The easiest method is the Group recipe, though it requires one preparation step, since the recipe only offers a "SUM" aggregation and no subtraction. This recipe can do exactly what you need: group the input dataset by the id column and sum the values into one row. If you have more ids than just 1 and 3, you will not need to modify the recipe, as it automatically detects the unique values of id. The preparation step is to turn the values you want to subtract into negative numbers: instead of A - B, compute A + (-B).
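As a minimal pandas sketch of that negate-then-sum idea (the `subtract` flag column is an assumption for illustration; in the recipe you would compute it in a Prepare step):

```python
import pandas as pd

# Hypothetical two-row input for id 1 from the thread's example.
df = pd.DataFrame({
    "id": [1, 1],
    "value": [20, 40],          # jan 2019, dec 2018
    "subtract": [False, True],  # flag the rows that should be subtracted
})

# Preparation step: turn the subtrahend into a negative number.
df.loc[df["subtract"], "value"] *= -1

# Group-recipe equivalent: a plain SUM per id now computes A + (-B).
result = df.groupby("id", as_index=False)["value"].sum()
```

For id 1 this yields 20 + (-40) = -20, matching the expected output row.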

The second method would require a custom Python script: first identify the unique values of id and save them in a list, then parse the dataframe for rows that match each saved id and subtract the values. See the sample code below:

import dataiku
import pandas as pd

# Read recipe inputs
input_dataset = dataiku.Dataset("<INPUT DATASET>")
input_df = input_dataset.get_dataframe()

# Get the unique ids within the dataset
unique_ids = input_df["id"].unique().tolist()

# Grab the rows that share the same id
result_rows = []
for val in unique_ids:
    matching = input_df[input_df["id"] == val]
    # do the subtraction here and append the result rows to result_rows

# Reassign to a new dataframe and write to the output dataset
output_df = pd.DataFrame(result_rows)
output = dataiku.Dataset("<OUTPUT DATASET>")
output.write_with_schema(output_df)

I would suggest the Group recipe, but there are other options to explore!

Roy 

Tsurapaneni
Level 3
Author

Hi @RoyE 

Thank you for the response!

The goal is not to match on the id column alone but also on the year and month columns for that id. For a particular id, it has to compare the year and month (jan indicates the first half of the year, dec the second half) and then subtract the first half of the year (jan 2019 at id = 3) from the second half of the same year (dec 2019 at id = 3), and subtract the second half of the previous year (dec 2018) from the first half of the year (jan 2019). I have edited the question to add one more row; I hope that makes it clearer.
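The pairing being described (match each half-year to the immediately preceding half-year for the same id) could be sketched in pandas as a self-merge; the `period` encoding and column names are assumptions for illustration:

```python
import pandas as pd

# Example rows from the thread.
df = pd.DataFrame({
    "id":    [1, 1, 3, 3, 3],
    "yr":    [2019, 2018, 2019, 2019, 2018],
    "mnth":  ["jan", "dec", "dec", "jan", "dec"],
    "value": [20, 40, 10, 9, 7],
})

# Encode each half-year as one sortable number: jan = first half, dec = second.
df["period"] = 2 * df["yr"] + (df["mnth"] == "dec").astype(int)

# Shift each row's period forward by one so it lines up with the next half-year,
# then merge: every row is paired with the row one half-year earlier, same id.
prev = df[["id", "period", "value"]].rename(columns={"value": "prev_value"})
prev["period"] += 1
pairs = df.merge(prev, on=["id", "period"])
pairs["diff"] = pairs["value"] - pairs["prev_value"]
```

With the thread's five input rows this produces the three expected differences: -20 for id 1 (jan 2019 minus dec 2018), 1 for id 3 (dec 2019 minus jan 2019), and 2 for id 3 (jan 2019 minus dec 2018).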

Please do the needful !

RoyE
Dataiker

Hello,

In order to accomplish the above, you will need to use the Window recipe.
https://knowledge.dataiku.com/latest/courses/visual-recipes/window.html

Select id as your partitioning column, order the window by the year and half-of-year columns, and select LeadDiff for the 'value' column on the Aggregations tab; you will then be able to output the correct dataset.
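The partitioned difference the Window recipe computes can be sketched in pandas with a groupby plus `diff` (the `period` encoding is an assumption for illustration):

```python
import pandas as pd

# Toy data from the thread: jan = first half, dec = second half of the year.
df = pd.DataFrame({
    "id":    [1, 1, 3, 3, 3],
    "yr":    [2019, 2018, 2019, 2019, 2018],
    "mnth":  ["jan", "dec", "dec", "jan", "dec"],
    "value": [20, 40, 10, 9, 7],
})

# Build a sortable half-year period and order rows within each id partition.
df["period"] = 2 * df["yr"] + (df["mnth"] == "dec").astype(int)
df = df.sort_values(["id", "period"])

# Windowed difference: current value minus the previous half-year's value
# within each id partition (NaN where no previous half-year exists).
df["diff"] = df.groupby("id")["value"].diff()
```

Each difference lands on the later of the two paired rows: -20 on jan 2019 for id 1, and 2 (jan 2019) and 1 (dec 2019) for id 3, matching the expected values.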

Hope this helps!
Roy
