How to get the previous year's second half data values in dataiku

Tsurapaneni
Tsurapaneni Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Registered Posts: 41 ✭✭✭✭

Hi Team,

I have a use case where for a given id, I have to subtract the 2nd half of the year (6months before dec) with the 1st half of the same year (jan+6months) and subtract the 1st half of the year with the 2nd half of the previous year. I have created a column to identify the first and second half of the year.

In the below example, for a given id = 1 there are 2 rows where months are immediate 6 months before to one another (jan2019 and dec2018) I gave a placeholder to indicate the rows which belongs to the months of second half of year with dec and 1st half of year with jan. The computations should be done for the rows which has the same id.

Please do the needful !

input dataset : output to be:

id yr Mnth. Value id yr Mnth. Value

1 2019 jan 20 1 2019 Jan -20 ( 20-40)

1 2018 dec 40 3 2019 dec 1 (10-9)

3 2019 dec 10 3 2018 dec 2 (9-7)

3 2019. jan 9

3. 2018 dec. 7

Thank you !

Answers

  • RoyE
    RoyE Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 31 Dataiker
    edited July 17

    Hello,

    There are a few ways to accomplish your goal to summarize the data provided into your expected dataset.

    The easiest method will be to use the Group recipe but will require one step on preparation as it only has the "SUM" function instead of subtraction. However, this recipe will be able to do exactly what you need, group the input dataset by ID column and sum the values into one row. In the case that you have more than just id 1 and 3, you will not need to modify this recipe as it will automatically detect unique values of id. The preparation step will be to turn the subtracting integers into negative numbers. Instead of A - B, it will be A + (-B).

    The second method would require a custom python script to first identify the unique values of ID, save it as an array, then parse through the dataframe for contents that match the unique IDs saved earlier and subtract the values. See sample code below:

    import dataiku
    import pandas as pd, numpy as np
    from dataiku import pandasutils as pdu
    
    # Read recipe inputs
    input = dataiku.Dataset("<INPUT DATASET>")
    input_df = input.get_dataframe()
    
    #get the unique ID within dataset
    unique_id = []
    for ind in input_df.index:
        if input_df['id'][ind] not in unique_id:
            unique_id.append(input_df['id'][ind])
    
    #grab the value if the same ID
    for val in unique_id:
        for ind in input_df.index:
            if input_df['id'][ind] == val:
                #do the subtraction here
    output df = ###
    
    #reassign to new df and output to output dataset
    output = dataiku.Dataset("<OUTPUT DATASET>")
    output.write_with_schema(output_df)
    

    I would suggest the Group recipe but there are other options to explore!

    Roy

  • Tsurapaneni
    Tsurapaneni Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Registered Posts: 41 ✭✭✭✭

    Hi @RoyE

    Thank you for response !

    The goal is not to map only the id column alone but to also map the year and the month column of that particular id. For a particular id, it has to compare the year and month (jan indicates first half of year and dec indicates second half of year) and then subtract the value of 2nd half of year (dec 2019 at (id = 3)) with the 1st half of year (jan2019 at id = 3)and 1st half of year (jan 2019) with the 2nd half of previous year.(dec 2018(previous year)). I had edited the question by adding 1 more row, Hope it helps for better understanding.

    Please do the needful !

  • RoyE
    RoyE Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 31 Dataiker

    Hello,

    In order to accomplish the above, you will need to use the Window Function.
    https://knowledge.dataiku.com/latest/courses/visual-recipes/window.html

    Select the ID as your partitioning column and select LeadDiff for the 'value' column on the Aggregation tab and you will be able to output the correct dataset.

    Hope this helps!
    Roy

Setup Info
    Tags
      Help me…