Custom aggregation in Window Recipe (Fill a columns with last nonNull value)

Registered Posts: 2 ✭✭✭

Hello

I start with Dataiku and try to fill the empty lines of a column with the last non-null value taken by the column.
I work on Dataset HDFS partitioning per day.

I have already tried the visual recipe "Fill empty cell with previous value".

Unfortunately as you can see in the attached picture; I would have liked this filling to be done only under the condition of an equivalence (here on the Device_Id , the red line is what I would like to avoid)

I suspect that by doing a "Custom aggregation" in a "Windows Recipe" it should be possible. Unfortunately I can't do it.

If someone has already encountered this problem and has a solution, I would be happy to get some help.

Thanks,
AV

Best Answer

  • Dataiker, Dataiku DSS Core Designer, Dataiku DSS Adv Designer, Registered Posts: 319 Dataiker
    edited July 2024 Answer ✓

    Hi @Avalo
    ,

    There are likely several ways to accomplish this, but I'll provide one option using a Python recipe.

    Here I created a sample dataset like you provided in your screenshot:

    Screen Shot 2021-02-04 at 4.31.50 PM.png

    I created the following python recipe and utilized the pandas groupby in combination with the fillna option to forward fill the country values within the "device_id" group. The other fill options can be found here.

    import dataiku
    import pandas as pd, numpy as np
    from dataiku import pandasutils as pdu
    
    # Read recipe inputs
    sample_data = dataiku.Dataset("sample_data")
    sample_data_df = sample_data.get_dataframe()
    # -------------------------------------------------------------------------------- NOTEBOOK-CELL: CODE
    
    # Compute recipe outputs from inputs
    python_window_df = sample_data_df # For this sample code, simply copy input to output
    
    # -------------------------------------------------------------------------------- NOTEBOOK-CELL: CODE
    # this is the main transformation.  It updates the "country" column to group by the device_id column and forward fill country values within that group 
    python_window_df['country'] = python_window_df.groupby('device_id')['country'].fillna(method='ffill')
    
    # -------------------------------------------------------------------------------- NOTEBOOK-CELL: CODE
    # Write recipe outputs
    filled = dataiku.Dataset("filled")
    filled.write_with_schema(python_window_df)

    Here's my output dataset, where the offending row you point out is now still NULL. This could be adjusted based on the fill method.  

    Screen Shot 2021-02-04 at 4.34.41 PM.png

    Hope that helps.

    Thank you,

    Sarina

Answers

  • Registered Posts: 2 ✭✭✭

    Thank you very much for your help.

    your solution is elegant and efficient.

    Thank you,
    AV

  • Dataiker, Alpha Tester, Dataiku DSS Core Designer, Registered, Product Ideas Manager Posts: 165 Dataiker

    Hi @Avalo
    ,

    You may also want to try the "Fill empty cells with previous/next value" processor that's available in the Prepare recipe. It will also let you do what you've described.

    Cheers,

    Ashley

Welcome!

It looks like you're new here. Sign in or register to get started.

Welcome!

It looks like you're new here. Sign in or register to get started.
Top