Custom aggregation in Window Recipe (Fill a columns with last nonNull value)

Options
Avalo
Avalo Registered Posts: 2 ✭✭✭

Hello

I start with Dataiku and try to fill the empty lines of a column with the last non-null value taken by the column.
I work on Dataset HDFS partitioning per day.

I have already tried the visual recipe "Fill empty cell with previous value".

Unfortunately as you can see in the attached picture; I would have liked this filling to be done only under the condition of an equivalence (here on the Device_Id , the red line is what I would like to avoid)

I suspect that by doing a "Custom aggregation" in a "Windows Recipe" it should be possible. Unfortunately I can't do it.

If someone has already encountered this problem and has a solution, I would be happy to get some help.

Thanks,
AV

Best Answer

  • Sarina
    Sarina Dataiker, Dataiku DSS Core Designer, Dataiku DSS Adv Designer Posts: 315 Dataiker
    Answer ✓
    Options

    Hi @Avalo
    ,

    There are likely several ways to accomplish this, but I'll provide one option using a Python recipe.

    Here I created a sample dataset like you provided in your screenshot:

    Screen Shot 2021-02-04 at 4.31.50 PM.png

    I created the following python recipe and utilized the pandas groupby in combination with the fillna option to forward fill the country values within the "device_id" group. The other fill options can be found here.

    import dataikuimport pandas as pd, numpy as npfrom dataiku import pandasutils as pdu# Read recipe inputssample_data = dataiku.Dataset("sample_data")sample_data_df = sample_data.get_dataframe()# -------------------------------------------------------------------------------- NOTEBOOK-CELL: CODE# Compute recipe outputs from inputspython_window_df = sample_data_df # For this sample code, simply copy input to output# -------------------------------------------------------------------------------- NOTEBOOK-CELL: CODE# this is the main transformation. It updates the "country" column to group by the device_id column and forward fill country values within that grouppython_window_df['country'] = python_window_df.groupby('device_id')['country'].fillna(method='ffill')# -------------------------------------------------------------------------------- NOTEBOOK-CELL: CODE# Write recipe outputsfilled = dataiku.Dataset("filled")filled.write_with_schema(python_window_df)

    Here's my output dataset, where the offending row you point out is now still NULL. This could be adjusted based on the fill method.  

    Screen Shot 2021-02-04 at 4.34.41 PM.png

    Hope that helps.

    Thank you,

    Sarina

Answers

  • Avalo
    Avalo Registered Posts: 2 ✭✭✭
    Options

    Thank you very much for your help.

    your solution is elegant and efficient.

    Thank you,
    AV

  • AshleyW
    AshleyW Dataiker, Alpha Tester, Dataiku DSS Core Designer, Registered, Product Ideas Manager Posts: 161 Dataiker
    Options

    Hi @Avalo
    ,

    You may also want to try the "Fill empty cells with previous/next value" processor that's available in the Prepare recipe. It will also let you do what you've described.

    Cheers,

    Ashley

Setup Info
    Tags
      Help me…