SUMIF from Excel in DataIku

daniellemetz
daniellemetz Partner, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 3 Partner

I'm in the process of translating Excel to DataIku.

I'm struggling thinking of a straightforward way to translate SUMIFs in DataIku. Since there are many SUMIFs in the same table, I would like to stay away from having many group recipes after another.

It would be ideal if I was able to look through an entire column with the formula processor, so that I can use an if statement to make my own SUMIF structure.

Does anyone know if this is possible? Or is there another straightforward way?

Tagged:

Answers

  • CoreyS
    CoreyS Dataiker Alumni, Dataiku DSS Core Designer, Dataiku DSS Core Concepts, Registered Posts: 1,150 ✭✭✭✭✭✭✭✭✭

    Hi @daniellemetz
    welcome to the Dataiku Community. While you wait for a more detailed response, here are some resources you may find helpful:

    I hope this helps!

  • tgb417
    tgb417 Dataiku DSS Core Designer, Dataiku DSS & SQL, Dataiku DSS ML Practitioner, Dataiku DSS Core Concepts, Neuron 2020, Neuron, Registered, Dataiku Frontrunner Awards 2021 Finalist, Neuron 2021, Neuron 2022, Frontrunner 2022 Finalist, Frontrunner 2022 Winner, Dataiku Frontrunner Awards 2021 Participant, Frontrunner 2022 Participant, Neuron 2023 Posts: 1,598 Neuron

    @daniellemetz
    ,

    welcome to the Dataiku community, we are glad you are here.

    When reading your question, I wondered two things:

    • is your data in a SQL database?
    • do you know how to Code SQL?

    if yes, i’m wondering if some SQL could be written with a set of Common Table Expressions (CTE). I tend to use these when trying to do multiple aggregations on different columns. Although, that could also be somewhat complicated, (not as easy to implement as =SUMIF). I’m sure that my python colleagues here, have some ways to do the same thing with Pandas. Others please jump in with your ideas. I’m curious what others come up with.

    It would be helpful to see a simplified version of the problem you are trying to solve.

  • daniellemetz
    daniellemetz Partner, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 3 Partner

    Hi Tom,

    Currently I'm translating the aggregations to Pandas, since I'm more comfortable with that than SQL. As you said it's quite a complicated solution so I'm still keen to hear if there is a more elegant way in DataIku.

    To try and explain the problem in a simplified way: there are two columns, one column contains all unique values in a set and the other column contains a subset of that set (so certain values appear multiple times, certain value only ones, and other values don't appear at all). The second column has a third column next to it containing amounts of money. I would like to sum the amounts of money, corresponding to the category in the second column, to the correct value in the first column. I understand that this can be done using a group recipe with a sum aggregation, but in this excel file there are 33 columns with one or multiple SUMIF calculations that would each require their own Group recipe.

  • tgb417
    tgb417 Dataiku DSS Core Designer, Dataiku DSS & SQL, Dataiku DSS ML Practitioner, Dataiku DSS Core Concepts, Neuron 2020, Neuron, Registered, Dataiku Frontrunner Awards 2021 Finalist, Neuron 2021, Neuron 2022, Frontrunner 2022 Finalist, Frontrunner 2022 Winner, Dataiku Frontrunner Awards 2021 Participant, Frontrunner 2022 Participant, Neuron 2023 Posts: 1,598 Neuron

    @daniellemetz

    When I hear about data sets that have multiple columns that need handling like this. In a case where similar or the same calculations are needed on an undefined set of columns that may change over time. I often think about changing data from a wide format to a long/tall format.

    https://www.statology.org/long-vs-wide-data/

    This is something I will often do with data like you are working with. Once the data is in a tall format the window functions of Dataiku can do the grouped calculations. If you need the data back in a Wide-format again to share with others then, redo the wide format. There are a bunch of visual recipe steps to do this kind of thing. zip and unzip, fold and unfold, and pivot. (Each attaches these challenges in slightly different ways.) Take a look at these to see if they might be of help.

    Here is some stuff you can use with Pandas to do these wide to long conversions:

    https://pandas.pydata.org/docs/reference/api/pandas.wide_to_long.html

    However, the built in visual recipes and then window or group functions can do these kinds of calculations as well.

    Give these ideas a shot and good luck. Let us all know how you get on with this.

  • daniellemetz
    daniellemetz Partner, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 3 Partner

    Hi Tom,

    Thanks for thinking along. In the end I wrote this function to simulate the SUMIF function in Excel. I'll put it here in case others find this Discussion and would like to use it. I am not the best programmer so if anyone has suggestions for improvement, feel free to leave a reply.

    def sumif(df, criteria_range, criteria, sum_range): # The df is the dataframe in which criteria_range and sum_range reside, all other variables are the same as the SUMIF function in excel for a whole column.

    # Group the sum_range on the criteria_range
    group = df.groupby(criteria_range, as_index=False)[sum_range].sum()

    # Change the column name of criteria_range to the column name of criteria, this is needed for the merge
    column_name = criteria.columns[0]
    group = group.rename({criteria_range: column_name}, axis=1)

    # Merge (left join) so that the sum_range is now in the order of criteria
    merged_df = criteria.merge(group, how='left', on=column_name)

    # Empty values are currently Nan, change this to 0 since we might need them for multiplication
    merged_df[sum_range] = merged_df[sum_range].fillna(0)

    # Return only column sum_range, it is now in the order of criteria
    return merged_df[sum_range]

Setup Info
    Tags
      Help me…