SUMIF from Excel in DataIku

daniellemetz
Level 2
SUMIF from Excel in DataIku

I'm in the process of translating Excel to DataIku.

I'm struggling thinking of a straightforward way to translate SUMIFs in DataIku. Since there are many SUMIFs in the same table, I would like to stay away from having many group recipes after another.

It would be ideal if I was able to look through an entire column with the formula processor, so that I can use an if statement to make my own SUMIF structure.

Does anyone know if this is possible? Or is there another straightforward way?

0 Kudos
5 Replies
CoreyS
Dataiker Alumni

Hi @daniellemetz welcome to the Dataiku Community. While you wait for a more detailed response, here are some resources you may find helpful:

I hope this helps!

Looking for more resources to help you use Dataiku effectively and upskill your knowledge? Check out these great resources: Dataiku Academy | Documentation | Knowledge Base

A reply answered your question? Mark as โ€˜Accepted Solutionโ€™ to help others like you!
0 Kudos
tgb417

@daniellemetz ,

welcome to the Dataiku community, we are glad you are here.

When reading your question, I wondered two things:

  • is your data in a SQL database?
  • do you know how to Code SQL?

if yes, iโ€™m wondering if some SQL could be written with a set of Common Table Expressions (CTE).  I tend to use these when trying to do multiple aggregations on different columns.  Although, that could also be somewhat complicated,  (not as easy to implement as =SUMIF).  Iโ€™m sure that my python colleagues here, have some ways to do the same thing with Pandas.  Others please jump in with your ideas.  Iโ€™m curious what others come up with.

 

It would be helpful to see a simplified version of the problem you are trying to solve. 

--Tom
daniellemetz
Level 2
Author

Hi Tom,

Currently I'm translating the aggregations to Pandas, since I'm more comfortable with that than SQL. As you said it's quite a complicated solution so I'm still keen to hear if there is a more elegant way in DataIku.

To try and explain the problem in a simplified way: there are two columns, one column contains all unique values in a set and the other column contains a subset of that set (so certain values appear multiple times, certain value only ones, and other values don't appear at all). The second column has a third column next to it containing amounts of money. I would like to sum the amounts of money, corresponding to the category in the second column, to the correct value in the first column. I understand that this can be done using a group recipe with a sum aggregation, but in this excel file there are 33 columns with one or multiple SUMIF calculations that would each require their own Group recipe. 

0 Kudos
tgb417

@daniellemetz 

When I hear about data sets that have multiple columns that need handling like this. In a case where similar or the same calculations are needed on an undefined set of columns that may change over time. I often think about changing data from a wide format to a long/tall format. 

https://www.statology.org/long-vs-wide-data/

This is something I will often do with data like you are working with.   Once the data is in a tall format the window functions of Dataiku can do the grouped calculations.  If you need the data back in a Wide-format again to share with others then, redo the wide format.  There are a bunch of visual recipe steps to do this kind of thing.  zip and unzip, fold and unfold, and pivot.  (Each attaches these challenges in slightly different ways.)  Take a look at these to see if they might be of help.

Here is some stuff you can use with Pandas to do these wide to long conversions:

https://pandas.pydata.org/docs/reference/api/pandas.wide_to_long.html

However, the built in visual recipes and then window or group functions can do these kinds of calculations as well.  

Give these ideas a shot and good luck.  Let us all know how you get on with this.

 

--Tom
0 Kudos
daniellemetz
Level 2
Author

Hi Tom,

Thanks for thinking along. In the end I wrote this function to simulate the SUMIF function in Excel. I'll put it here in case others find this Discussion and would like to use it. I am not the best programmer so if anyone has suggestions for improvement, feel free to leave a reply.

def sumif(df, criteria_range, criteria, sum_range): # The df is the dataframe in which criteria_range and sum_range reside, all other variables are the same as the SUMIF function in excel for a whole column.

    # Group the sum_range on the criteria_range
    group = df.groupby(criteria_range, as_index=False)[sum_range].sum()
   
    # Change the column name of criteria_range to the column name of criteria, this is needed for the merge
    column_name = criteria.columns[0]
    group = group.rename({criteria_range: column_name}, axis=1)
   
    # Merge (left join) so that the sum_range is now in the order of criteria
    merged_df = criteria.merge(group, how='left', on=column_name)
   
    # Empty values are currently Nan, change this to 0 since we might need them for multiplication
    merged_df[sum_range] = merged_df[sum_range].fillna(0)
   
    # Return only column sum_range, it is now in the order of criteria
    return merged_df[sum_range]