Designing for relative time frames

wjkelly · August 2022

Hi,

I have datasets that I will ingest weekly and run some data manipulation that yields several outputs, including calculated fields that are meant to sum the "Last six months" of sales activity by brand or customer, for example.

Right now, the sales data is coming in with item details, sales date/time, customer details, etc. I've been calculating a "Sale Month" field, then using that field and a Pivot to create sales by month tables (months are the columns, customer and/or brand are the rows).

Then I have a calculation of "Six month spend" and "Six month orders placed". The Six month spend field is calculated by summing up the values in each of the past six full months of sales activity. The formula that's used to calculate this sum explicitly references the field names from the pivot. So, as I go through time, each month I'm going to have to redo those calculated fields, retiring one explicit field reference and adding another explicit field reference in the calculation.

I'm hoping there's a method that allows me to either reference fields in some dynamic way, or allows me to keep the pivot field names and whittle down the fields based on the relative relationship of the time data in each purchase record.

Any good methodologies out there for something like this?

Operating system used: MacOS Monterey v12.4

JordanB · August 2022

Hi @wjkelly
,

Thanks for writing in! I have created a test project based off what you have described here. I am using products as rows and months as my columns (1_total_sum to 12_total_sum). Each product has a monthly purchase total.

Screen Shot 2022-08-05 at 10.36.40 AM.png

If I understand correctly, you are using a formula in a Prepare recipe to total the past 6 month spend and orders placed. For that, you are writing out the column names in a sum() formula. Your goal is to be able to use this formula for future datasets?

If that's the case, I would recommend considering a couple of options. First, the Pivot recipe provides an option to simplify the column output. If you use a numbering system for the month columns, you should be able to work off one formula for your calculated fields.

Screen Shot 2022-08-05 at 10.47.20 AM.png

Second, you may rename your columns in a Prepare recipe step prior to the formula step. If you rename your columns with a consistent naming system, you should be able to use the same formula each time.

Please let me know whether I have properly addressed your concerns.

Thanks!

Jordan

wjkelly · August 2022

Hi Jordan8,

Thanks for the recommendation. I think the challenge I'm having is how to write a formula that is dynamic in how it references the columns. It seems a little like an INDIRECT() formula call in excel -- in Dataiku, how do I do a multi-column sum and reference the columns dynamically?

I'd imagine something like:

sum("month_" & X+1, "month_" & X+2, etc) where the column names are month_1, month_2, month_3, etc. and X is a starting point, maybe 0.

Does this make sense?

JordanB · August 2022

Hi @wjkelly
,

To accomplish what I believe you're trying to do, I highly recommend using a Python recipe rather than a formula. With the Pandas package, you can reference dataset columns dynamically (by column index), as shown below. You can define new columns and perform mathematical operations as well. There is a lot more flexibility with Python and you can find tons of examples online on performing operations on datasets.

import dataikuimport pandas as pd, numpy as npfrom dataiku import pandasutils as pdu# Read recipe inputsfood_order = dataiku.Dataset("food_order")df = food_order_prepared.get_dataframe()col_list = list(df) # get all columnsdf['Sum_all_cols'] = df[col_list].sum(axis=1) #sum all columns# Select columns 1 and 2 to sum Using DataFrame.iloc[]df['Sum_2_cols']=df.iloc[:,[1,2]].sum(axis=1)# Select columns 1 and 2 to sum and multiplydf['2_col_operations'] = df.iloc[:,[1,2]].sum(axis=1).mul(2)python_prepare_sum_df = df # For this sample code, simply copy input to output# Write recipe outputspython_prepare_sum = dataiku.Dataset("python_prepare_sum")python_prepare_sum.write_with_schema(python_prepare_sum_df)

Output:

Screen Shot 2022-08-22 at 5.50.45 PM.png

Please let me know if you have any questions about this.

Thanks!

Jordan

Designing for relative time frames

Answers

Categories

Setup Info

Tags