Sign up to take part
Registered users can ask their own questions, contribute to discussions, and be part of the Community!
I have datasets that I will ingest weekly and run some data manipulation that yields several outputs, including calculated fields that are meant to sum the "Last six months" of sales activity by brand or customer, for example.
Right now, the sales data is coming in with item details, sales date/time, customer details, etc. I've been calculating a "Sale Month" field, then using that field and a Pivot to create sales by month tables (months are the columns, customer and/or brand are the rows).
Then I have a calculation of "Six month spend" and "Six month orders placed". The Six month spend field is calculated by summing up the values in each of the past six full months of sales activity. The formula that's used to calculate this sum explicitly references the field names from the pivot. So, as I go through time, each month I'm going to have to redo those calculated fields, retiring one explicit field reference and adding another explicit field reference in the calculation.
I'm hoping there's a method that allows me to either reference fields in some dynamic way, or allows me to keep the pivot field names and whittle down the fields based on the relative relationship of the time data in each purchase record.
Any good methodologies out there for something like this?
Operating system used: MacOS Monterey v12.4
Thanks for writing in! I have created a test project based off what you have described here. I am using products as rows and months as my columns (1_total_sum to 12_total_sum). Each product has a monthly purchase total.
If I understand correctly, you are using a formula in a Prepare recipe to total the past 6 month spend and orders placed. For that, you are writing out the column names in a sum() formula. Your goal is to be able to use this formula for future datasets?
If that's the case, I would recommend considering a couple of options. First, the Pivot recipe provides an option to simplify the column output. If you use a numbering system for the month columns, you should be able to work off one formula for your calculated fields.
Second, you may rename your columns in a Prepare recipe step prior to the formula step. If you rename your columns with a consistent naming system, you should be able to use the same formula each time.
Please let me know whether I have properly addressed your concerns.
To accomplish what I believe you're trying to do, I highly recommend using a Python recipe rather than a formula. With the Pandas package, you can reference dataset columns dynamically (by column index), as shown below. You can define new columns and perform mathematical operations as well. There is a lot more flexibility with Python and you can find tons of examples online on performing operations on datasets.
import dataiku import pandas as pd, numpy as np from dataiku import pandasutils as pdu # Read recipe inputs food_order = dataiku.Dataset("food_order") df = food_order_prepared.get_dataframe() col_list = list(df) # get all columns df['Sum_all_cols'] = df[col_list].sum(axis=1) #sum all columns # Select columns 1 and 2 to sum Using DataFrame.iloc df['Sum_2_cols']=df.iloc[:,[1,2]].sum(axis=1) # Select columns 1 and 2 to sum and multiply df['2_col_operations'] = df.iloc[:,[1,2]].sum(axis=1).mul(2) python_prepare_sum_df = df # For this sample code, simply copy input to output # Write recipe outputs python_prepare_sum = dataiku.Dataset("python_prepare_sum") python_prepare_sum.write_with_schema(python_prepare_sum_df)
Please let me know if you have any questions about this.
Thanks for the recommendation. I think the challenge I'm having is how to write a formula that is dynamic in how it references the columns. It seems a little like an INDIRECT() formula call in excel -- in Dataiku, how do I do a multi-column sum and reference the columns dynamically?
I'd imagine something like:
sum("month_" & X+1, "month_" & X+2, etc) where the column names are month_1, month_2, month_3, etc. and X is a starting point, maybe 0.
Does this make sense?