Creating multiple lags

sjen10 · August 2020

Hi everyone, I am trying to create a table with month and cost data (ie Item A costs $50 in Jan 2020), where if there is a null value in a given month it would take the previous month's cost information. Normally, I would use a lag, but that only allows me to reference the previous month's cost. If there are situations where it lags multiple months, I need to figure out how to fill a null value with the last month that had a value. Any ideas? See an example below.

Initial data set

Date: Item: Cost:

Jan 2020 A 10

March 2020 A 11

June 2020 A 12

New data set

Date: Item: Cost:

Jan 2020 A 10

Feb 2020 A 10

March 2020 A 11

April 2020 A 11

May 2020 A 11

June 2020 A 12

AlexandreL · August 2020

Hi, you can use the "Fill empty cells with previous/next value" processor of a prepare recipe. More info here: https://doc.dataiku.com/dss/latest/preparation/processors/up-down-fill.html

tgb417 · August 2020

@AlexandreL

How is row order controlled or set with this processor?

... Time Passes ...

As I looked at this it appears that you have to setup a sort recipe prior to a visual recipie with the fill.

sjen10 · August 2020

I have set it so that it is ordering by item, year and month. I think the fill with previous value should work if the data is ordered correctly.

AlexandreL · August 2020

Hi, a Sort recipe is the best way to set row orders. In fact, there are two different cases depending on the dataset's type:

files: you need a sort recipe that will write down the file as a single sorted file (note that you can’t emulate that, if you save a sorted dataframe in a python recipe, DSS split the files to write more efficiently so each file will be sorted but the full dataset will not be)
SQL: a sorting recipe is not really useful, you can just set the sort in the dataset properties so that DSS reads it with the appropriate ORDER BY clause

sjen10 · August 2020

The dataset is stored as a file and I am currently using the sort recipe. Will this not apply the sort to the full data set permanently? If not, would I then just do a SQL function to do an order by instead?

AlexandreL · August 2020

Hi,

The sort recipe will always sort it's output dataset. However, for better read/write performance on files datasets, the order won't be preserved by default for datasets located further down in the flow. In order to preserve row order, you can still activate the "Preserve ordering" setting in those datasets (Dataset > Settings > Advanced). NB: as stated in the documentation here (https://doc.dataiku.com/dss/latest/connecting/ordering.html) preserving ordering may decrease performance.

On SQL datasets, you can just set the sort in the same "Preserve ordering" setting so that DSS reads it with the appropriate ORDER BY clause

Creating multiple lags

Best Answer

Answers

Categories

Setup Info

Tags