Creating multiple lags
Hi everyone, I am trying to create a table with month and cost data (ie Item A costs $50 in Jan 2020), where if there is a null value in a given month it would take the previous month's cost information. Normally, I would use a lag, but that only allows me to reference the previous month's cost. If there are situations where it lags multiple months, I need to figure out how to fill a null value with the last month that had a value. Any ideas? See an example below.
Initial data set
Date: Item: Cost:
Jan 2020 A 10
March 2020 A 11
June 2020 A 12
New data set
Date: Item: Cost:
Jan 2020 A 10
Feb 2020 A 10
March 2020 A 11
April 2020 A 11
May 2020 A 11
June 2020 A 12
Best Answer
-
Hi, you can use the "Fill empty cells with previous/next value" processor of a prepare recipe. More info here: https://doc.dataiku.com/dss/latest/preparation/processors/up-down-fill.html
Answers
-
tgb417 Dataiku DSS Core Designer, Dataiku DSS & SQL, Dataiku DSS ML Practitioner, Dataiku DSS Core Concepts, Neuron 2020, Neuron, Registered, Dataiku Frontrunner Awards 2021 Finalist, Neuron 2021, Neuron 2022, Frontrunner 2022 Finalist, Frontrunner 2022 Winner, Dataiku Frontrunner Awards 2021 Participant, Frontrunner 2022 Participant, Neuron 2023 Posts: 1,598 Neuron
How is row order controlled or set with this processor?
... Time Passes ...
As I looked at this it appears that you have to setup a sort recipe prior to a visual recipie with the fill.
-
I have set it so that it is ordering by item, year and month. I think the fill with previous value should work if the data is ordered correctly.
-
Hi, a Sort recipe is the best way to set row orders. In fact, there are two different cases depending on the dataset's type:
- files: you need a sort recipe that will write down the file as a single sorted file (note that you can’t emulate that, if you save a sorted dataframe in a python recipe, DSS split the files to write more efficiently so each file will be sorted but the full dataset will not be)
- SQL: a sorting recipe is not really useful, you can just set the sort in the dataset properties so that DSS reads it with the appropriate ORDER BY clause
-
The dataset is stored as a file and I am currently using the sort recipe. Will this not apply the sort to the full data set permanently? If not, would I then just do a SQL function to do an order by instead?
-
Hi,
The sort recipe will always sort it's output dataset. However, for better read/write performance on files datasets, the order won't be preserved by default for datasets located further down in the flow. In order to preserve row order, you can still activate the "Preserve ordering" setting in those datasets (Dataset > Settings > Advanced). NB: as stated in the documentation here (https://doc.dataiku.com/dss/latest/connecting/ordering.html) preserving ordering may decrease performance.
On SQL datasets, you can just set the sort in the same "Preserve ordering" setting so that DSS reads it with the appropriate ORDER BY clause