How to do a multi-row formula or similar process
Hi,
I am attempting to get the below input data into the following output format:
INPUT
(row1) Food type | Cost |
(row2) Dog Food Section Header | null |
(row3) Dog Food 1 | $10 |
(row4) Dog Food 2 | $12 |
(row5) Dog Food 3 | $14 |
(row6) Cat Food Section Header | null |
(row7) Cat Food 1 | $11 |
(row8) Cat Food 2 | $13 |
(row9) Cat Food 3 | $15 |
DESIRED OUTPUT
(row1) Animal Type | Food type | Cost |
(row2) Dog | Dog Food 1 | $10 |
(row3) Dog | Dog Food 2 | $12 |
(row4) Dog | Dog Food 3 | $14 |
(row5) Cat | Cat Food 1 | $11 |
(row6) Cat | Cat Food 2 | $13 |
(row7) Cat | Cat Food 3 | $15 |
Each food type is its own row. I looked throughout the community but could not find an answer. Thanks for your help - new to Dataiku.
Operating system used: Windows
Answers
-
tgb417 Dataiku DSS Core Designer, Dataiku DSS & SQL, Dataiku DSS ML Practitioner, Dataiku DSS Core Concepts, Neuron 2020, Neuron, Registered, Dataiku Frontrunner Awards 2021 Finalist, Neuron 2021, Neuron 2022, Frontrunner 2022 Finalist, Frontrunner 2022 Winner, Dataiku Frontrunner Awards 2021 Participant, Frontrunner 2022 Participant, Neuron 2023 Posts: 1,598 Neuron
Welcome to the Dataiku Community.
In looking at your INPUT table I see some rows with 1 column and some with 2.
Rows 1,2,3, 5,6,7 look to have the 2nd and third columns of your output data.
In a Visual recipe, you could filter your rows to only rows that have the second column filled in on your sample input data.
That said I'm not clear what indicates the first column of your output set. Is it:
- That the row has the word Dog or Cat in it somewhere?
- Or are those rows that do not have price values the data element that indicates Dog or cat food? Row 0 and 4.
Depending on the answer to this question. Your may first need to sort out the values for your first column.
In the first case either split the first column or regex, of formula with a search could pull the words Dog and Cat out of your rows with data.
In the second case:
- You likely need to look for rows missing the second column.
- And then you need to bring these values to a new column.
- Then in this new column truncate the word food from those rows.
- And then use the visual recipe step "Fill empty cells with previous/next value" to fill down until the next row that has a value.
Hope this helps just a bit. Let us know how you are getting on with sorting this data.
--Tom
-
mluu Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 4 ✭✭✭
Thanks for the quick and detailed response. I clarified my original post, but based on your description I perceive it is the second scenario you described. I want to remove "Dog/Cat Food Section Header" and make it a column so that it becomes a column a user can filter on in the output (Excel).
I also perceive the function I was looking for was the "Fill empty cells with previous/next value" function you mentioned. Thanks. I will try it out.
-
tgb417 Dataiku DSS Core Designer, Dataiku DSS & SQL, Dataiku DSS ML Practitioner, Dataiku DSS Core Concepts, Neuron 2020, Neuron, Registered, Dataiku Frontrunner Awards 2021 Finalist, Neuron 2021, Neuron 2022, Frontrunner 2022 Finalist, Frontrunner 2022 Winner, Dataiku Frontrunner Awards 2021 Participant, Frontrunner 2022 Participant, Neuron 2023 Posts: 1,598 Neuron
Excellent to hear that this has been helpful.
Let us know how you get on with your project.
--Tom
-
mluu Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 4 ✭✭✭
I have a follow up that might be clearer to understand. In the "Example problem" screenshot, I want to copy "Production" in the new energy type down for all of the "TRUE" values until it hits "Imports". Then, I want it to copy down "Imports" for all of the "TRUE" values until it hits "Exports", etc. In short, I am attempting to create a summarized energy type field to make it easier to filter on.
Is there a way to do this using a visual recipe (i.e. no python code)?
If helpful, I included my original formula to create the new energy type column.
-
tgb417 Dataiku DSS Core Designer, Dataiku DSS & SQL, Dataiku DSS ML Practitioner, Dataiku DSS Core Concepts, Neuron 2020, Neuron, Registered, Dataiku Frontrunner Awards 2021 Finalist, Neuron 2021, Neuron 2022, Frontrunner 2022 Finalist, Frontrunner 2022 Winner, Dataiku Frontrunner Awards 2021 Participant, Frontrunner 2022 Participant, Neuron 2023 Posts: 1,598 Neuron
I’m writing from an iPad. I don’t have assess to DSS right at the moment. I’m doing this from memory so I may get a detail or two off by a little bit.
You are doing much of what I would do. In general, I would look to see if the energy production column is empty, and then grab what is in energy type field and put it into a new column as you are doing. There may be an isBlank function that will help with this process.
Then rather than putting True in the new column. I’d likely try to leave the column blank. I suspect that the fill down feature expects blank cells in the column to fill down. You may have to use the special value null in your new column in order to make this work. When this is working you should see a blank set of rows in your new column. Separated by your groupings titles. Then I’d run the fill down process as we have discussed.
Some other hints that might be helpful. DSS formula language is case sensitive. (This always gets in my way.)
There is also a display feature in DSS that will show highlighted space or tab characters. These hidden “white spaces” characters can cause problems with these tasks on messy data.
Alternately, you could use the formula you already have and add a step to replace True with an empty string “” or null .
Then you will likely to use the formula step that removes rows with missing values in the energy production column.
Happy holidays.—Tom
-
Jurre Dataiku DSS Core Designer, Dataiku DSS & SQL, Dataiku DSS Core Concepts, Registered, Dataiku DSS Developer, Neuron 2022 Posts: 115 ✭✭✭✭✭✭✭
hi,
A slightly different approach can be :
- split column “Energy description on “:”, truncate to 2 columns, the first to be deleted later
- formula-processor on the second column with output in that same column, which checks for empty values and fills them with the value in “Energy type”. As original values in this column get mentioned you can strip extra spaces here aswell. For example : if(isNull(val("<column_name>")), val("<column_name>"), trim(val("<column_name>")))
- rename columns and delete first result-column from split
The given formula for checking if “Energy” is mentioned in “Energy description” can be added to the mentioned formula-step as these can be nested.
EDIT : link to the documentation on formula language