Check out the first Dataiku 8 Deep Dive focusing on Productivity on October 29th Read More

Creating recipes changing number of lines in dataset

Level 3
Creating recipes changing number of lines in dataset

Hello,

I ask for your help to create a recipe that would change the number of lines of my dataset. I wish to do the following operations but am enable to create them in dataiku :

My dataset looks like :

Item number

Date

Value

Item 1

Date 1

Value 1

Item 1

Date 2

Value 2

Item 1

Date 3

Value 3

Item 2

Date 1

Value 1

Item 2

Date 2

Value 2

 

The operations I aim to do are :

  • Calculate the average value of each item during each period of one month
  • Exclude the lines where the value has a discrepancy of more than x% with the average monthly value of this item
  • Recalculate average values based on the cleaned data

If anyone can take the time to advise me, I would be grateful!

Have a nice day

Olivier

 

0 Kudos
19 Replies
Level 5

Hi @OlivierW . I think I might help you with this, but I wanted to know first if you want to solve this problem using only Visual Recipes, or is OK to use a Code recipe (python for example)

0 Kudos
Level 3
Author

Hi Ignacio,

Thanks for your reply.

Ideally, I would prefer a solution using dataiku tools if possible, because I don't know Python langage. But if it is impossible, a code recipe would be great too.

Have a nice day,

Olivier

0 Kudos
Level 4

Hi @OlivierW 

There are couple of ways to do that.

As for your first point, you can use group recipe , but I think in your case better would be a "window recipe" since you want to remove the rows after calculating an average.

Attached png screen shot, on how to calculate Average for each item in each month.

Could you clarify your second point:

  • Exclude the lines where the value has a discrepancy of more than x% with the average monthly value of this item

What conditions you want to decide which rows to remove? Like for exmaple looking at "window3.png" I attached which rows you would like to remove in each month? Once we establish that we can just remove these rows and re-calculate an average with the window recipe once again.

For example, when the average is 3, you want to remove the rows where value is 30% higher than the average?

Regards,
Mateusz

Level 4

Attached screenshots...

Edit: I'm not sure you can open the screenshots, as I see "virus scanning in progress" all the time, in case you  can't please use this:

 https://we.tl/t-z56mQa5FzI  

Also I would love to hear what @Ignacio_Toledo  think about my way of thinking, maybe he will have a better approach, or can "upgrade" mine 🙂

Mateusz

Level 3
Author

Hi Mateusz,

Thank you for taking the time to look at my questions.

Unfortunatly, I was unable to open the link (IP adress not found) nor the png files. Can you submit them again?

Also, I think I made a mistake in my description of when I need to change the number of lines in my dataset. A logical way to proceed would be :

1) Add a column returning for each line the average value of the month it belongs to. This average value will be used in step 2) and 3) to establish which lines are too different from the monthly average.

2) Add a column returning for each line the discrepency between the value of the line versus the average value of the month

3) Filter the dataset to keep only the lines where the discrepency (calculated in step 3) is less than x %

4) Create a dataset with only one line per month containing the average value of the month. In step 4, we have a "clean" monthly average because values too different of the average have been filtered out in step 3.

I hope it makes sense.

Kind regards,

Olivier

0 Kudos
Level 4

window1.pngwindow2.pngwindow3.png

calc_diff.pngremove_rows_when.pnggroupBY.png

 

 

@OlivierW  I think that way you have additional column with the average value for each iteam in each month, without losing orginal value? this is what you are looking for? If yes, then we can use "Prepare recipe" and calculate difference between each value and the average and then remove these values, after that we can use "Group recipe" and group by "month" and "item" again, setting up aggregate value to "AVG" on this dataset.

 

As for point 1) If you want to calculate average value in a month and ignore item name, just remove item from the window recipe.

 

Let me know in case any questions

Mateusz

Well, just in case is useful for the future, I implemented @emate solution in a project that I'm uploading here, using a test dataset.

Level 5

Hi @OlivierW , @emate 

It took me longer to come back to the ticket than I expected. I think your solution @emate is great, specially if no use of code is prefered. Maybe you could share also the 'flow' to see how it looks? Unless for @OlivierW is not necessary.

It would be interesting to have access to more advanced functions when grouping or windowing data. What you are trying to do here is called "sigma clipped average" in astronomy data reduction, and is not a complicated formula. But it is not easy to implement with SQL, which is the only option available when using custom grouping.

Level 3
Author

Hello @emate , @Ignacio_Toledo ,

Thanks for your advice which was very useful. With the snapshots and the project, I was able to apply it to my dataset. I just add to extract year and month from the date, and use item / year / month for the partition before the calculation of the average. This way, I obtain a monthly average and not a daily one.

Now that I have carried out these steps, I realise I have a further question : given that I have a large number of items, it would be useful for me to calculate the drift of monthly average for each item, namely the variable :

monthly average value of current month - monthly average value of initial month. Could you advise me how to calculate it?

Thanks,

Olivier

0 Kudos
Level 4

@OlivierW

Can you show an example on how the dataset looks like now, and how you want it to look with the addition of this difference in average?

0 Kudos
Level 3
Author

Hello @emate ,

I show below (highlighted in yellow) the column I would like to create. The difficulty for me is that it is not a calculation on a given line of the dataset (for example colum A @ time t1 - column B @ time t1) but a calculation on a given column between 2 different times (the time of the current line and the time of the first data point for this item).

I hope the example is clear.

Olivier

Capture.JPG

0 Kudos
Level 4

@OlivierW 

I belive there are couple of ways to do it.

1) I think the fastest way is to use SparkSQL recipe

where_SQL.pnginput_syntax.png

SELECT `item`,
`year`,
`month`,
`elapsed`,
`average`,
`average` - FIRST_VALUE(average) OVER(
PARTITION BY `item`, `year`
ORDER BY `month` )
AS `difference` FROM `windows2` - where windows2 is your input table

2) Manually - I'm not proud of it, it's not a clean one, but it works in the end
 
I'm not sure what is the data type of your "Month" column, but my idea is to use Group recipe/or Window recipe, then "Join recipe" and finally "prepare recipe".
 
a) So first I group by item + year and in aggregation I set month and average = MIN
    that way I will end up for the Average value for the 1st (min) month within each       Iteam/Year.
groupBy.pngGroupByresutl.png
 
b) Join recipe - basically I use join to look up these single values (above) from group recipes to the orginal file's specific product and year combination 
join_recipe.pngselectedCols.png
join-res.png
c) Prepare recipe  - just substract column values with formula
 
FINAL.png
This is how the whole flow should look like :):
whole_flow_.png
 
 
Hope this works.... BTW if it will, please hit "ACCEPT AS A SOLUTION" button under this post 😄
 
Mateusz
Level 3
Author

Hi @emate 

Thanks again for taking the time to look at my question.

I tried the first solution, but got an error, maybe related to the fact that I had to choose a name for the output which dataiku then could not find?

I attach the screenshots of the steps I followed and the log of the error message below.

 

 

0 Kudos
Level 3
Author

and here is the remainder of the attachments

0 Kudos
Level 4

Hi @OlivierW 

I can think of two reasons, first is brackets or "blanks spaces" in a name of the columns. I had similar issue where i was getting erorrs, so I renamed column to elapsed_days and then SQL script worked just fine - so it might be that.

Other I dont know how you set the output exactly but I do that that way; input_SQL2.png

0 Kudos
Level 3
Author

Hello @emate 

I removed all spaces and brackets from variables, and used the same options as you for the output. I still get an error message. Don't worry though, I have the alternative solution of @Ignacio_Toledo for this step.

Kind regards,

Olivier

0 Kudos
Level 4

To be honest I have no idea why this SQL query is not working in your enviroment, I've tried it 2 times and it worked for me, since you have other solutions that works I will leave it as it is. And you can investigate this issue some other time, as sql queries comes in handy sometimes!

0 Kudos
Level 5

Hi @OlivierW , @emate 

I am uploading the second version of the project I shared yesterday, but now adding the steps to get the table with the change of the average item values as compared to the first month.

In this case, I arrive to the solution in a different way than @emate . And this is the great thing (but sometimes it might be a curse too, if one has bad memory) about the tools available in DSS, which allows you to get to the same answer in different ways.

Cheers!

 

Level 3
Author

Hi @Ignacio_Toledo ,

Thanks for your solution, it worked great!