Creating recipes changing number of lines in dataset

OlivierW
OlivierW Registered Posts: 22 ✭✭✭✭

Hello,

I ask for your help to create a recipe that would change the number of lines of my dataset. I wish to do the following operations but am enable to create them in dataiku :

My dataset looks like :

Item number

Date

Value

Item 1

Date 1

Value 1

Item 1

Date 2

Value 2

Item 1

Date 3

Value 3

Item 2

Date 1

Value 1

Item 2

Date 2

Value 2

The operations I aim to do are :

  • Calculate the average value of each item during each period of one month
  • Exclude the lines where the value has a discrepancy of more than x% with the average monthly value of this item
  • Recalculate average values based on the cleaned data

If anyone can take the time to advise me, I would be grateful!

Have a nice day

Olivier

Best Answers

  • Mateusz
    Mateusz Dataiku DSS Core Designer, Neuron 2020, Registered, Neuron 2021, Neuron 2022 Posts: 91 ✭✭✭✭✭✭
    Answer ✓

    window1.pngwindow2.pngwindow3.png

    calc_diff.pngremove_rows_when.pnggroupBY.png

    @OlivierW
    I think that way you have additional column with the average value for each iteam in each month, without losing orginal value? this is what you are looking for? If yes, then we can use "Prepare recipe" and calculate difference between each value and the average and then remove these values, after that we can use "Group recipe" and group by "month" and "item" again, setting up aggregate value to "AVG" on this dataset.

    As for point 1) If you want to calculate average value in a month and ignore item name, just remove item from the window recipe.

    Let me know in case any questions

    Mateusz

  • Ignacio_Toledo
    Ignacio_Toledo Dataiku DSS Core Designer, Dataiku DSS Core Concepts, Neuron 2020, Neuron, Registered, Dataiku Frontrunner Awards 2021 Finalist, Neuron 2021, Neuron 2022, Frontrunner 2022 Finalist, Frontrunner 2022 Winner, Dataiku Frontrunner Awards 2021 Participant, Frontrunner 2022 Participant, Neuron 2023 Posts: 415 Neuron
    Answer ✓

    Well, just in case is useful for the future, I implemented @emate
    solution in a project that I'm uploading here, using a test dataset.

  • OlivierW
    OlivierW Registered Posts: 22 ✭✭✭✭
    Answer ✓

    Hi @Ignacio_Toledo
    ,

    Thanks for your solution, it worked great!

Answers

  • Ignacio_Toledo
    Ignacio_Toledo Dataiku DSS Core Designer, Dataiku DSS Core Concepts, Neuron 2020, Neuron, Registered, Dataiku Frontrunner Awards 2021 Finalist, Neuron 2021, Neuron 2022, Frontrunner 2022 Finalist, Frontrunner 2022 Winner, Dataiku Frontrunner Awards 2021 Participant, Frontrunner 2022 Participant, Neuron 2023 Posts: 415 Neuron

    Hi @OlivierW
    . I think I might help you with this, but I wanted to know first if you want to solve this problem using only Visual Recipes, or is OK to use a Code recipe (python for example)

  • OlivierW
    OlivierW Registered Posts: 22 ✭✭✭✭

    Hi Ignacio,

    Thanks for your reply.

    Ideally, I would prefer a solution using dataiku tools if possible, because I don't know Python langage. But if it is impossible, a code recipe would be great too.

    Have a nice day,

    Olivier

  • Mateusz
    Mateusz Dataiku DSS Core Designer, Neuron 2020, Registered, Neuron 2021, Neuron 2022 Posts: 91 ✭✭✭✭✭✭

    Hi @OlivierW

    There are couple of ways to do that.

    As for your first point, you can use group recipe , but I think in your case better would be a "window recipe" since you want to remove the rows after calculating an average.

    Attached png screen shot, on how to calculate Average for each item in each month.

    Could you clarify your second point:

    • Exclude the lines where the value has a discrepancy of more than x% with the average monthly value of this item

    What conditions you want to decide which rows to remove? Like for exmaple looking at "window3.png" I attached which rows you would like to remove in each month? Once we establish that we can just remove these rows and re-calculate an average with the window recipe once again.

    For example, when the average is 3, you want to remove the rows where value is 30% higher than the average?

    Regards,
    Mateusz

  • Mateusz
    Mateusz Dataiku DSS Core Designer, Neuron 2020, Registered, Neuron 2021, Neuron 2022 Posts: 91 ✭✭✭✭✭✭

    Attached screenshots...

    Edit: I'm not sure you can open the screenshots, as I see "virus scanning in progress" all the time, in case you can't please use this:

    https://we.tl/t-z56mQa5FzI

    Also I would love to hear what @Ignacio_Toledo
    think about my way of thinking, maybe he will have a better approach, or can "upgrade" mine

    Mateusz

  • OlivierW
    OlivierW Registered Posts: 22 ✭✭✭✭

    Hi Mateusz,

    Thank you for taking the time to look at my questions.

    Unfortunatly, I was unable to open the link (IP adress not found) nor the png files. Can you submit them again?

    Also, I think I made a mistake in my description of when I need to change the number of lines in my dataset. A logical way to proceed would be :

    1) Add a column returning for each line the average value of the month it belongs to. This average value will be used in step 2) and 3) to establish which lines are too different from the monthly average.

    2) Add a column returning for each line the discrepency between the value of the line versus the average value of the month

    3) Filter the dataset to keep only the lines where the discrepency (calculated in step 3) is less than x %

    4) Create a dataset with only one line per month containing the average value of the month. In step 4, we have a "clean" monthly average because values too different of the average have been filtered out in step 3.

    I hope it makes sense.

    Kind regards,

    Olivier

  • Ignacio_Toledo
    Ignacio_Toledo Dataiku DSS Core Designer, Dataiku DSS Core Concepts, Neuron 2020, Neuron, Registered, Dataiku Frontrunner Awards 2021 Finalist, Neuron 2021, Neuron 2022, Frontrunner 2022 Finalist, Frontrunner 2022 Winner, Dataiku Frontrunner Awards 2021 Participant, Frontrunner 2022 Participant, Neuron 2023 Posts: 415 Neuron

    Hi @OlivierW
    , @emate

    It took me longer to come back to the ticket than I expected. I think your solution @emate
    is great, specially if no use of code is prefered. Maybe you could share also the 'flow' to see how it looks? Unless for @OlivierW
    is not necessary.

    It would be interesting to have access to more advanced functions when grouping or windowing data. What you are trying to do here is called "sigma clipped average" in astronomy data reduction, and is not a complicated formula. But it is not easy to implement with SQL, which is the only option available when using custom grouping.

  • OlivierW
    OlivierW Registered Posts: 22 ✭✭✭✭

    Hello @emate
    , @Ignacio_Toledo
    ,

    Thanks for your advice which was very useful. With the snapshots and the project, I was able to apply it to my dataset. I just add to extract year and month from the date, and use item / year / month for the partition before the calculation of the average. This way, I obtain a monthly average and not a daily one.

    Now that I have carried out these steps, I realise I have a further question : given that I have a large number of items, it would be useful for me to calculate the drift of monthly average for each item, namely the variable :

    monthly average value of current month - monthly average value of initial month. Could you advise me how to calculate it?

    Thanks,

    Olivier

  • Mateusz
    Mateusz Dataiku DSS Core Designer, Neuron 2020, Registered, Neuron 2021, Neuron 2022 Posts: 91 ✭✭✭✭✭✭

    @OlivierW


    Can you show an example on how the dataset looks like now, and how you want it to look with the addition of this difference in average?

  • OlivierW
    OlivierW Registered Posts: 22 ✭✭✭✭

    Hello @emate
    ,

    I show below (highlighted in yellow) the column I would like to create. The difficulty for me is that it is not a calculation on a given line of the dataset (for example colum A @ time t1 - column B @ time t1) but a calculation on a given column between 2 different times (the time of the current line and the time of the first data point for this item).

    I hope the example is clear.

    Olivier

    Capture.JPG

  • Mateusz
    Mateusz Dataiku DSS Core Designer, Neuron 2020, Registered, Neuron 2021, Neuron 2022 Posts: 91 ✭✭✭✭✭✭

    @OlivierW


    I belive there are couple of ways to do it.

    1) I think the fastest way is to use SparkSQL recipe

    where_SQL.pnginput_syntax.png

    SELECT `item`,
    `year`,
    `month`,
    `elapsed`,
    `average`,
    `average` - FIRST_VALUE(average) OVER(
    PARTITION BY `item`, `year`
    ORDER BY `month` )
    AS `difference` FROM `windows2` - where windows2 is your input table

    2) Manually - I'm not proud of it, it's not a clean one, but it works in the end
    I'm not sure what is the data type of your "Month" column, but my idea is to use Group recipe/or Window recipe, then "Join recipe" and finally "prepare recipe".
    a) So first I group by item + year and in aggregation I set month and average = MIN
    that way I will end up for the Average value for the 1st (min) month within each Iteam/Year.
    groupBy.pngGroupByresutl.png
    b) Join recipe - basically I use join to look up these single values (above) from group recipes to the orginal file's specific product and year combination
    join_recipe.pngselectedCols.png
    join-res.png
    c) Prepare recipe - just substract column values with formula
    FINAL.png
    This is how the whole flow should look like :):
    whole_flow_.png
    Hope this works.... BTW if it will, please hit "ACCEPT AS A SOLUTION" button under this post
    Mateusz
  • Ignacio_Toledo
    Ignacio_Toledo Dataiku DSS Core Designer, Dataiku DSS Core Concepts, Neuron 2020, Neuron, Registered, Dataiku Frontrunner Awards 2021 Finalist, Neuron 2021, Neuron 2022, Frontrunner 2022 Finalist, Frontrunner 2022 Winner, Dataiku Frontrunner Awards 2021 Participant, Frontrunner 2022 Participant, Neuron 2023 Posts: 415 Neuron

    Hi @OlivierW
    , @emate

    I am uploading the second version of the project I shared yesterday, but now adding the steps to get the table with the change of the average item values as compared to the first month.

    In this case, I arrive to the solution in a different way than @emate
    . And this is the great thing (but sometimes it might be a curse too, if one has bad memory) about the tools available in DSS, which allows you to get to the same answer in different ways.

    Cheers!

  • OlivierW
    OlivierW Registered Posts: 22 ✭✭✭✭

    Hi @emate

    Thanks again for taking the time to look at my question.

    I tried the first solution, but got an error, maybe related to the fact that I had to choose a name for the output which dataiku then could not find?

    I attach the screenshots of the steps I followed and the log of the error message below.

  • OlivierW
    OlivierW Registered Posts: 22 ✭✭✭✭

    and here is the remainder of the attachments

  • Mateusz
    Mateusz Dataiku DSS Core Designer, Neuron 2020, Registered, Neuron 2021, Neuron 2022 Posts: 91 ✭✭✭✭✭✭

    Hi @OlivierW

    I can think of two reasons, first is brackets or "blanks spaces" in a name of the columns. I had similar issue where i was getting erorrs, so I renamed column to elapsed_days and then SQL script worked just fine - so it might be that.

    Other I dont know how you set the output exactly but I do that that way; input_SQL2.png

  • OlivierW
    OlivierW Registered Posts: 22 ✭✭✭✭

    Hello @emate

    I removed all spaces and brackets from variables, and used the same options as you for the output. I still get an error message. Don't worry though, I have the alternative solution of @Ignacio_Toledo
    for this step.

    Kind regards,

    Olivier

  • Mateusz
    Mateusz Dataiku DSS Core Designer, Neuron 2020, Registered, Neuron 2021, Neuron 2022 Posts: 91 ✭✭✭✭✭✭

    To be honest I have no idea why this SQL query is not working in your enviroment, I've tried it 2 times and it worked for me, since you have other solutions that works I will leave it as it is. And you can investigate this issue some other time, as sql queries comes in handy sometimes!

Setup Info
    Tags
      Help me…