how to generate rows using windows

shri
shri Registered Posts: 3 ✭✭✭
edited August 5 in Using Dataiku

I have to generate a date row whose conditions are as follows ,

initial condition - datetimeadd([date],1,"days")

condition expression - newrow <= DateTimeAdd([date],365,"days")

looping condition - DateTimeAdd(newrow,1,"days")

Here, date is already existing row .In alteryx we created this using generate rows tool , in Dataiku is it possible to create this using windows , if so please guide me . Or else which recipe should choose for this one .

Tagged:

Answers

  • Manuel
    Manuel Alpha Tester, Dataiker Alumni, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Core Concepts, Dataiku DSS Adv Designer, Registered Posts: 193 ✭✭✭✭✭✭✭

    Hi,

    I answered your other post, which seems to be about the same challenge.

  • tgb417
    tgb417 Dataiku DSS Core Designer, Dataiku DSS & SQL, Dataiku DSS ML Practitioner, Dataiku DSS Core Concepts, Neuron 2020, Neuron, Registered, Dataiku Frontrunner Awards 2021 Finalist, Neuron 2021, Neuron 2022, Frontrunner 2022 Finalist, Frontrunner 2022 Winner, Dataiku Frontrunner Awards 2021 Participant, Frontrunner 2022 Participant, Neuron 2023 Posts: 1,598 Neuron
    edited July 17

    @shri

    Welcome to the Dataiku Community.

    From your post, I'm also not exactly clear what you are trying to achieve. It sounds like you may want to create a "date table" for use in reporting, so you don't miss any dates in your final output data set.

    I don't recognize the language or syntax you are using above. It looks like it might be a set of Alteryx's function in their functional language. Dataiku has a function language, however it is different than Alteryx's language.

    If it turns out that I've guessed correctly and you are trying to create a date table. You might find the following helpful.

    If I was trying to create a data table I'd likely create a quick python recipe. Something like this might do the job.

    # -------------------------------------------------------------------------------- NOTEBOOK-CELL: CODE
    # -*- coding: utf-8 -*-
    import dataiku
    import pandas as pd, numpy as np
    from dataiku import pandasutils as pdu

    # -------------------------------------------------------------------------------- NOTEBOOK-CELL: CODE
    def create_date_table2(start='2000-01-01', end='2050-12-31'):
    df = pd.DataFrame({"Date": pd.date_range(start, end)})
    df["Day"] = df.Date.dt.weekday_name
    df["Week"] = df.Date.dt.weekofyear
    df["Quarter"] = df.Date.dt.quarter
    df["Year"] = df.Date.dt.year
    df["Year_half"] = (df.Quarter + 1) // 2
    return df
    # Thanks to this Stack Overflow post for the code above
    # https://stackoverflow.com/questions/47150709/how-to-create-a-calendar-table-date-dimension-in-pandas

    # -------------------------------------------------------------------------------- NOTEBOOK-CELL: CODE
    # Compute recipe outputs

    DateTable_df = create_date_table2()

    # -------------------------------------------------------------------------------- NOTEBOOK-CELL: CODE
    # Write recipe outputs
    DateTable = dataiku.Dataset("DateTable")
    DateTable.write_with_schema(DateTable_df)

    If I were using a SQL database under Dataiku DSS, I might use a recursive CTE to produce a date table. Finally depending on why I needed the data table, For example if I were doing time series calculations, I might use the time series plugin. There is a tutorial on using the time series plugin to prep your data. https://academy.dataiku.com/path/ml-practitioner/time-series-preparation-1

    Hope one of those ideas is useful to you. If I've completely missed the mark with these few ideas; Please do come back with a more general description of your use case maybe describe without Alteryx code. Maybe giving a sense of the initial condition of the data and the hoped for end condition of the data. Someone here should be able to help.

    --Tom

  • shri
    shri Registered Posts: 3 ✭✭✭

    Hi ,

    Thanks for your elaborate response . Actually I am migrating workflow from Alteryx to Dataiku , there I had challenge in generating a new column for each row based on existing column . In Alteryx we used generate row tool and I have already mentioned the conditions we used in Alteryx . Here in Dataiku I am trying to achieve this by using windows recipe computed column , but I am clueless how to execute this same function .

  • tgb417
    tgb417 Dataiku DSS Core Designer, Dataiku DSS & SQL, Dataiku DSS ML Practitioner, Dataiku DSS Core Concepts, Neuron 2020, Neuron, Registered, Dataiku Frontrunner Awards 2021 Finalist, Neuron 2021, Neuron 2022, Frontrunner 2022 Finalist, Frontrunner 2022 Winner, Dataiku Frontrunner Awards 2021 Participant, Frontrunner 2022 Participant, Neuron 2023 Posts: 1,598 Neuron

    @shri

    So from your most recent note above it sounds like you are just adding a new column based on data in the current row with no reference to any other rows in your dataset. Is that a correct understanding of what you are trying to do? If that is the case the window recipie is not going to be of much help; the visual prepare is where to go. If you need to reference other rows then window recipe is where to go.

    Hope that helps.

  • shri
    shri Registered Posts: 3 ✭✭✭
    Your understanding is correct , here I am trying to generate a column named NEWROW , for each row of the existing column DATE . The output should look like this ,
    IDDATENEWROW
    existing colnew col to be created
    1231/1/20211/2/2022
    1231/1/20211/3/2022
    1231/1/20211/4/2022
    1231/1/20211/5/2022
    1231/1/20211/6/2022
    I have to get 365 rows for each ID . Basically its like a for loop , while I am trying to achieve this using computed column I get error at the <= symbol . Hope this gives you more insight about my problem.

  • tgb417
    tgb417 Dataiku DSS Core Designer, Dataiku DSS & SQL, Dataiku DSS ML Practitioner, Dataiku DSS Core Concepts, Neuron 2020, Neuron, Registered, Dataiku Frontrunner Awards 2021 Finalist, Neuron 2021, Neuron 2022, Frontrunner 2022 Finalist, Frontrunner 2022 Winner, Dataiku Frontrunner Awards 2021 Participant, Frontrunner 2022 Participant, Neuron 2023 Posts: 1,598 Neuron

    @shri

    In the use case you are describing I might create a date table as I have described, and then use a visual join recipe to connect the date table to your original data set. This would create all of the new rows you need.

    Just my $0.02. Hope it helps.

    P.S. I'd also look at whatever the problem I'm trying to solve and see if there is a way refactor the problem. The repeating of data in the way is often a flag for me that I need to re-consider the problem. Just something to consider.

  • Ashley
    Ashley Dataiker, Alpha Tester, Dataiku DSS Core Designer, Registered, Product Ideas Manager Posts: 161 Dataiker

    Hi @shri
    ,

    I’m a bit late to the party here, but I stumbled upon this post from April and decided to poke around my DSS to see what I could come up with. I have good news: you can use the Prepare recipe to generate rows using the forRange function and the ‘fold an array’ processor

    The forRange function has the following syntax, forRange(from, to, step, variable v, expression e), and gives you an array as its output. It iterates, starting w/ the ‘from’ value, and increments by ‘step’ while the value is less than ‘to’. At each iteration, it evaluates ‘expression e’ where v is the current value.

    Here’s two examples with some sample data

    Case 1: same as the use case described above where you create 365 rows, one for each day in the year following a start date

    Starting dataset

    1. Create a Prepare recipe
    2. Use the forRange formula to create a new column called array_column
      1. forRange(-1, 364, v, v+1)
      2. Creates an array with 365 values starting with 0 going all the way to 364 : [0,1,2,3,..., 364]
    3. Use the fold an array processor on array_column
      1. Creates a new row for each value in the array. The rest of your data is copied
    4. Use the increment date processor on Date
      1. Have it increment by the value found in array_column
    5. You now have a dataset with 365 rows, one for each day in the year!

    Resulting dataset w/ 365 rows

    Case 2: Let's level up Now I want to create a new row for each day between a start_date and end_date column. Maybe each row represents a hotel booking and I want to be able to count how many people were staying in the hotel on any given day...

    Starting dataset

    1. Create a Prepare recipe
    2. Use the forRange formula to create a new column called array_column
      1. forRange(-1, diff(end_date, start_date), 1, v, v+1)
      2. Creates a different array for each row:
        1. ID=1 : the date range from 2022/01/01 to 2022/01/03 has 3 days (inclusive), so this array has 3 values [0,1,2]
        2. ID=2 : the date range from 2022/01/02 to 2022/01/06 has 5 days (inclusive), so the array is [0,1,2,3,4]
        3. If your range is exclusive, you can adjust the to/from values as needed
      3. Use the fold an array processor on array_column
        1. Creates a new row for each value in the array. The rest of your data is copied
      4. Use the increment date processor on start_date
        1. Have it increment by the value found in array_column and put that value into new_date

    Resulting dataset w/ 8 rows

    Enjoy & I hope this helps!

    Ashley

Setup Info
    Tags
      Help me…