Submit your innovative use case or inspiring success story to the 2023 Dataiku Frontrunner Awards! LET'S GO

how to generate rows using windows

shri
Level 1
how to generate rows using windows

I have to generate a date row  whose conditions are as follows ,

initial condition - datetimeadd([date],1,"days")

condition expression - newrow <= DateTimeAdd([date],365,"days")

looping condition - DateTimeAdd(newrow,1,"days")

Here, date is  already existing row  .In alteryx we created this using generate rows tool , in Dataiku is it possible to create this using windows , if so please guide me . Or else which recipe should choose for this one .

0 Kudos
7 Replies
Manuel
Dataiker Alumni

Hi,

I answered your other post, which seems to be about the same challenge.

 

0 Kudos
tgb417

@shri 

Welcome to the Dataiku Community.

From your post, I'm also not exactly clear what you are trying to achieve.  It sounds like you may want to create a "date table" for use in reporting, so you don't miss any dates in your final output data set.

I don't recognize the language or syntax you are using above.  It looks like it might be a set of Alteryx's function in their functional language.  Dataiku has a function language, however it is different than Alteryx's language.

If it turns out that I've guessed correctly and you are trying to create a date table. You might find the following helpful.

If I was trying to create a data table I'd likely create a quick python recipe.  Something like this might do the job.

# -------------------------------------------------------------------------------- NOTEBOOK-CELL: CODE
# -*- coding: utf-8 -*-
import dataiku
import pandas as pd, numpy as np
from dataiku import pandasutils as pdu

# -------------------------------------------------------------------------------- NOTEBOOK-CELL: CODE
def create_date_table2(start='2000-01-01', end='2050-12-31'):
df = pd.DataFrame({"Date": pd.date_range(start, end)})
df["Day"] = df.Date.dt.weekday_name
df["Week"] = df.Date.dt.weekofyear
df["Quarter"] = df.Date.dt.quarter
df["Year"] = df.Date.dt.year
df["Year_half"] = (df.Quarter + 1) // 2
return df
# Thanks to this Stack Overflow post for the code above
# https://stackoverflow.com/questions/47150709/how-to-create-a-calendar-table-date-dimension-in-pandas

# -------------------------------------------------------------------------------- NOTEBOOK-CELL: CODE
# Compute recipe outputs

DateTable_df = create_date_table2()

# -------------------------------------------------------------------------------- NOTEBOOK-CELL: CODE
# Write recipe outputs
DateTable = dataiku.Dataset("DateTable")
DateTable.write_with_schema(DateTable_df)

 If I were using a SQL database under Dataiku DSS, I might use a recursive CTE to produce a date table.  Finally depending on why I needed the data table, For example if I were doing time series calculations, I might use the time series plugin.  There is a tutorial on using the time series plugin to prep your data.  https://academy.dataiku.com/path/ml-practitioner/time-series-preparation-1 

Hope one of those ideas is useful to you. If I've completely missed the mark with these few ideas; Please do come back with a more general description of your use case maybe describe without Alteryx code.   Maybe giving a sense of the initial condition of the data and the hoped for end condition of the data. Someone here should be able to help.

--Tom

--Tom
0 Kudos
shri
Level 1
Author

Hi ,

Thanks for your elaborate response . Actually I am migrating workflow from Alteryx to Dataiku , there I had challenge in generating a new column for each row based on existing column . In Alteryx we used generate row tool and I have already mentioned the conditions we used in Alteryx . Here in Dataiku I am trying to achieve this by using windows recipe  computed column , but I am clueless how to execute this same function .

0 Kudos
tgb417

@shri 

So from your most recent note above it sounds like you are just adding a new column based on data in the current row with no reference to any other rows in your dataset.  Is that a correct understanding of what you are trying to do?  If that is the case the window recipie is not going to be of much help; the visual prepare is where to go.  If you need to reference other rows then window recipe is where to go.

Hope that helps.   

--Tom
0 Kudos
shri
Level 1
Author
Spoiler (Read more)
Your understanding is correct , here I am trying to generate a column named NEWROW , for each row of the existing column  DATE . The output should look like this ,

IDDATENEWROW
 existing colnew col to be created
1231/1/20211/2/2022
1231/1/20211/3/2022
1231/1/20211/4/2022
1231/1/20211/5/2022
1231/1/20211/6/2022

I  have to get 365 rows for each ID . Basically its like a for  loop , while I am trying to achieve this using computed column I get error at the <= symbol . Hope this  gives you more  insight about my problem.

 

0 Kudos
tgb417

@shri 

In the use case you are describing I might create a date table as I have described, and then use a visual join recipe to connect the date table to your original data set.  This would create all of the new rows you need.

Just my $0.02.  Hope it helps.

P.S.  I'd also look at whatever the problem I'm trying to solve and see if there is a way refactor the problem.  The repeating of data in the way is often a flag for me that I need to re-consider the problem.  Just something to consider.

--Tom
AshleyW
Dataiker

Hi @shri,

I’m a bit late to the party here, but I stumbled upon this post from April and decided to poke around my DSS to see what I could come up with. I have good news: you can use the Prepare recipe to generate rows using the forRange function and the ‘fold an array’ processor 🎉

The forRange function has the following syntax, forRange(from, to, step, variable v, expression e), and gives you an array as its output. It iterates, starting w/ the ‘from’ value, and increments by ‘step’ while the value is less than ‘to’. At each iteration, it evaluates ‘expression e’ where v is the current value.

Here’s two examples with some sample data

Case 1: same as the use case described above where you create 365 rows, one for each day in the year following a start date

Starting datasetStarting dataset

  1. Create a Prepare recipe
  2. Use the forRange formula to create a new column called array_column
    1. forRange(-1, 364, v, v+1)
    2. Creates an array with 365 values starting with 0 going all the way to 364 : [0,1,2,3,..., 364]
  3. Use the fold an array processor on array_column
    1. Creates a new row for each value in the array. The rest of your data is copied
  4. Use the increment date processor on Date
    1. Have it increment by the value found in array_column
  5. You now have a dataset with 365 rows, one for each day in the year!

Resulting dataset w/ 365 rowsResulting dataset w/ 365 rows

Case 2: Let's level up 💪 Now I want to create a new row for each day between a start_date and end_date column. Maybe each row represents a hotel booking and I want to be able to count how many people were staying in the hotel on any given day...

Starting datasetStarting dataset

  1. Create a Prepare recipe
  2. Use the forRange formula to create a new column called array_column
    1. forRange(-1, diff(end_date, start_date), 1, v, v+1)
    2. Creates a different array for each row:
      1. ID=1 : the date range from 2022/01/01 to 2022/01/03 has 3 days (inclusive), so this array has 3 values [0,1,2]
      2. ID=2 : the date range from 2022/01/02 to 2022/01/06 has 5 days (inclusive), so the array is [0,1,2,3,4]
      3. If your range is exclusive, you can adjust the to/from values as needed 🤓
    3. Use the fold an array processor on array_column
      1. Creates a new row for each value in the array. The rest of your data is copied
    4. Use the increment date processor on start_date
      1. Have it increment by the value found in array_column and put that value into new_date

Resulting dataset w/ 8 rowsResulting dataset w/ 8 rows

 

Enjoy & I hope this helps!

Ashley