how to generate rows using windows
I have to generate a date row whose conditions are as follows ,
initial condition - datetimeadd([date],1,"days")
condition expression - newrow <= DateTimeAdd([date],365,"days")
looping condition - DateTimeAdd(newrow,1,"days")
Here, date is already existing row .In alteryx we created this using generate rows tool , in Dataiku is it possible to create this using windows , if so please guide me . Or else which recipe should choose for this one .
Answers
-
Manuel Alpha Tester, Dataiker Alumni, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Core Concepts, Dataiku DSS Adv Designer, Registered Posts: 193 ✭✭✭✭✭✭✭
Hi,
I answered your other post, which seems to be about the same challenge.
-
tgb417 Dataiku DSS Core Designer, Dataiku DSS & SQL, Dataiku DSS ML Practitioner, Dataiku DSS Core Concepts, Neuron 2020, Neuron, Registered, Dataiku Frontrunner Awards 2021 Finalist, Neuron 2021, Neuron 2022, Frontrunner 2022 Finalist, Frontrunner 2022 Winner, Dataiku Frontrunner Awards 2021 Participant, Frontrunner 2022 Participant, Neuron 2023 Posts: 1,598 Neuron
Welcome to the Dataiku Community.
From your post, I'm also not exactly clear what you are trying to achieve. It sounds like you may want to create a "date table" for use in reporting, so you don't miss any dates in your final output data set.
I don't recognize the language or syntax you are using above. It looks like it might be a set of Alteryx's function in their functional language. Dataiku has a function language, however it is different than Alteryx's language.
If it turns out that I've guessed correctly and you are trying to create a date table. You might find the following helpful.
If I was trying to create a data table I'd likely create a quick python recipe. Something like this might do the job.
# -------------------------------------------------------------------------------- NOTEBOOK-CELL: CODE
# -*- coding: utf-8 -*-
import dataiku
import pandas as pd, numpy as np
from dataiku import pandasutils as pdu
# -------------------------------------------------------------------------------- NOTEBOOK-CELL: CODE
def create_date_table2(start='2000-01-01', end='2050-12-31'):
df = pd.DataFrame({"Date": pd.date_range(start, end)})
df["Day"] = df.Date.dt.weekday_name
df["Week"] = df.Date.dt.weekofyear
df["Quarter"] = df.Date.dt.quarter
df["Year"] = df.Date.dt.year
df["Year_half"] = (df.Quarter + 1) // 2
return df
# Thanks to this Stack Overflow post for the code above
# https://stackoverflow.com/questions/47150709/how-to-create-a-calendar-table-date-dimension-in-pandas
# -------------------------------------------------------------------------------- NOTEBOOK-CELL: CODE
# Compute recipe outputs
DateTable_df = create_date_table2()
# -------------------------------------------------------------------------------- NOTEBOOK-CELL: CODE
# Write recipe outputs
DateTable = dataiku.Dataset("DateTable")
DateTable.write_with_schema(DateTable_df)If I were using a SQL database under Dataiku DSS, I might use a recursive CTE to produce a date table. Finally depending on why I needed the data table, For example if I were doing time series calculations, I might use the time series plugin. There is a tutorial on using the time series plugin to prep your data. https://academy.dataiku.com/path/ml-practitioner/time-series-preparation-1
Hope one of those ideas is useful to you. If I've completely missed the mark with these few ideas; Please do come back with a more general description of your use case maybe describe without Alteryx code. Maybe giving a sense of the initial condition of the data and the hoped for end condition of the data. Someone here should be able to help.
--Tom
-
Hi ,
Thanks for your elaborate response . Actually I am migrating workflow from Alteryx to Dataiku , there I had challenge in generating a new column for each row based on existing column . In Alteryx we used generate row tool and I have already mentioned the conditions we used in Alteryx . Here in Dataiku I am trying to achieve this by using windows recipe computed column , but I am clueless how to execute this same function .
-
tgb417 Dataiku DSS Core Designer, Dataiku DSS & SQL, Dataiku DSS ML Practitioner, Dataiku DSS Core Concepts, Neuron 2020, Neuron, Registered, Dataiku Frontrunner Awards 2021 Finalist, Neuron 2021, Neuron 2022, Frontrunner 2022 Finalist, Frontrunner 2022 Winner, Dataiku Frontrunner Awards 2021 Participant, Frontrunner 2022 Participant, Neuron 2023 Posts: 1,598 Neuron
So from your most recent note above it sounds like you are just adding a new column based on data in the current row with no reference to any other rows in your dataset. Is that a correct understanding of what you are trying to do? If that is the case the window recipie is not going to be of much help; the visual prepare is where to go. If you need to reference other rows then window recipe is where to go.
Hope that helps.
-
Your understanding is correct , here I am trying to generate a column named NEWROW , for each row of the existing column DATE . The output should look like this ,
I have to get 365 rows for each ID . Basically its like a for loop , while I am trying to achieve this using computed column I get error at the <= symbol . Hope this gives you more insight about my problem.ID DATE NEWROW existing col new col to be created 123 1/1/2021 1/2/2022 123 1/1/2021 1/3/2022 123 1/1/2021 1/4/2022 123 1/1/2021 1/5/2022 123 1/1/2021 1/6/2022 -
tgb417 Dataiku DSS Core Designer, Dataiku DSS & SQL, Dataiku DSS ML Practitioner, Dataiku DSS Core Concepts, Neuron 2020, Neuron, Registered, Dataiku Frontrunner Awards 2021 Finalist, Neuron 2021, Neuron 2022, Frontrunner 2022 Finalist, Frontrunner 2022 Winner, Dataiku Frontrunner Awards 2021 Participant, Frontrunner 2022 Participant, Neuron 2023 Posts: 1,598 Neuron
In the use case you are describing I might create a date table as I have described, and then use a visual join recipe to connect the date table to your original data set. This would create all of the new rows you need.
Just my $0.02. Hope it helps.
P.S. I'd also look at whatever the problem I'm trying to solve and see if there is a way refactor the problem. The repeating of data in the way is often a flag for me that I need to re-consider the problem. Just something to consider.
-
AshleyW Dataiker, Alpha Tester, Dataiku DSS Core Designer, Registered, Product Ideas Manager Posts: 161 Dataiker
Hi @shri
,I’m a bit late to the party here, but I stumbled upon this post from April and decided to poke around my DSS to see what I could come up with. I have good news: you can use the Prepare recipe to generate rows using the forRange function and the ‘fold an array’ processor
The forRange function has the following syntax, forRange(from, to, step, variable v, expression e), and gives you an array as its output. It iterates, starting w/ the ‘from’ value, and increments by ‘step’ while the value is less than ‘to’. At each iteration, it evaluates ‘expression e’ where v is the current value.
Here’s two examples with some sample data
Case 1: same as the use case described above where you create 365 rows, one for each day in the year following a start date
- Create a Prepare recipe
- Use the forRange formula to create a new column called array_column
- forRange(-1, 364, v, v+1)
- Creates an array with 365 values starting with 0 going all the way to 364 : [0,1,2,3,..., 364]
- Use the fold an array processor on array_column
- Creates a new row for each value in the array. The rest of your data is copied
- Use the increment date processor on Date
- Have it increment by the value found in array_column
- You now have a dataset with 365 rows, one for each day in the year!
Case 2: Let's level up
Now I want to create a new row for each day between a start_date and end_date column. Maybe each row represents a hotel booking and I want to be able to count how many people were staying in the hotel on any given day... - Create a Prepare recipe
- Use the forRange formula to create a new column called array_column
- forRange(-1, diff(end_date, start_date), 1, v, v+1)
- Creates a different array for each row:
- ID=1 : the date range from 2022/01/01 to 2022/01/03 has 3 days (inclusive), so this array has 3 values [0,1,2]
- ID=2 : the date range from 2022/01/02 to 2022/01/06 has 5 days (inclusive), so the array is [0,1,2,3,4]
- If your range is exclusive, you can adjust the to/from values as needed
- Use the fold an array processor on array_column
- Creates a new row for each value in the array. The rest of your data is copied
- Use the increment date processor on start_date
- Have it increment by the value found in array_column and put that value into new_date
Enjoy & I hope this helps!
Ashley