Check out the first Dataiku 8 Deep Dive focusing on Productivity on October 29th Read More

How to generate multiple rows for each row

Level 3
How to generate multiple rows for each row

I am converting an Alteryx workflow to Dataiku and one option in Alteryx is to create multiple rows based on some criteria. I would like to take each row in my data set and copy it , adding a new column called MonthDate that starts with January of this year and goes to the current month so it keeps adding months as the month progresses in the calendar year.

For instance I have 2 rows:

ID   Name

1     Mary Smith

2      Joe Henry

 

If I choose to start at 2020-05-01 as the date I would see:

1   Mary Smith   2020-05-01

1   Mary Smith   2020-06-01

1   Mary Smith   2020-07-01

1   Mary Smith   2020-08-01

2   Joe Henry     2020-05-01

etc.

What is the best way to do this in Dataiku as I don't see anything that would be able to do this. Do I need to use python?

Thanks for all the help in advance!

0 Kudos
4 Replies
Dataiker
Dataiker

Hi @aw30, you can go with a full Python recipe in this situation, or you can go "Python lite" by using a Python function within the Prepare recipe in "Rows" mode.  See, for example: https://academy.dataiku.com/advanced-prepare-recipe-usage/513260.  This can be convenient if you have other Prepare steps to take.

0 Kudos
Level 3
Author

Thank you - I did see this option and will now look into it.

0 Kudos
Dataiker
Dataiker

Hi, for this use case you can use a cross join (or cartesian product). In your case you will need your original dataset and a second dataset containing one row per date of interest, so the dataset should look like this:

Date

2020-05-01

2020-06-01

2020-07-01

etc.

After that you can open a join recipe, select cross join as join type, and then select your columns of interest. You can find more information on how to setup a join recipe here: https://academy.dataiku.com/visual-recipes-overview/479217

Level 3
Author

Thank you for the response! I am aware of this option but wondered if there was an option that works on demand instead of having a second dataset with all the dates. I'll mark this as a solution because it is a way to accomplish this. Thank you again!

0 Kudos