Monthly Partitioning changes partition column value

Solved!
Gipple
Level 2

I am trying to set up monthly partitioning on a date column in my Snowflake database. I have both the source table and the output dataset set to monthly partitioning. In the middle I have a Prepare recipe where I use the time range to get a month (screenshot below), and the value of the posting_date field changes from the actual date to the beginning of the month. I suspect it does this because I chose a monthly partition?

So I tried changing the partitioning to the day level. When I do this, I think it keeps the posting_date field at the day level, but it causes an insane slowdown: it now runs the same query for each day instead of each month, and the queries aren't any faster, so the process becomes about 30 times slower. I am brand new to Dataiku and assume I am doing something wrong, because this seems like really strange behavior. Do I need to add a duplicate posting_date field, one for partitioning and one to keep the original values?

 

Here is a screenshot of the posting_date getting changed to the first of the month. I made a second posting_date to just carry through the recipe so I could see what the value was.

[Screenshot: Capture2.PNG]

Here are the input and output datasets:

[Screenshot: Capture.PNG]


Operating system used: Windows

2 Replies
AlexT
Dataiker

Hi @Gipple,

To preserve your original timestamp, you can create a dedicated partitioning column in the required format before partitioning the dataset, for example with a Prepare recipe:
https://doc.dataiku.com/dss/latest/partitions/sql_datasets.html#sql-datasets-and-time-partitioning
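As a rough illustration, here is a minimal sketch of the same idea in a Python recipe. The dataset names (posting_input, posting_partitioned) and the "%Y-%m" partition format are assumptions for this example; the dataiku read/write calls are the standard documented ones:

```python
import dataiku
import pandas as pd

# Read the (hypothetical) input dataset
input_ds = dataiku.Dataset("posting_input")
df = input_ds.get_dataframe()

# Keep posting_date untouched; derive a separate column holding the
# month value, to be used as the dataset's partitioning column.
# DSS month partition IDs are typically "YYYY-MM" -- verify against
# your dataset's partitioning settings.
df["posting_month"] = pd.to_datetime(df["posting_date"]).dt.strftime("%Y-%m")

# Write to the (hypothetical) partitioned output dataset
output_ds = dataiku.Dataset("posting_partitioned")
output_ds.write_with_schema(df)
```

This way posting_month drives the partitioning while posting_date keeps its original day-level values.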

The 30x slowdown comes from the partition dependency you have defined: you are collecting partitions, meaning a full month is run for each day, so roughly 30x more data is processed and stored compared to a single day. Your output will also be about 30x larger.

You don't need this unless you want to collect data for archival purposes or in other specific cases. See:
https://knowledge.dataiku.com/latest/mlops-o16n/partitioning/concept-redispatch.html#:~:text=Collect....

Try changing the partition dependency to "Equals". If you then want to rebuild 30 days' worth of data, simply select the time range and run the job from the run options:

[Screenshot: Screenshot 2024-02-23 at 6.36.08 PM.png]

To run the recipe for 30 days, you can use the time range from the run options or from a scenario.
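If you prefer to script that kind of rebuild, here is a hedged sketch using the public dataikuapi client. The host, API key, project key, dataset name, and start date are placeholders; the build call follows the documented DSSDataset API, but check the exact signature against your DSS version:

```python
import dataikuapi
from datetime import date, timedelta

# Placeholder connection details -- replace with your own
client = dataikuapi.DSSClient("https://dss.example.com", "YOUR_API_KEY")
project = client.get_project("MY_PROJECT")

# Build a comma-separated list of 30 daily partition IDs
# (DSS day partitions use the YYYY-MM-DD format)
start = date(2024, 1, 1)
partitions = ",".join(str(start + timedelta(days=i)) for i in range(30))

# Trigger a build of those partitions on the (hypothetical) output dataset
dataset = project.get_dataset("posting_partitioned")
job = dataset.build(partitions=partitions)
```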



Gipple
Level 2
Author

Thanks for your response. This has been really helpful.

