Monthly Partitioning changes partition column value

Solved!
Gipple
Level 2

I am trying to set up monthly partitioning on a date column in my Snowflake database. I have both the source table and the output dataset set to monthly partitioning. In the middle I have a Prepare recipe where I use the time range to get a month (screenshot below), and the value of the posting_date field changes from the actual date to the beginning of the month. I suspect it does this because I chose a monthly partition?

So I tried changing the partitioning to the day level. When I do this, I think it keeps the posting_date field at the day level, but it causes an insane slowdown: it now runs the same query for each day instead of each month, and the queries aren't any faster, so the process becomes about 30 times slower. I am brand new to Dataiku and assume I am doing something wrong, because this seems like really strange behavior. Do I need to add a duplicate posting_date field, one for partitioning and one to keep the original values?

 

Here is a screenshot of the posting_date getting changed to the first of the month. I made a second posting_date to just carry through the recipe so I could see what the value was.

[Screenshot: Capture2.PNG]

Here are the input and output datasets:

[Screenshot: Capture.PNG]


Operating system used: Windows

2 Replies
AlexT
Dataiker

Hi @Gipple,

To preserve your original timestamp, you can create a dedicated partitioning column in the required format before partitioning the dataset, for example with a Prepare recipe:
https://doc.dataiku.com/dss/latest/partitions/sql_datasets.html#sql-datasets-and-time-partitioning
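As a rough illustration, here is a minimal sketch of the same idea in a Python recipe. The dataset names (posting_input, posting_partitioned) and the "%Y-%m" partition format are assumptions for this example; the dataiku read/write calls are the standard documented ones:

```python
import dataiku
import pandas as pd

# Read the (hypothetical) input dataset
input_ds = dataiku.Dataset("posting_input")
df = input_ds.get_dataframe()

# Keep posting_date untouched; derive a separate column holding the
# month value, to be used as the dataset's partitioning column.
# DSS month partition IDs are typically "YYYY-MM" -- verify against
# your dataset's partitioning settings.
df["posting_month"] = pd.to_datetime(df["posting_date"]).dt.strftime("%Y-%m")

# Write to the (hypothetical) partitioned output dataset
output_ds = dataiku.Dataset("posting_partitioned")
output_ds.write_with_schema(df)
```

This way posting_month drives the partitioning while posting_date keeps its original day-level values.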

The 30x slowdown comes from the partition dependency you have defined: you are collecting partitions, meaning a full month is run for each day, so roughly 30x more data is processed and stored compared to a single day. Your output will also be about 30x larger.

You don't need this unless you want to collect data for archival purposes or in other specific cases. See:
https://knowledge.dataiku.com/latest/mlops-o16n/partitioning/concept-redispatch.html#:~:text=Collect....

Try changing the partition dependency to "Equals". If you then want to rebuild 30 days' worth of data, simply select the time range and run the job from the run options:

[Screenshot: Screenshot 2024-02-23 at 6.36.08 PM.png]

To run the recipe for 30 days, you can use the time range from the run options or from a scenario.
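If you prefer to script that kind of rebuild, here is a hedged sketch using the public dataikuapi client. The host, API key, project key, dataset name, and start date are placeholders; the build call follows the documented DSSDataset API, but check the exact signature against your DSS version:

```python
import dataikuapi
from datetime import date, timedelta

# Placeholder connection details -- replace with your own
client = dataikuapi.DSSClient("https://dss.example.com", "YOUR_API_KEY")
project = client.get_project("MY_PROJECT")

# Build a comma-separated list of 30 daily partition IDs
# (DSS day partitions use the YYYY-MM-DD format)
start = date(2024, 1, 1)
partitions = ",".join(str(start + timedelta(days=i)) for i in range(30))

# Trigger a build of those partitions on the (hypothetical) output dataset
dataset = project.get_dataset("posting_partitioned")
job = dataset.build(partitions=partitions)
```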



Gipple
Level 2
Author

Thanks for your response. This has been really helpful.

