Monthly Partitioning changes partition column value
I am trying to setup monthly partitioning on a date column in my snowflake database. I have the source table and output dataset set as monthly partitioning. In the middle I have a prepare recipe where I use the time range to get a month (screenshot below), the output of the posting_date field changes from an actual date, to the beginning of the month. I suspect it does this because I chose a monthly partition?
So I tried changing the partitioning to the day level. When I do this, I think it keeps the posting_date field as a day level, but this causes an insane slowdown. It is now doing the same query for each day instead of each month, but the queries aren't any faster, so essentially this makes the process about 30 times slower. I am brand new to dataiku and assume I am doing something wrong, because this seems like really strange behavior. Do I need to add a duplicate posting_date field, one for partitioning and one to keep the original values?
Here is the input/output
Operating system used: Windows
Best Answer
-
Alexandru Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,211 Dataiker
Hi @Gipple
,
To preserve your original timestamp you could create an actual partitioning column in the format prior to partitioning the dataset using a prepared recipe for example
https://doc.dataiku.com/dss/latest/partitions/sql_datasets.html#sql-datasets-and-time-partitioning
The 30x slower part is due to the fact you are collecting partitions from the partition dependency defined. Meaning you are running 1 month for each day e.g, 30x more data is being processed and stored compared to a single day. Your output will be 30x larger.
You don't need this unless you want to collect data for archival or in specific cases.
See:
https://knowledge.dataiku.com/latest/mlops-o16n/partitioning/concept-redispatch.html#:~:text=Collecting%20partitions%20allows%20us%20to,to%20a%20non%2Dpartitioned%20dataset.
Try changing the partition dependency to "equals." If you want to rebuild run for 30 days' worth of data, simply select the time range and run the job from :To run the recipe for 30 days, you can use the time range from the run option/ scenario.
Answers
-
Thanks for your response. This has been really helpful