SQL OVER(PARTITION BY ... ORDER BY partition_key) on partitioned dataset
julesbertrand
Partner, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 9 Partner
Hi,
I have a dataset partitioned by month_key (string type, not date), e.g. 202112 for December 2021.
I am working on customer data, and I want to flag changes from month to month in a column.
Original table is like
client_id | month_key (partition key) | client_attribute |
000000001 | 202110 | A |
000000001 | 202111 | A |
000000001 | 202112 | B |
000000001 | 202201 | B |
And I want:
client_id | month_key (partition key) | client_attribute_change_flag |
000000001 | 202110 | False |
000000001 | 202111 | False |
000000001 | 202112 | True |
000000001 | 202201 | False |
I successfully used an SQL OVER clause on an unpartitioned version of the dataset:
client_attribute != LAG(client_attribute) OVER (PARTITION BY `client_id` ORDER BY `month_key`) AS client_attribute_change_flag
Now it is partitioned, i can't re use the same recipe AS I don't have access to month_key and month_key - 1. Do you have any advice on how to do this ? Should I temporarily un partition the dataset, compute my flag column and join back with my partitioned dataset ?
Thanks
Jules