Added on April 12, 2022 12:37PM
Hi,
I have a dataset containing hourly rain data, discretely partitioned by pluviometer ID (1,000 IDs).
I'd like to sub-partition this dataset by month (i.e. `/%{ID}/%Y/%M/.*`), but when I try to repartition it:
- I can't use the redispatch function
- My ID column disappears
- My DATETIME column disappears
- Processing is extremely slow

How can I manage all these constraints?
Regards,
Amir
Operating system used: Linux
Hello @ACB,
It is possible to implement a partitioning scheme involving a discrete dimension followed by a time dimension.
The initial state I’m envisioning is a dataset that was uploaded from a single CSV file containing the pluviometer data.
From this initial state, here are the steps you would need to perform to implement DSS-based partitioning on this dataset using a discrete dimension (ID) followed by a time dimension (month).
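For orientation, the on-disk layout you are aiming for would look roughly like this (the IDs, dates, and file names below are purely illustrative):

```
original_dataset_prepared_copy/
    0042/2021/01/out-s0.csv
    0042/2021/02/out-s0.csv
    ...
    0997/2022/03/out-s0.csv
```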
STEP 1 - Add a Prepare recipe (https://doc.dataiku.com/dss/latest/other_recipes/prepare.html) to the original dataset. In the Prepare recipe, get the two partitioning columns ready: keep the pluviometer ID column as-is and parse the DATETIME column so DSS recognizes it as a date.
At the end of STEP 1 the Flow looks like: `original_dataset` -> `Prepare recipe` -> `original_dataset_prepared`
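If it helps to see what STEP 1 must guarantee, here is a minimal pandas sketch of the same cleanup done outside DSS (the column names `ID` and `DATETIME` come from your question; the file name is hypothetical):

```python
import pandas as pd

# Load the raw hourly pluviometer readings (hypothetical file name).
df = pd.read_csv("pluviometer_hourly.csv")

# Parse DATETIME so a month can later be derived from it; unparseable
# timestamps become NaT instead of raising an error.
df["DATETIME"] = pd.to_datetime(df["DATETIME"], errors="coerce")

# Redispatching needs a non-empty value in every partitioning column,
# so drop rows where either of them is missing.
df = df.dropna(subset=["ID", "DATETIME"])
```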
STEP 2 - Add a Sync recipe (https://doc.dataiku.com/dss/latest/other_recipes/sync.html) to `original_dataset_prepared` to create a copy of it.
At the end of STEP 2 the Flow looks like: `original_dataset` -> `Prepare recipe` -> `original_dataset_prepared` -> `Sync recipe` -> `original_dataset_prepared_copy`
STEP 3 - Define the desired partitioning scheme. Access `original_dataset_prepared_copy` in the Flow.
Then go to `Settings > Partitioning` and implement the partitioning scheme of ID followed by MONTH; a sketch of the settings follows this step.
At the end of STEP 3 the Flow is the same as at the end of STEP 2.
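As a rough sketch (the dimension names are taken from your columns; adjust as needed), the partitioning settings would look like:

```
Dimension 1: ID        (Discrete)
Dimension 2: DATETIME  (Time range, granularity: Month)
Pattern:     %{ID}/%Y/%M/.*
```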
STEP 4 - Go back to the Sync recipe. Now that you have defined a partitioning scheme directly downstream of the Sync recipe, you should see a checkbox labeled “Redispatch partitioning according to input columns” in the UI of the Sync recipe.
Check that box and click Run. This kicks off the job that takes the input dataset `original_dataset_prepared` and rewrites the output dataset `original_dataset_prepared_copy` on disk to conform to the partitioning scheme defined in STEP 3.
At the end of STEP 4 the Flow looks the same as at the end of STEP 3, with one difference: `original_dataset_prepared_copy` is now a partitioned dataset. If you look at the dataset on disk, you will see a directory structure based on the partitioning scheme, and the partitioning columns have been dropped from the output dataset.
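If you want to check the result programmatically, here is a minimal sketch using the public Python API client (the instance URL, API key, and project key are placeholders):

```python
import dataikuapi

# Placeholders: substitute your own instance URL, API key, and project key.
client = dataikuapi.DSSClient("https://dss.example.com:11200", "YOUR_API_KEY")
project = client.get_project("RAINFALL")

# After the redispatch run, the copy should expose one partition
# per (ID, month) combination.
dataset = project.get_dataset("original_dataset_prepared_copy")
print(dataset.list_partitions())
```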
At the end of steps 1-4, your Flow should look like: `original_dataset` -> `Prepare recipe` -> `original_dataset_prepared` -> `Sync recipe` -> `original_dataset_prepared_copy` (now partitioned).
---
Specifically addressing your constraints: