Sub-partitioning
Hi,
I have a dataset containing hourly rain data that is partitioned along a discrete dimension: the pluviometer ID (1000 IDs).
I'd like to sub-partition this dataset by month (i.e. `/%{ID}/%Y/%M/.*`), but when I try to repartition it:
- I can't use the redispatch function
- My ID column disappears
- My DATETIME column disappears
- Processing is extremely long
How can I manage all these constraints?
Regards,
Amir
Operating system used: Linux
Answers
Hello @ACB,
It is possible to implement a partitioning scheme involving a discrete dimension followed by a time dimension.
The initial state I’m envisioning is a dataset that was uploaded from a single CSV file containing the pluviometer data.
From this initial state, here are the steps you would need to perform to implement DSS-based partitioning on this dataset using a discrete dimension (ID) followed by a time dimension (month).
STEP 1 - Add a Prepare recipe (https://doc.dataiku.com/dss/latest/other_recipes/prepare.html) to the original dataset. In the prepare recipe do the following:
- Use the Parse date processor (https://doc.dataiku.com/dss/latest/preparation/dates.html?#parse-date-processor) to parse the date column you wish to use in the partition key (this will allow for time-based partitioning on that column).
- Use the copy column processor (https://doc.dataiku.com/dss/latest/preparation/processors/column-copy.html) to create copies of the partition key columns (i.e. create copies of the ID column and the date column that will be used as the partition key). This step is necessary to preserve the partition key columns in the output dataset after partitioning is performed on the original dataset. Doing this will address your constraint about the columns being dropped from the output dataset.
At the end of STEP 1 the Flow looks like: `original_dataset` -> `Prepare recipe` -> `original_dataset_prepared`
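If it helps to see the intent of STEP 1 outside the UI, here is a rough pandas sketch of the equivalent transformation. The file name `pluviometer_data.csv` and the column names `pluvio_id` and `measure_datetime` are just placeholders for your actual data; in DSS itself, the Prepare recipe processors are the recommended way to do this.

```python
import pandas as pd

# Hypothetical input: one row per pluviometer per hour
df = pd.read_csv("pluviometer_data.csv")

# Equivalent of the "Parse date" processor: turn the text column into a real datetime
df["measure_datetime"] = pd.to_datetime(df["measure_datetime"])

# Equivalent of the "Copy column" processor: keep copies of the future partition keys,
# because the originals are dropped from the data files when the dataset is redispatched
df["pluvio_id_copy"] = df["pluvio_id"]
df["measure_datetime_copy"] = df["measure_datetime"]
```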
STEP 2 - Add a Sync recipe (https://doc.dataiku.com/dss/latest/other_recipes/sync.html) to `original_dataset_prepared` to create a copy of `original_dataset_prepared`
At the end of STEP 2 the Flow looks like: `original_dataset` -> `Prepare recipe` -> `original_dataset_prepared` -> `Sync recipe` -> `original_dataset_prepared_copy`
STEP 3 - Define the desired partitioning scheme. Access `original_dataset_prepared_copy` in the Flow.
Then go to `Settings > Partitioning` and implement the partitioning scheme of ID followed by MONTH:
- Click Activate Partitioning
- Add a discrete dimension and enter the name of the ID column
- Add a time dimension and enter the name of the date column
- Pattern field should look like: `%{ID_column_name}/%Y/%M/.*`
At the end of STEP 3 the Flow is the same as at the end of STEP 2.
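To make the pattern concrete, here is a small sketch (with made-up values) of how one (ID, month) pair maps to a partition path under `%{ID_column_name}/%Y/%M/.*`. Note that DSS's `%M` in a partitioning pattern is the calendar month, which corresponds to `%m` in Python's strftime:

```python
from datetime import datetime

pluvio_id = "P0042"                 # hypothetical pluviometer ID
ts = datetime(2023, 7, 14, 9, 0)    # any timestamp falling inside the partition

# DSS pattern %{ID_column_name}/%Y/%M/.*  ->  one directory per ID and per month
partition_path = f"{pluvio_id}/{ts:%Y}/{ts:%m}/"
print(partition_path)               # P0042/2023/07/
```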
STEP 4 - Go back to the Sync recipe. Now that a partitioning scheme is defined on the dataset directly downstream from the Sync recipe, you should see a checkbox labeled “Redispatch partitioning according to input columns” in the UI of the Sync recipe.
Check the box labeled “Redispatch partitioning according to input columns” and click Run. This kicks off the job that takes the input dataset `original_dataset_prepared` and rewrites the output dataset `original_dataset_prepared_copy` on disk so that it conforms to the partitioning scheme defined in STEP 3.
At the end of STEP 4 the Flow looks the same as at the end of STEP 3, with the only difference being that `original_dataset_prepared_copy` is now a partitioned dataset (i.e. if you look at the dataset on disk you will see a directory structure based on the partitioning scheme, and the partitioning columns have been dropped from the output dataset).
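Conceptually, the redispatch job does something like the following sketch: it reads the non-partitioned input once, groups rows by the partition key columns, and writes one file per (ID, month) folder. This is not DSS's actual implementation, only an illustration of why the job touches every row and every partition in a single run (and why it can take a while with 1000 IDs times many months). The file and column names are the same hypothetical ones used above.

```python
import os
import pandas as pd

df = pd.read_csv("original_dataset_prepared.csv", parse_dates=["measure_datetime"])

# Group by the partition keys and write one file per ID/year/month folder
for (pluvio_id, period), part in df.groupby(
    ["pluvio_id", df["measure_datetime"].dt.to_period("M")]
):
    out_dir = os.path.join("original_dataset_prepared_copy",
                           str(pluvio_id), f"{period.year}", f"{period.month:02d}")
    os.makedirs(out_dir, exist_ok=True)
    # The partition key columns are dropped from the data files, as with redispatch
    part.drop(columns=["pluvio_id", "measure_datetime"]).to_csv(
        os.path.join(out_dir, "data.csv"), index=False
    )
```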
At the end of steps 1-4 your Flow should look like: `original_dataset` -> `Prepare recipe` -> `original_dataset_prepared` -> `Sync recipe` -> `original_dataset_prepared_copy` (now partitioned).
Note:
- We preserved the partitioning columns in the output dataset by making copies of them in the upstream Prepare recipe.
- Even if you did not copy the partition key columns, you would still be able to reference the partition keys when performing operations on specific partitions of data (see the sketch after this list).
- I recommend this free course on DSS partitioning: https://academy.dataiku.com/path/advanced-designer/advanced-partitioning
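For the point above about working with specific partitions, here is a rough sketch of what that can look like from a Python notebook. It assumes the `dataiku` package's `Dataset.list_partitions()` and `Dataset.add_read_partitions()`, and a partition identifier format like `"P0042|2023-07"` for an ID x month scheme; please check the Python API reference for your DSS version, and note that explicitly selecting read partitions like this applies to notebooks, whereas in recipes the partition dependencies determine what is read.

```python
import dataiku

ds = dataiku.Dataset("original_dataset_prepared_copy")

# List the partition identifiers, e.g. something like "P0042|2023-07"
print(ds.list_partitions())

# Read a single partition into a dataframe (notebook usage)
ds.add_read_partitions("P0042|2023-07")
df_one_partition = ds.get_dataframe()
```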
---
Specifically addressing your constraints:
- You should be able to use the redispatch function in a Sync recipe when the dataset directly downstream from the Sync recipe has partitioning enabled.
- You can prevent the partition key columns from disappearing from the final output dataset by creating copies of those columns via the Prepare recipe. Note: even if you did not copy them, you would still be able to reference the partition keys when performing operations on specific partitions of data.
- I’m not sure what levers (if any) you’ll have for improving performance.
- What’s the schema of the original dataset and how many rows are in it?
- Which storage are you using for the input dataset?
- Which storage are you using for the output dataset?
- How long does the redispatch job take to complete?