An Introduction to Partitioning
On May 20th, we'll dive into partitions with @Malick-K to improve performance and make better use of compute resources when dealing with large volumes of data.
As datasets grow over time, it takes longer to update the flow with fresh incoming data, run preparation steps, and retrain models. Partitioning helps solve this issue. Partitioning refers to splitting a dataset along meaningful dimensions, so that each partition contains a subset of the dataset.
These dimensions can be time-based (e.g. year, month, day, or hour) or discrete (e.g. country or business unit). With a partitioned dataset, the flow can be built for the incremental data only, while the historical data is kept as it is.
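To make the idea of an incremental build concrete, here is a minimal sketch in plain Python. It does not use the Dataiku DSS API; the function names (`partition_by`, `incremental_build`), the sample rows, and the dates are all made up for illustration. It only shows the principle: rows are grouped by a partitioning dimension, and only partitions that have not been built yet are processed.

```python
from collections import defaultdict


def partition_by(rows, key):
    """Group rows into partitions keyed by the value of `key` (hypothetical helper)."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[key]].append(row)
    return dict(partitions)


def incremental_build(partitions, built, process):
    """Process only the partitions missing from `built`; return the ids just built."""
    newly_built = []
    for partition_id in sorted(partitions):
        if partition_id not in built:
            built[partition_id] = process(partitions[partition_id])
            newly_built.append(partition_id)
    return newly_built


def total_amount(partition):
    """Stand-in for an expensive preparation step."""
    return sum(row["amount"] for row in partition)


# Illustrative data, partitioned along a time dimension (day).
rows = [
    {"day": "2021-05-18", "country": "FR", "amount": 10},
    {"day": "2021-05-18", "country": "US", "amount": 5},
    {"day": "2021-05-19", "country": "FR", "amount": 7},
]

built = {}  # results of partitions that were already processed
first_run = incremental_build(partition_by(rows, "day"), built, total_amount)

# Fresh data arrives for a new day: only that partition is processed.
rows.append({"day": "2021-05-20", "country": "US", "amount": 3})
second_run = incremental_build(partition_by(rows, "day"), built, total_amount)

print(first_run)   # partitions built on the first run
print(second_run)  # partitions built on the second run
```

Passing `"country"` instead of `"day"` to `partition_by` would partition the same rows along a discrete dimension.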
Note: Partitioning is not available in the Community edition of Dataiku DSS.
If you’re interested in learning more about Partitioning, please join us next week!
For more resources about Partitioning: