As datasets become more voluminous over time, processing time grows to update the flow with fresh incoming data, run preparation steps, and retrain models. Partitioning helps solve the issue. Partitioning refers to the splitting of the dataset along meaningful dimensions. Each partition contains a subset of the dataset.
By splitting a dataset into subsets along meaningful dimensions: time (ex: year, month, day or hour) or discrete (ex: country, business unit, etc.), it leads to building the flow for the incremental data only - while keeping the historical data as it is.