Watch @Malick-K explain how to create partitioned datasets to improve performance and reduce computation when dealing with large volumes of data.
As datasets grow over time, it takes longer to update the flow with fresh incoming data, run preparation steps, and retrain models. Partitioning helps solve this issue. By splitting a dataset into subsets along meaningful dimensions (time or discrete dimensions), you can build the flow for the incremental data only while keeping the historical data as it is.
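To make the idea concrete, here is a minimal, tool-agnostic sketch in plain Python. It does not use the Dataiku DSS API; the helper names `partition_by_day` and `build_incremental`, the row fields, and the sample values are all hypothetical and chosen only for illustration. It assumes a time dimension with one partition per day and a build step that simply sums sales.

```python
from collections import defaultdict
from datetime import date


def partition_by_day(rows):
    """Group rows into partitions keyed by their order date (ISO string)."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row["order_date"].isoformat()].append(row)
    return dict(partitions)


def build_incremental(partitions, built):
    """Compute total sales only for partitions that have not been built yet.

    `built` maps partition key -> stored result and is updated in place.
    Returns the list of partition keys that were built during this run.
    """
    newly_built = []
    for key in sorted(partitions):
        if key in built:
            continue  # historical partition: keep the stored result as it is
        built[key] = sum(r["amount"] for r in partitions[key])
        newly_built.append(key)
    return newly_built


# Illustrative retail rows (made-up values).
rows = [
    {"order_date": date(2021, 3, 1), "country": "FR", "amount": 10.0},
    {"order_date": date(2021, 3, 1), "country": "US", "amount": 5.0},
    {"order_date": date(2021, 3, 2), "country": "FR", "amount": 7.5},
]

built = {}
first_run = build_incremental(partition_by_day(rows), built)

# A new day of data arrives: only that partition is processed.
rows.append({"order_date": date(2021, 3, 3), "country": "US", "amount": 2.0})
second_run = build_incremental(partition_by_day(rows), built)
```

In this sketch the first run builds both existing days, while the second run touches only the newly arrived day. One simplification to keep in mind: rows that arrive late for an already built day would not trigger a rebuild here, so a real setup needs an explicit rule for when a historical partition must be recomputed.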
Malick Konate (Data Scientist, Dataiku) will explain in detail what partitioning is and how DSS users can use it to improve computational performance when dealing with large volumes of data. Using the example of a retail company, he will walk us through how partitioning can be used to build historical data, run data processes on new data only, and train a partitioned machine learning model for each country. This will also be an opportunity to share best practices and common pitfalls of managing dependencies.
Note: Partitioning is not available in the Community edition of Dataiku DSS.
Malick started in the data ecosystem with business intelligence projects in data engineering and data visualization. He is now a Data Scientist at Dataiku in Paris, where he supports our customers in building efficient data science projects and deploying them into production.