Sample Data For Training
Curious
Hi everyone,
I’m currently working on a machine learning project in Dataiku with around 1 million rows of training data.
When I try to train on the full dataset, the process crashes with out-of-memory errors. I noticed that Dataiku provides several sampling options (e.g., using only the first 10,000 rows, ratio-based sampling, etc.).
I’m considering using these sampling options, but I’m concerned about their impact on model performance.
My questions are:
- How much does sampling (e.g., 10k rows or ratio-based sampling) typically affect model performance?
- Is there a recommended strategy for balancing memory constraints against model accuracy in Dataiku?
- Are there best practices for handling large datasets like this (e.g., chunking, distributed training, or specific configurations)?
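To make the first question concrete, here is a minimal sketch of the kind of check I have in mind, run outside Dataiku with plain scikit-learn. It trains on stratified random samples of increasing size and scores each model on the same hold-out set, so I can see where the metric stops improving. Everything in it is an assumption for illustration: the data is synthetic, `LogisticRegression` is just a stand-in for whatever algorithm is actually used, and `sample_size_curve` is a hypothetical helper name, not a Dataiku function.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split


def sample_size_curve(X, y, sizes, seed=0):
    """Train on stratified samples of increasing size; score each on one fixed hold-out set."""
    # Hold out 20% once so every sample size is judged on the same rows.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=seed
    )
    results = {}
    for n in sizes:
        if n >= len(X_train):
            # Requested size covers the whole training split: use all of it.
            X_sub, y_sub = X_train, y_train
        else:
            # Stratified random sample of n rows (not simply the first n rows).
            X_sub, _, y_sub, _ = train_test_split(
                X_train, y_train, train_size=n, stratify=y_train, random_state=seed
            )
        model = LogisticRegression(max_iter=1000)  # stand-in model (assumption)
        model.fit(X_sub, y_sub)
        scores = model.predict_proba(X_test)[:, 1]
        results[len(X_sub)] = roc_auc_score(y_test, scores)
    return results


# Synthetic stand-in data (assumption); the real dataset would replace X and y.
X, y = make_classification(
    n_samples=20000, n_features=20, n_informative=8,
    n_clusters_per_class=1, random_state=0,
)
curve = sample_size_curve(X, y, sizes=[500, 2000, 8000, 16000])
for n, auc in curve.items():
    print(f"{n:>6} rows -> AUC {auc:.4f}")
```

If the score flattens well below the full row count, a random sample of that size is probably enough. The sketch samples randomly with stratification on purpose: taking the first N rows could be biased if the data is ordered by date or class.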
Thanks in advance!