I'm working on the L2 Cert and trying to perform some analysis for the New York City Taxi Fare Prediction challenge: https://www.kaggle.com/c/new-york-city-taxi-fare-prediction.
The training set is a 5 GB CSV file, and it runs relatively slowly on my local machine when creating models, even for just a group-by recipe.
What would be a proper way to import/create this dataset so that my machine can run models faster?
I want to be able to run on more than 10,000 sample rows without waiting a long time.
It would be expected that doing operations on such a large dataset without a dedicated computation engine (i.e. using just your local machine) would take some time.
What kind of times are we talking about for the grouping?
I created a scenario to build a simple job (i.e. two prepare recipes that perform some calculations and value changes). Here are some statistics and my machine specs:
- The train data contains 55 million rows (CSV format).
- Prepare job: took 55m 36s to run.
- Group-by recipe: took 10m 49s to run.
- Training: tried to train on the whole dataset; training failed, the process died (exit code 137, maybe out of memory?).
My main question:
1. What are the best practices when running flows with these kinds of datasets?
2. What can I do to get the training job to complete? Can I configure the Train recipe to do batch training so that it doesn't run out of memory?
Specs of the computer:
CPU: AMD Ryzen 9 3900X, 12 cores / 24 threads
OS: Pop!_OS 19.10 (based on Ubuntu 19.10)
RAM: 32 GB
Graphics: RTX 2070 Super
The speed is more or less what would be expected given your input file. You could get better performance if your input dataset were split into several files rather than a single big file, as DSS would then be able to parallelize locally.
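For example, a quick way to do the split with pandas, reading the big file in chunks so it never sits fully in memory (the file names and chunk size here are just illustrative placeholders, not anything from your setup):

```python
import pandas as pd

def split_csv(path, out_prefix, rows_per_chunk=5_000_000):
    """Split a large CSV into numbered part files.

    Reading with chunksize streams the file, so memory use stays
    bounded by rows_per_chunk regardless of the input size.
    """
    for i, chunk in enumerate(pd.read_csv(path, chunksize=rows_per_chunk)):
        # Each part keeps the header row, so every file is independently usable.
        chunk.to_csv(f"{out_prefix}_part{i:03d}.csv", index=False)

# Example (hypothetical paths):
# split_csv("train.csv", "train_split")
```

You could also use the Unix `split` command, but then you'd have to handle the header row yourself.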
For the machine learning part, the data must be loaded in memory. However, we'd strongly advise you not to train on 55 million records, and instead to take a sample (the visual ML can do that automatically for you). Our experience shows that it's very rare to need more than a few hundred thousand records to train a model; the incremental gains from using more data are most often tiny.
Hi @Clément_Stenac,
Thanks for the response. I went with that approach and reduced the amount of data going into DSS, which helped the running speed a lot without compromising much of the model's quality.
Some fun statistics from when I was running the large datasets:
- Training a Random Forest model on 5 million rows took 2 hrs 30 minutes (I was quite impressed!)