I have thousands of machine learning models stored in a Python dictionary, where the keys are customer numbers and the values are the models. I have saved the whole thing as one very large dictionary.
However, I only need to make hourly predictions for a subset of customers, yet every hour I have to load the full dictionary into memory before I can make any predictions.
In Python, you can use the shelve module to store a persistent dictionary and only fetch keys that you need.
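A minimal sketch of that shelve pattern (the class and key names here are just placeholders; any picklable model object works):

```python
import shelve

# Hypothetical stand-in for a trained model; any picklable object works.
class DummyModel:
    def predict(self, x):
        return x * 2

# Write each customer's model under its own key; shelve persists the
# mapping to disk instead of holding everything in memory.
with shelve.open("models_db") as db:
    db["customer_42"] = DummyModel()

# Later (e.g. in the hourly job), fetch only the keys you need --
# only that value is unpickled, not the whole store.
with shelve.open("models_db", flag="r") as db:
    model = db["customer_42"]
    print(model.predict(3))  # -> 6
```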
How can I accomplish something similar in Dataiku? My managed folder uses S3 as the backend storage.
Hi @pkansal ,
It sounds like you could leverage a dataset partitioned by your customer number, then read only the partition for that customer number, similar to what you are doing with shelve.
To create a partitioned dataset in S3 you can use redispatch: https://knowledge.dataiku.com/latest/kb/data-prep/partitions/partitioning-redispatch.html
To retrieve a particular partition, use the actual customer ID as the partition identifier; this avoids bringing the whole dataset into memory. For example:

import dataiku

my_part_dataset = dataiku.Dataset("mydataset")
# Read only the partition for this customer instead of the full dataset
my_part_dataset.add_read_partitions("customer_42")
df = my_part_dataset.get_dataframe()
4 million would be excessive as a number of partitions, though.
If you have another dimension (e.g. state, country, or industry) that splits your data into a more reasonable number of parts, this approach may still work with partitions.
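If no natural dimension exists, one workaround (an assumption of mine, not something from this thread) is to hash each customer ID into a fixed number of buckets and partition on the bucket ID instead; the function name and bucket count below are illustrative:

```python
import hashlib

def partition_bucket(customer_id: str, n_buckets: int = 500) -> str:
    """Map a customer ID to one of n_buckets stable partition IDs.

    Uses MD5 rather than the built-in hash() so the bucket is the same
    across processes and Python versions.
    """
    digest = hashlib.md5(customer_id.encode("utf-8")).hexdigest()
    return f"bucket_{int(digest, 16) % n_buckets:03d}"

# The same customer always lands in the same partition, so the hourly
# job only needs to read the buckets its customers fall in.
print(partition_bucket("customer_42"))
```

You would add this bucket as a column before redispatching, then at prediction time compute the bucket for each customer you need and read only those partitions.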