I have thousands of machine learning models stored in a Python dictionary, where the keys are customer numbers and the values are the models themselves, so the dictionary is really big.
I need to make hourly predictions on a subset of customers, but currently I have to load the full dictionary into memory every hour before I can make any predictions.
In Python, you can use the shelve module to store a persistent dictionary and only fetch keys that you need.
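For example, a rough sketch of the shelve pattern I mean (the file name and keys are just placeholders):

import shelve

# The dict lives on disk; only the keys I access get loaded into memory.
with shelve.open("customer_models") as db:
    db["customer_12345"] = {"weights": [0.1, 0.2]}   # stand-in for a fitted model
    model_for_customer = db["customer_12345"]         # fetch just this one entry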
How can I accomplish something similar in Dataiku? My managed folder uses S3 as the backend storage.
Hi @pkansal ,
It sounds like you could leverage a dataset partitioned by your customer number, and then only read the partition for that customer number, similar to what you are doing with shelve.
To create a partitioned dataset in S3 you can use redispatch: https://knowledge.dataiku.com/latest/kb/data-prep/partitions/partitioning-redispatch.html
To retrieve a particular partition, pass the actual customer ID as the partition to read; this avoids bringing the whole dataset into memory:
import dataiku

# Restrict the read to a single partition before loading any data
my_part_dataset = dataiku.Dataset("mydataset")
my_part_dataset.add_read_partitions(["actual_customer_id"])
df = my_part_dataset.get_dataframe()
I have 4 million customers. Not sure having 4 million partitions is a good idea.
4 million would indeed be excessive as a number of partitions.
If you have another dimension (e.g. state/country/industry) that can split your data into more reasonable parts, this approach may still work with partitions.
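A rough sketch of what that could look like, assuming the dataset has been redispatched by country (dataset, column, and partition names are placeholders):

import dataiku

# Read only the partition for one country, not the whole dataset
models_dataset = dataiku.Dataset("customer_models_partitioned")
models_dataset.add_read_partitions(["US"])
df = models_dataset.get_dataframe()

# Then select the single customer you need from that much smaller chunk
customer_rows = df[df["customer_id"] == "actual_customer_id"]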