Persistent Python dictionary in a Dataiku folder
I have thousands of machine learning models stored in a single, very large Python dictionary, where the keys are customer numbers and the values are the models.
However, I only need to make hourly predictions for a subset of customers, and I currently have to load the full dictionary into memory every hour before I can make any predictions.
In Python, you can use the shelve module to store a persistent dictionary and fetch only the keys you need.
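For illustration, the shelve pattern I have in mind looks roughly like this (the file name, key, and stored value are placeholders):

import shelve

# Write once: each model is stored on disk under its customer number
with shelve.open("customer_models") as db:
    db["12345"] = {"dummy": "model"}  # placeholder for a fitted model object

# Later: fetch only the entries needed, without loading the rest of the dictionary
with shelve.open("customer_models") as db:
    model = db["12345"]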
How can I accomplish something similar in Dataiku? My managed folder uses S3 as its backend storage.
Answers
-
Alexandru (Dataiker)
Hi @pkansal,
It sounds like you could leverage a dataset partitioned by your customer number, and then read only the partition for that customer number, similar to what you are doing with shelve.
To create a partitioned dataset on S3 you can use redispatch partitioning: https://knowledge.dataiku.com/latest/kb/data-prep/partitions/partitioning-redispatch.html
To retrieve a particular partition, use the actual customer ID as the partition identifier; this avoids bringing the whole dataset into memory:
import dataiku

my_part_dataset = dataiku.Dataset("mydataset")
my_part_dataset.add_read_partitions(["actual_customer_id"])
df = my_part_dataset.get_dataframe()  # reads only the selected partition
-
I have 4 million customers. I'm not sure having 4 million partitions is a good idea.
-
Alexandru (Dataiker)
4 million partitions would indeed be excessive.
If you have another dimension (e.g. state/country/industry) that can split your data into a more reasonable number of parts, this approach may still work with partitions.
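As a rough sketch of that idea, assume the serialized models are kept in a dataset named customer_models, partitioned by state, with columns customer_id and model_pkl holding base64-encoded pickled models (all of these names and the storage layout are assumptions, not an existing setup):

import base64
import pickle

import dataiku

# Read only the partition for the customer's state, not the whole dataset
models_ds = dataiku.Dataset("customer_models")
models_ds.add_read_partitions(["CA"])
df = models_ds.get_dataframe()

# Pick out the single customer's model and deserialize it
row = df.loc[df["customer_id"] == "12345"].iloc[0]
model = pickle.loads(base64.b64decode(row["model_pkl"]))

Memory then scales with the size of one state partition rather than with all 4 million customers.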