persistent python dictionary in dataiku folder

pkansal
Level 3

I have thousands of models stored in a Python dictionary, where the keys are customer numbers and the values are machine learning models. The dictionary is very large.

However, I need to make hourly predictions on a subset of customers, and currently I have to load the full dictionary into memory every hour before I can make any predictions.

In Python, you can use the shelve module to store a persistent dictionary on disk and fetch only the keys that you need.
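
Roughly what I do with shelve today (a minimal sketch; the file name, key and stored value below are just placeholders):

import shelve

# shelve gives a dict-like store backed by a local file;
# only the keys you actually access are loaded into memory.
with shelve.open("customer_models") as store:
    store["customer_123"] = "placeholder_model"   # any picklable object works here
    model = store["customer_123"]                 # fetch just this customer's model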

 

How can I accomplish something similar in Dataiku? My managed folder uses S3 as the backend storage.

3 Replies
AlexT
Dataiker

Hi @pkansal ,

It sounds like you could leverage a dataset partitioned by your customer number, then read only the partition for that customer number, similar to what you are doing with shelve.

To create a partitioned dataset on S3 you can use redispatch partitioning: https://knowledge.dataiku.com/latest/kb/data-prep/partitions/partitioning-redispatch.html

 

To retrieve a particular partition, pass the actual customer ID as the partition identifier; this avoids bringing the whole dataset into memory:

import dataiku

# Open the partitioned dataset and restrict reads to this customer's partition
my_part_dataset = dataiku.Dataset("mydataset")
my_part_dataset.add_read_partitions(["actual_customer_id"])
df = my_part_dataset.get_dataframe()  # loads only the selected partition
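
Note that add_read_partitions() has to be called before the dataset is first read (e.g. before get_dataframe()) for the partition filter to take effect.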

 

pkansal
Level 3
Author

I have 4 million customers. I'm not sure having 4 million partitions is a good idea.

AlexT
Dataiker

4 million partitions would indeed be excessive.

If you have another dimension (e.g. state/country/industry) that can split your data into a more reasonable number of parts, this approach may still work with partitions.
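
As a rough sketch (assuming a hypothetical dataset redispatched by a "state" column, with a "customer_id" column in the data), the hourly job would then read a single partition and filter it down to the customers it needs:

import dataiku

# Hypothetical dataset partitioned by a "state" column via redispatch
models_by_state = dataiku.Dataset("customer_models_by_state")
models_by_state.add_read_partitions(["CA"])   # read only the CA partition
df = models_by_state.get_dataframe()

# Keep only this hour's customers (placeholder IDs)
customers_to_score = ["123", "456"]
subset = df[df["customer_id"].isin(customers_to_score)]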
