Persistent Python dictionary in a Dataiku folder
I have thousands of machine learning models stored in a single, very large Python dictionary, where the keys are customer numbers and the values are the models.
However, I only need to make hourly predictions for a subset of customers, and I currently have to load the full dictionary into memory every hour before I can make any predictions.
In Python, you can use the shelve module to store a persistent dictionary and fetch only the keys you need.
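For illustration, the shelve pattern I have in mind looks roughly like this (the file name, key, and stored value are placeholders):

import shelve

# Write once: each model is stored on disk under its customer number
with shelve.open("customer_models") as db:
    db["12345"] = {"dummy": "model"}  # placeholder for a fitted model object

# Later: fetch only the entries needed, without loading the rest of the dictionary
with shelve.open("customer_models") as db:
    model = db["12345"]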
How can I accomplish something similar in Dataiku? My managed folder uses S3 as its backend storage.
Answers
-
Alexandru (Dataiker)
Hi @pkansal,
It sounds like you could leverage a dataset partitioned by your customer number, and then read only the partition for that customer number, similar to what you are doing with shelve.
To create a partitioned dataset on S3 you can use redispatch partitioning: https://knowledge.dataiku.com/latest/kb/data-prep/partitions/partitioning-redispatch.html
To retrieve a particular partition, use the actual customer ID as the partition identifier; this avoids bringing the whole dataset into memory:
import dataiku

my_part_dataset = dataiku.Dataset("mydataset")
my_part_dataset.add_read_partitions(["actual_customer_id"])
df = my_part_dataset.get_dataframe()  # reads only the selected partition
-
I have 4 million customers. I'm not sure having 4 million partitions is a good idea.
-
Alexandru (Dataiker)
4 million partitions would indeed be excessive.
If you have another dimension (e.g. state/country/industry) that can split your data into a more reasonable number of parts, this approach may still work with partitions.
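As a rough sketch of that idea, assume the serialized models are kept in a dataset named customer_models, partitioned by state, with columns customer_id and model_pkl holding base64-encoded pickled models (all of these names and the storage layout are assumptions, not an existing setup):

import base64
import pickle

import dataiku

# Read only the partition for the customer's state, not the whole dataset
models_ds = dataiku.Dataset("customer_models")
models_ds.add_read_partitions(["CA"])
df = models_ds.get_dataframe()

# Pick out the single customer's model and deserialize it
row = df.loc[df["customer_id"] == "12345"].iloc[0]
model = pickle.loads(base64.b64decode(row["model_pkl"]))

Memory then scales with the size of one state partition rather than with all 4 million customers.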