Persistent Python dictionary in Dataiku folder

pkansal Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Registered Posts: 23 ✭✭✭✭

I have thousands of models in a python dictionary where keys represent customer numbers and values are machine learning models.

I have saved them in a really big dictionary.

However, I need to make hourly predictions on a subset of customers and I have to load the full dictionary in memory every hour before I can make any predictions.

In Python, you can use the shelve module to store a persistent dictionary and only fetch keys that you need.
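
For reference, the shelve pattern described above looks roughly like this. It is a minimal sketch: the customer IDs and the dict "models" are made-up stand-ins for real pickled models, and the shelf lives in a local temporary directory, since shelve needs a local file path rather than an S3 location.

```python
import os
import shelve
import tempfile

# Stand-in "models": any picklable object works; these values are made up.
models = {
    "cust_001": {"coef": 1.5},
    "cust_002": {"coef": -0.3},
    "cust_003": {"coef": 2.0},
}

# shelve needs a local file path; a temp directory is used for illustration.
path = os.path.join(tempfile.mkdtemp(), "models_shelf")

# Write once: each value is pickled and stored under its own string key.
with shelve.open(path) as shelf:
    for customer_id, model in models.items():
        shelf[customer_id] = model

# Hourly job: open read-only and unpickle only the keys that are needed.
wanted = ["cust_002", "cust_003"]
with shelve.open(path, flag="r") as shelf:
    subset = {cid: shelf[cid] for cid in wanted}
```

Only the two requested entries are unpickled; the rest of the shelf stays on disk.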

How can I accomplish something similar in Dataiku? My managed folder uses S3 as backend storage.

Answers

  • Alexandru
    Alexandru Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,209 Dataiker

    Hi @pkansal,

    It sounds like you could use a dataset partitioned by customer number, and then read only the partition for a given customer number, similar to what you are doing with shelve.

    To create a partitioned dataset in S3, you can use redispatch: https://knowledge.dataiku.com/latest/kb/data-prep/partitions/partitioning-redispatch.html

    To retrieve a particular partition, use the actual customer ID as the partition identifier. This avoids bringing the whole dataset into memory:

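    import dataiku

    # Restrict reads to a single partition before loading any data
    my_part_dataset = dataiku.Dataset("mydataset")
    my_part_dataset.add_read_partitions(["actual_customer_id"])
    df = my_part_dataset.get_dataframe()  # loads only the selected partition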

  • pkansal
    pkansal Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Registered Posts: 23 ✭✭✭✭

    I have 4 million customers. I'm not sure that having 4 million partitions is a good idea.

  • Alexandru
    Alexandru Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,209 Dataiker

    4 million partitions would be excessive.

    If you have another dimension (e.g. state, country, or industry) that splits your data into a more reasonable number of parts, the partitioning approach may still work.
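
If no natural dimension such as state or country is available, one possibility is to derive one by hashing the customer number into a fixed number of buckets and partitioning on that bucket. The sketch below is only an illustration of that idea: `N_BUCKETS`, `bucket_for`, and `partitions_for` are hypothetical names, and the bucket count of 256 is an arbitrary assumption, not a Dataiku recommendation.

```python
import zlib

N_BUCKETS = 256  # hypothetical value; pick one that keeps each partition manageable


def bucket_for(customer_id: str, n_buckets: int = N_BUCKETS) -> str:
    """Map a customer ID to a stable partition identifier such as 'b017'."""
    # crc32 gives the same result in every process, unlike the built-in
    # hash() for strings, which is randomized per interpreter run.
    return f"b{zlib.crc32(customer_id.encode('utf-8')) % n_buckets:03d}"


def partitions_for(customer_ids):
    """Return the sorted, de-duplicated partitions covering the given customers."""
    return sorted({bucket_for(cid) for cid in customer_ids})


# Partitions an hourly job would need for a small, made-up subset of customers.
needed = partitions_for(["1000001", "1000002", "1000001"])
```

The same function would have to be used when writing the data (to fill the partitioning column) and when reading it, and each identifier in `needed` could then be passed to `add_read_partitions` as in the snippet above. The trade-off is that each read pulls in every customer in the bucket, not just the one requested.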
