Shapley computation through Spark

NunziaCoppola Registered Posts: 3

Hello everyone,

when computing Shapley values with a UDF, we run into memory issues or very long processing times.
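For context, here is a minimal sketch of why exact Shapley values get expensive as the number of features grows. It is not the UDF from our project: `exact_shapley`, `toy_value` and `contributions` are hypothetical names, and the "model" is a toy additive function chosen only so the result is easy to check.

```python
from itertools import combinations
from math import factorial


def exact_shapley(value_fn, n_features):
    """Exact Shapley values for a coalition value function.

    value_fn takes a frozenset of feature indices and returns a number.
    Each feature is evaluated against every subset of the other features,
    so the number of value_fn calls grows like n * 2**(n - 1).
    """
    players = range(n_features)
    phi = [0.0] * n_features
    for i in players:
        others = [j for j in players if j != i]
        for size in range(len(others) + 1):
            # Shapley weight shared by all coalitions of this size.
            weight = (
                factorial(size)
                * factorial(n_features - size - 1)
                / factorial(n_features)
            )
            for subset in combinations(others, size):
                coalition = frozenset(subset)
                # Marginal contribution of feature i to this coalition.
                phi[i] += weight * (value_fn(coalition | {i}) - value_fn(coalition))
    return phi


# Hypothetical toy model: each present feature adds a fixed amount.
contributions = [1.0, 2.0, 3.0]


def toy_value(coalition):
    return sum(contributions[j] for j in coalition)


shapley = exact_shapley(toy_value, 3)
```

For an additive function like this one, each feature's Shapley value is simply its own contribution. With a real model every coalition needs a model evaluation per row, which is one plausible reason a row-wise UDF becomes slow or memory hungry.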

Trying to rework the process as a PySpark recipe, we call the method "GBTClassificationModel.load" (imported from "pyspark.ml.classification") in custom code.

Unfortunately, the load method takes too long (more than 4 hours) to load the model.

Do you know how to speed up the process or work directly with the model without loading it?

I have attached the code snippet.



  • Turribeach
    Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 1,740 Neuron

    Hi, can you please paste the code in a Code Block (look for the </> icon) so it can be copy/pasted? Thanks

  • NunziaCoppola
    NunziaCoppola Registered Posts: 3
    edited July 17
    from pyspark.sql import SparkSession
    from pyspark import SparkContext, SparkConf
    from pyspark.ml.classification import GBTClassificationModel
    import pyspark.sql.functions as F
    from pyspark.sql.types import *
    import pandas as pd
    import dataiku
    import json
    from pyspark.sql import SQLContext
    import gzip
    import dataikuscoring
    import os.path
    import os
    import sys
    from dataikuscoring import load_model
    os.environ['PYSPARK_PYTHON'] = sys.executable
    os.environ['PYSPARK_DRIVER_PYTHON'] = sys.executable
    FOLDER_ID = 'EnO4P0lZ'
    REPO = dataiku.Folder(FOLDER_ID)
    list_files = REPO.list_paths_in_partition()
    with REPO.get_download_stream(list_files[0]) as stream:
        data =
    dataset = dataiku.Dataset("ew_sconfini_scadutiprivati_prepared")
    df = dataset.get_dataframe()
    gbt = GBTClassificationModel.load("/DKU_TUTORIAL_BASICS_102_1/EnO4P0lZ/privati-ima_01-weights/model/dss_pipeline_model")
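One way to check where the time actually goes is to time each step of the snippet above separately. The following is only a sketch: `timed` and `timings` are hypothetical helpers that are not part of Dataiku or PySpark, and the step inside the `with` block is a placeholder.

```python
import time
from contextlib import contextmanager

# Collected wall-clock durations in seconds, keyed by step label.
timings = {}


@contextmanager
def timed(label):
    """Record how long the enclosed block takes, even if it raises."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[label] = time.perf_counter() - start


# Placeholder step; in the recipe this would wrap one real call at a time.
with timed("placeholder step"):
    total = sum(range(1000))

for label, seconds in timings.items():
    print(f"{label}: {seconds:.3f}s")
```

In the recipe, the `dataset.get_dataframe()` call and the `GBTClassificationModel.load(...)` call could each go in their own `with timed(...)` block. Since `get_dataframe()` reads the whole dataset into a pandas DataFrame, it may be worth confirming that the four hours are really spent in the model load rather than in that step.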