
Shapley computation through Spark

Hello everyone,

When computing Shapley values with a UDF, we run into memory issues or very long processing times.

While trying to redefine the process in a PySpark recipe, we call the method "GBTClassificationModel.load" (imported with "from pyspark.ml.classification import GBTClassificationModel") in custom code.

Unfortunately, the load method takes too long (more than 4 hours) to load the model.

Do you know how to speed up the process, or how to work directly with the model without loading it?

I have attached the code snippet.
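
For readers following along, here is a minimal sketch of why exact Shapley values get expensive, which is one plausible (not confirmed) contributor to the memory and runtime problems described above. It is not the computation used in this project: `exact_shapley` and `toy_value` are hypothetical names, and the toy value function stands in for a real model. The point is only that each feature requires evaluating every coalition of the remaining features, so the work grows as 2^(n-1) per feature.

```python
from itertools import combinations
from math import factorial


def exact_shapley(value_fn, n_features):
    """Exact Shapley values for a set function over n_features players.

    value_fn takes a frozenset of feature indices and returns a float.
    Cost: 2 ** (n_features - 1) coalitions are visited for each feature.
    """
    players = range(n_features)
    phi = [0.0] * n_features
    for i in players:
        others = [p for p in players if p != i]
        for size in range(len(others) + 1):
            # Standard Shapley weight: |S|! * (n - |S| - 1)! / n!
            weight = (
                factorial(size)
                * factorial(n_features - size - 1)
                / factorial(n_features)
            )
            for subset in combinations(others, size):
                s = frozenset(subset)
                # Marginal contribution of feature i to coalition s.
                phi[i] += weight * (value_fn(s | {i}) - value_fn(s))
    return phi


def toy_value(coalition):
    """Hypothetical stand-in for a model: additive terms plus one interaction."""
    score = 0.0
    if 0 in coalition:
        score += 1.0
    if 1 in coalition:
        score += 2.0
    if 0 in coalition and 2 in coalition:
        score += 3.0  # interaction between features 0 and 2
    return score


phi = exact_shapley(toy_value, 3)
print(phi)
```

The contributions sum to the difference between the value of the full feature set and the empty set, which is a quick sanity check on any Shapley implementation. With dozens of features, exact enumeration is infeasible, so practical implementations rely on sampling or model-specific shortcuts.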


Hi, can you please paste the code in a Code Block (look for the </> icon) so it can be copy/pasted? Thanks

from pyspark.sql import SparkSession
from pyspark import SparkContext, SparkConf
from pyspark.ml.classification import GBTClassificationModel
import pyspark.sql.functions as F
from pyspark.sql.types import *
import pandas as pd

import dataiku

import json

from pyspark.sql import SQLContext
import gzip
import dataikuscoring

import os.path
import os
import sys

from dataikuscoring import load_model

os.environ['PYSPARK_PYTHON'] = sys.executable
os.environ['PYSPARK_DRIVER_PYTHON'] = sys.executable


REPO = dataiku.Folder(FOLDER_ID)
list_files = REPO.list_paths_in_partition()
with REPO.get_download_stream(list_files[0]) as stream:
    data =
dataset = dataiku.Dataset("ew_sconfini_scadutiprivati_prepared")
df = dataset.get_dataframe()

gbt = GBTClassificationModel.load("/DKU_TUTORIAL_BASICS_102_1/EnO4P0lZ/privati-ima_01-weights/model/dss_pipeline_model")
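
Since the snippet above both reads the whole dataset into a pandas DataFrame and loads the model, it may help to time each step separately before concluding that the load call alone accounts for the 4 hours. Below is a minimal sketch of one way to do that. The `timed` helper and the `timings` dictionary are hypothetical names, and the `time.sleep` calls are placeholders for the real steps.

```python
import time
from contextlib import contextmanager

# Collected wall-clock durations in seconds, keyed by step label.
timings = {}


@contextmanager
def timed(label):
    """Record how long the enclosed block takes, even if it raises."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[label] = time.perf_counter() - start


# Placeholders: swap each sleep for the real step being measured,
# e.g. dataset.get_dataframe() or GBTClassificationModel.load(...).
with timed("read_dataset"):
    time.sleep(0.01)

with timed("load_model"):
    time.sleep(0.02)

for label, seconds in timings.items():
    print(f"{label}: {seconds:.3f}s")
```

If the dataset read turns out to dominate, the model load is not the step to optimize.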