Save MLlib pipeline model in PySpark recipe to HDFS managed folder without using local file system

AnnaProba
AnnaProba Registered Posts: 12 ✭✭✭

For various reasons, I can't use the Dataiku Lab feature to train our model, so I need to do it in a PySpark recipe (spark-submit).

I am training an MLlib GBTRegressor. Once the pipeline model is trained, I would like to save it. I have no access to the local filesystem (our IT policies), and I also don't have access to HDFS via a direct path (hdfs://), so the model has to go to a managed folder created on HDFS.

The documentation on saving files to managed folders on HDFS (https://knowledge.dataiku.com/latest/code/managed-folders/concept-managed-folders.html) says I must first save the model to the local file system and then use upload_stream to push it to the HDFS managed folder. But as I said above, I have no access to the local filesystem. So how do I save an MLlib model to a managed folder on HDFS without using the local filesystem as an intermediary?

Comments

  • AnnaProba
    AnnaProba Registered Posts: 12 ✭✭✭

    Here is some example code for my question:

    from pyspark.ml import Pipeline
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.regression import GBTRegressor

    assembler = VectorAssembler(inputCols=feature_columns, outputCol="featuresVec")
    gbt = GBTRegressor(labelCol="label", featuresCol="featuresVec", maxIter=150)

    pipeline = Pipeline(stages=[assembler, gbt])
    my_model = pipeline.fit(df_train_set)

    I want to save my_model to a managed folder, my_managed_folder, on HDFS.

  • Turribeach
    Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 1,913 Neuron
    edited August 25

    You can try to use a tmp directory as shown in this example:

    https://developer.dataiku.com/latest/concepts-and-examples/managed-folders.html#detailed-examples

    You will have access to this local tmp directory as all *nix processes have access to the local tmp folder.
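
    If that works in your environment, a rough, untested sketch of the save direction could look like the code below. The names my_managed_folder and my_model are placeholders, and it assumes Spark actually writes the model to the driver's local filesystem (e.g. the recipe runs with a local master); on a multi-node cluster the save tasks would write to the executors' local disks instead. MLlib saves a directory rather than a single file, so the sketch archives it before uploading:

    import os
    import shutil
    import tempfile

    import dataiku

    folder = dataiku.Folder("my_managed_folder")  # placeholder folder name

    # Save the trained pipeline model to a local tmp directory.
    # MLlib writes a directory tree, not a single file.
    tmp_dir = tempfile.mkdtemp()
    local_model_path = os.path.join(tmp_dir, "my_model")
    my_model.write().overwrite().save("file://" + local_model_path)

    # Archive the directory so it can be uploaded as one object
    archive_path = shutil.make_archive(local_model_path, "zip", local_model_path)

    # Stream the archive into the managed folder
    with open(archive_path, "rb") as f:
        folder.upload_stream("/my_model.zip", f)

    shutil.rmtree(tmp_dir)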

  • AnnaProba
    AnnaProba Registered Posts: 12 ✭✭✭

    Thanks for the hint.

    Your example is for the inverse problem, where a model already exists in a remote managed folder and a user wants to extract it in a recipe. It will be helpful when I want to reuse the model. Right now, however, my problem is the opposite: I want to save the model into a remote managed folder.

    My code runs in a PySpark recipe via spark-submit (not in a notebook), so the MLlib model is created and exists on a remote cluster. The challenge is to save this model for future use in another spark-submit PySpark recipe.

    Since my last message, I have done some experiments. I have managed to save the model to hdfs://tmp/my_model

    my_model.write().overwrite().save("hdfs://tmp/my_model")

    and I can also load it from this location. How do I now move it to a managed folder on HDFS? I can't keep the model in the tmp folder forever.

    Or how do I save it directly to the remote managed folder, whose path I don't know because Dataiku doesn't expose it?

  • Turribeach
    Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 1,913 Neuron
    edited September 12

    I am aware the sample was for the inverse case, but it was meant to give you the workaround for your problem. Once you have the model saved in a temp directory, you can use the folder.get_download_stream() and folder.upload_stream() / folder.upload_data() managed folder methods to read and write the model to a Dataiku managed folder. This managed folder can be stored in a cloud bucket or on any connection supported for managed folders.

    But before giving you some code, some context. I can see that the knowledge base article you linked has misguided you. The claim that you can only obtain the path of managed folder items stored in a "local" folder is patently wrong: you can obtain them easily using the correct methods with the correct DSS objects. In reality, the issue is that some methods don't support remote folders. There are basically two sets of Dataiku APIs (see here), and while you are supposed to use the dataiku package for "internal operations" and the dataikuapi package for "external operations", both can be used remotely. Crucially, to read and write files in a managed folder you need to use the "internal" dataiku package, since the dataikuapi.dss.managedfolder.DSSManagedFolder class lacks the methods defined in the dataiku.Folder class to do that (see here).

    Below is a code sample using the internal dataiku package. It connects to a remote DSS server, creates a pickle file and uploads it to a managed folder stored in a GCP bucket in Google Cloud. This code was executed from a Jupyter notebook running outside of DSS, so it would run on any machine, provided you have connectivity to the DSS server and the correct Python packages installed.

    import os
    from datetime import datetime
    import dataiku

    # Point the dataiku package at the remote DSS instance and project
    dataiku.set_remote_dss("http://dss_server:11200/", "your API key")
    os.environ["DKU_CURRENT_PROJECT_KEY"] = "your project key"
    folder = dataiku.Folder("your folder name")
    
    import pickle
    import numpy as np
    
    # Create dummy 10x10 numpy array
    numpy_array = np.ones((10,10))
    print(numpy_array)
    
    # Serialize numpy array into bytes object
    pickle_bytes_data = pickle.dumps(numpy_array)
    
    # Write serialized numpy array to bucket folder
    file_name = '/new_folder/pickle_file.pkl'
    folder.upload_data(file_name, pickle_bytes_data)
    folder.list_paths_in_partition()
    
    # Download pickle file from bucket folder
    with folder.get_download_stream(file_name) as f:
        numpy_array_download = pickle.load(f)
    
    print(numpy_array_download)
    
    type(numpy_array_download)
    

    I am adding a screenshot so you can see how the code executes. You can clearly see that the path of the file is not only visible, but also that you can add any subfolders at your discretion.

    In this case I am not saving the pickle file I download back from the folder anywhere; I just display it. But you can use the trick I linked earlier to save the model into a temporary file.
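
    As a hedged, untested sketch of that trick applied to your MLlib pipeline model: assuming the model was uploaded to the managed folder as a single zip archive (e.g. /my_model.zip, as in the earlier sketch), and with the same caveat that Spark must be able to read the unpacked local path (e.g. when running with a local master), loading it back could look like this:

    import os
    import tempfile
    import zipfile

    import dataiku
    from pyspark.ml import PipelineModel

    folder = dataiku.Folder("my_managed_folder")  # placeholder folder name

    tmp_dir = tempfile.mkdtemp()
    archive_path = os.path.join(tmp_dir, "my_model.zip")

    # Copy the archive from the managed folder into a local temporary file
    # (read in one go; model archives are typically small)
    with folder.get_download_stream("/my_model.zip") as stream, open(archive_path, "wb") as f:
        f.write(stream.read())

    # Unpack it and load the pipeline model from the local path
    model_dir = os.path.join(tmp_dir, "my_model")
    with zipfile.ZipFile(archive_path) as z:
        z.extractall(model_dir)

    my_model = PipelineModel.load("file://" + model_dir)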
