Save MLlib pipeline model in a PySpark recipe to an HDFS managed folder without using the local filesystem
I can't use the Dataiku Lab feature to train our model for various reasons, so I need to do it in a PySpark recipe (spark-submit).
I am training an MLlib GBTRegressor. Once the pipeline model is trained, I would like to save it. I have no access to the local filesystem (our IT policies forbid it). I also don't have access to HDFS via a direct path (hdfs://), so the model has to go to a managed folder created on HDFS.
A document explaining how to save files to managed folders on HDFS (https://knowledge.dataiku.com/latest/code/managed-folders/concept-managed-folders.html) says I must first save the model to the local filesystem and then upload_stream it to the HDFS managed folder. But, as I said above, I have no access to the local filesystem. So, how do I save an MLlib model to a managed folder on HDFS without using the local filesystem as an intermediary?
Comments
-
Here is some example code for my question:
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import GBTRegressor

assembler = VectorAssembler(inputCols=feature_columns, outputCol="featuresVec")
gbt = GBTRegressor(labelCol="label", featuresCol="featuresVec", maxIter=150)
pipeline = Pipeline(stages=[assembler, gbt])
# fit the assembler + GBT pipeline on the training DataFrame
my_model = pipeline.fit(df_train_set)

I want to save my_model to a managed folder my_managed_folder on HDFS.
-
Turribeach
You can try to use a tmp directory as shown in this example:
https://developer.dataiku.com/latest/concepts-and-examples/managed-folders.html#detailed-examples
You will have access to a local tmp directory, since all *nix processes can write to the local tmp folder.
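Something along these lines might work. It is only an untested sketch: I am assuming your managed folder is named my_managed_folder, and that the file:// write actually lands on the machine running your driver (with a distributed cluster the saved parts may end up on the executors, so test this carefully).

import shutil
import tempfile
import dataiku

# Sketch: save the fitted PipelineModel to a local tmp directory,
# zip it, then stream the archive into the HDFS managed folder.
tmp_dir = tempfile.mkdtemp()
local_model_dir = tmp_dir + "/my_model"

# file:// forces Spark to write to the local filesystem instead of
# the cluster's default filesystem
my_model.write().overwrite().save("file://" + local_model_dir)

archive_path = shutil.make_archive(tmp_dir + "/my_model_archive", "zip", local_model_dir)

folder = dataiku.Folder("my_managed_folder")  # placeholder folder name
with open(archive_path, "rb") as f:
    folder.upload_stream("my_model.zip", f)

To reuse the model later you would do the reverse: download the archive into a tmp directory with folder.get_download_stream(), unzip it, and call PipelineModel.load() on it, which is essentially the pattern in the developer guide example linked above.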
-
Thanks for the hint.
Your example is for the inverse problem, where a model already exists in a remote managed folder and a user wants to extract it in a recipe. It will be helpful when I want to reuse the model. Right now, however, my problem is the opposite: I want to save the model into a remote managed folder.
My code runs in a PySpark recipe via spark-submit (not in a notebook), so the MLlib model is created and lives on a remote cluster. The challenge is to save this model for future use in another PySpark spark-submit recipe.
Since my last message, I have done some experiments. I have managed to save the model to hdfs://tmp/my_model with
model.write().overwrite().save("hdfs://tmp/my_model")
and I can also load it back from this location. How do I now move it to a managed folder on HDFS? I can't keep the model in the tmp folder forever.
Or how do I save it directly to the remote managed folder, whose path I don't know because Dataiku doesn't expose it?
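For the record, what I have in mind is something like the sketch below, assuming I can find the managed folder's actual HDFS location somewhere (its settings page, or maybe dataiku.Folder("my_managed_folder").get_info(); I don't know which key would expose it, so the target URI below is a pure placeholder):

import dataiku

folder_info = dataiku.Folder("my_managed_folder").get_info()  # hoping this exposes the folder's location
target_uri = "hdfs:///path/to/my_managed_folder/my_model"     # placeholder HDFS URI for the managed folder

# Option 1: save the fitted pipeline straight to that location
my_model.write().overwrite().save(target_uri)

# Option 2: move the copy I already wrote to hdfs://tmp/my_model,
# using the Hadoop FileSystem API through the JVM gateway
# (only works if both paths live on the same HDFS;
# spark is the SparkSession created in the recipe)
jvm = spark._jvm
hadoop_conf = spark._jsc.hadoopConfiguration()
src = jvm.org.apache.hadoop.fs.Path("hdfs://tmp/my_model")
dst = jvm.org.apache.hadoop.fs.Path(target_uri)
fs = src.getFileSystem(hadoop_conf)
fs.rename(src, dst)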