Save ML Lib pipeline model in pyspark recipe to hdfs managed folder without using local file system

AnnaProba
AnnaProba Registered Posts: 12 ✭✭✭

I can't use a Dataiku Lab feature to train our model for various reasons, and I need to do it in a pyspark recipe (spark submit).

I am training an MLlib GBTRegressor. Once the pipeline model is trained, I would like to save it. I have no access to the local filesystem because of our IT policies. I also don't have access to hdfs via a direct path (hdfs://), so the model has to go to a managed folder created on hdfs.

A document explaining how to save files to managed folders on hdfs (https://knowledge.dataiku.com/latest/code/managed-folders/concept-managed-folders.html) says I must first save the model to the local file system and then use upload_stream to send it to the hdfs managed folder. But as I said above, I have no access to the local filesystem. So, how do I save an MLlib model to a managed folder in hdfs without using the local filesystem as an intermediary?

Comments

  • AnnaProba
    AnnaProba Registered Posts: 12 ✭✭✭

    Here is some example code for my question:

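    from pyspark.ml import Pipeline
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.regression import GBTRegressor

    # Combine the feature columns into one vector column, then fit the regressor on it
    assembler = VectorAssembler(inputCols=feature_columns, outputCol="featuresVec")
    gbt = GBTRegressor(labelCol="label", featuresCol="featuresVec", maxIter=150)

    pipeline = Pipeline(stages=[assembler, gbt])
    my_model = pipeline.fit(df_train_set)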

    I want to save my_model to a managed folder, my_managed_folder, on hdfs.

  • Turribeach
    Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 1,876 Neuron
    edited August 25

    You can try to use a tmp directory as shown in this example:

    https://developer.dataiku.com/latest/concepts-and-examples/managed-folders.html#detailed-examples

    You will have access to this local tmp directory as all *nix processes have access to the local tmp folder.
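A minimal sketch of the temporary-directory approach, assuming the saved model already sits in a directory the driver process can read. `upload_directory` is a hypothetical helper, not part of the Dataiku API; it only relies on the folder object exposing `upload_stream(path, file_object)`, as `dataiku.Folder` does. `FakeFolder` is a stand-in so the sketch can run outside DSS.

```python
import os
import tempfile


def upload_directory(folder, local_dir, prefix):
    """Upload every file under local_dir to folder, keeping relative paths.

    `folder` is any object with upload_stream(path, file_object),
    for example dataiku.Folder("my_managed_folder") inside DSS.
    Returns the sorted list of paths written to the folder.
    """
    uploaded = []
    for root, _dirs, files in os.walk(local_dir):
        for name in files:
            local_path = os.path.join(root, name)
            # Managed folder paths use forward slashes regardless of the OS
            rel = os.path.relpath(local_path, local_dir).replace(os.sep, "/")
            target = prefix.rstrip("/") + "/" + rel
            with open(local_path, "rb") as f:
                folder.upload_stream(target, f)
            uploaded.append(target)
    return sorted(uploaded)


class FakeFolder:
    """Stand-in for dataiku.Folder (assumption: only used to run this sketch outside DSS)."""

    def __init__(self):
        self.files = {}

    def upload_stream(self, path, f):
        self.files[path] = f.read()


def demo():
    """Build a tiny directory shaped like a saved model and upload it."""
    folder = FakeFolder()
    with tempfile.TemporaryDirectory() as tmp:
        os.makedirs(os.path.join(tmp, "metadata"))
        with open(os.path.join(tmp, "metadata", "part-00000"), "w") as f:
            f.write("{}")
        uploaded = upload_directory(folder, tmp, "/my_model")
    return folder, uploaded
```

Inside a recipe the call would look like `upload_directory(dataiku.Folder("my_managed_folder"), local_model_dir, "/my_model")`, where `local_model_dir` is a hypothetical path to the saved model. Whether Spark can write the model to a driver-local directory at all depends on how the job is submitted, so treat this as a starting point rather than a confirmed solution.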

  • AnnaProba
    AnnaProba Registered Posts: 12 ✭✭✭

    Thanks for the hint.

    Your example is for the inverse problem, where a model already exists in a remote managed folder and a user wants to extract it in a recipe. It will be helpful when I want to reuse the model. Right now, however, my problem is the opposite: I want to save the model in a remote managed folder.

    My code runs in a recipe pyspark via spark-submit (not in a notebook), so the ML Lib model is created and exists on a remote cluster. The challenge is to save this model for future use in another pyspark, spark-submit recipe.

    Since my last message, I have done some experiments. I have managed to save the model to hdfs://tmp/my_model with:

    my_model.write().overwrite().save("hdfs://tmp/my_model")

    and I can also load it from this location. How do I move it now to a managed folder on hdfs? I can't keep the model in the tmp folder forever.

    Or how do I save it directly to the remote-managed folder, which I don't know the path to because Dataiku doesn't provide it?
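One possible direction, sketched under assumptions: if the managed folder's location on hdfs can be found (for instance by inspecting what `dataiku.Folder("my_managed_folder").get_info()` returns on your instance), the model could be written straight under that location. `model_uri` is a hypothetical helper and the example root is made up; nothing here confirms that the recipe is allowed to write to that path.

```python
import posixpath


def model_uri(folder_root, model_name):
    """Return the path of model_name inside the folder rooted at folder_root.

    folder_root is the folder's location as seen by Spark, for example
    "/data/managed_folders/PROJECT/abc123" (hypothetical value).
    """
    name = model_name.strip("/")
    if not name:
        raise ValueError("model_name must not be empty")
    # Keep a bare "/" root intact instead of stripping it to an empty string
    root = folder_root.rstrip("/") or "/"
    return posixpath.join(root, name)


# Intended use inside the recipe (assumption: `root` was read from the
# folder's info dict and the Spark job may write there):
#   my_model.write().overwrite().save(model_uri(root, "my_model"))
```

Loading would then use the same path in the other recipe, so both recipes only need to agree on the folder and the model name.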
