Experiment tracking - sharing between projects

Solved!
cooxky
Level 3
I'm saving some metrics in "Experiment Tracking" using mlflow.

I am using the default code from the Dataiku docs, in which you have to point mlflow to a certain managed folder:

 

import dataiku

project = dataiku.api_client().get_default_project()
managed_folder = project.get_managed_folder('A_MANAGED_FOLDER_ID')

with project.setup_mlflow(managed_folder=managed_folder) as mlflow_handle:

    # Note: if you don't call this (i.e. when no experiment is specified), the default one is used
    mlflow_handle.set_experiment("My first experiment")

    with mlflow_handle.start_run(run_name="my_run"):
        # ...your MLflow code...
        mlflow_handle.log_param("a", 1)
        mlflow_handle.log_metric("b", 2)

 

I shared this managed folder across several projects, hoping that this would allow me to log results from different projects into one folder and then view all of them in the same "Experiment Tracking" GUI.

But when I try to log something from a project that shares access to this folder (the folder is visible in the flow), I get: "managed folder does not exist".

 

I checked in a notebook and it's actually possible to run this:

 

managed_folder = project.get_managed_folder('A_MANAGED_FOLDER_ID')

 

But trying to run any operation on this folder produces the error mentioned above.

 

Is there a way to do what I'm trying to do?

Log metrics via mlflow from several projects to one folder and then display them in "Experiment Tracking" in one of the projects?


Operating system used: Amazon Linux

Turribeach

Sharing a folder from project A to project B does not mean the folder exists in project B; project B simply links to the same folder in project A. In other words, what you see in project B's flow is a reference to the folder in project A, not a second folder. Therefore, if you wish to interact with the project A folder from project B, you need to use the correct project handle. So in your notebook in project B you can't use

 

project = dataiku.api_client().get_default_project()

 

as this points to project B. You need to get a handle to project A, where the folder exists:

 

import dataiku
client_handle = dataiku.api_client()
folder_project_handle = client_handle.get_project('Project A')  # pass the project key of project A
managed_folder = folder_project_handle.get_managed_folder('A_MANAGED_FOLDER_ID')
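
Putting this together with the MLflow snippet from the question, a minimal sketch might look like the following. The helper name `log_to_shared_folder` and its parameters are hypothetical. It also assumes that `setup_mlflow` should be called on the handle of the project that owns the folder (project A), so that the runs show up in that project's Experiment Tracking; the thread itself only confirms that the folder must be fetched through project A's handle.

```python
def log_to_shared_folder(client, owner_project_key, folder_id,
                         experiment_name, run_name, params, metrics):
    """Log one MLflow run through the project that owns the managed folder."""
    # Handle to the project where the folder really lives (project A),
    # not the project the code happens to run in.
    owner_project = client.get_project(owner_project_key)
    managed_folder = owner_project.get_managed_folder(folder_id)

    # Assumption: setup_mlflow is called on the owning project so the runs
    # are recorded in that project's Experiment Tracking.
    with owner_project.setup_mlflow(managed_folder=managed_folder) as mlflow_handle:
        mlflow_handle.set_experiment(experiment_name)
        with mlflow_handle.start_run(run_name=run_name):
            for name, value in params.items():
                mlflow_handle.log_param(name, value)
            for name, value in metrics.items():
                mlflow_handle.log_metric(name, value)


# Inside DSS (notebook or recipe), where the `dataiku` package is available:
#   import dataiku
#   log_to_shared_folder(dataiku.api_client(), "PROJECT_A_KEY",
#                        "A_MANAGED_FOLDER_ID", "My first experiment",
#                        "my_run", {"a": 1}, {"b": 2})
```

Passing the client in as an argument keeps the helper identical in every project; only the project key and folder ID of the owning project need to be known.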

 

Personally I would use a different approach, as sharing folders like this is a bit messy. I would create a new folder in each project on the same file system connection and then use a symlink inside each folder to redirect it to a single file system location. That way all the projects can see the same location and write to it without having to use different project handles.

 

 

cooxky
Level 3
Author

Many thanks @Turribeach, works like a charm!

 

Could you please share some more guidelines on how you would do it with a symlink in Dataiku?

I'm assuming it would require some help from platform admin.

Currently my managed folder is on S3, but I could move it to the server file system.

Turribeach

Symlinks won't work on S3, so you would need either local storage or network storage attached to your DSS server. Yes, a platform admin will need to set this up. Let's say you have some local or network storage attached to your DSS server under /mnt/dataiku_shared. You then create a file system connection pointing to /mnt/dataiku_shared. In project A you create a folder named "logs_folder" (say /mnt/dataiku_shared/logs_folder_a) and in project B you create another folder (say /mnt/dataiku_shared/logs_folder_b). Then you create an actual OS-level logs directory at /mnt/dataiku_shared/logs. Finally, you create two symlinks as follows:

/mnt/dataiku_shared/logs_folder_a/logs => /mnt/dataiku_shared/logs

/mnt/dataiku_shared/logs_folder_b/logs => /mnt/dataiku_shared/logs
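
For the admin doing the setup, the two symlinks could be created with a few lines of Python run on the DSS server, as a rough sketch. The function name `link_shared_logs` and its parameters are hypothetical; it assumes the per-project folder directories already exist under the base directory and that the OS user running it can write there.

```python
import os


def link_shared_logs(base_dir, folder_dirs, shared_name="logs"):
    """Create <folder>/<shared_name> symlinks that all point at <base_dir>/<shared_name>."""
    # The single real directory that every project ends up writing to.
    shared_dir = os.path.join(base_dir, shared_name)
    os.makedirs(shared_dir, exist_ok=True)

    links = []
    for folder_dir in folder_dirs:
        # Link location: inside the directory backing each project's folder.
        link_path = os.path.join(base_dir, folder_dir, shared_name)
        # Skip links that already exist so the function can be re-run safely.
        if not os.path.islink(link_path):
            os.symlink(shared_dir, link_path, target_is_directory=True)
        links.append(link_path)
    return links


# Example call on the DSS server (directory names are placeholders):
#   link_shared_logs("/mnt/dataiku_shared",
#                    ["<project A folder dir>", "<project B folder dir>"])
```

The same result can be had with `ln -s` from a shell; the Python version is only shown to keep everything in one language.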

Since you can use the same folder name in different projects (even though the ID changes) you can get a handle to the folder using the folder name like this:

 

handle = dataiku.Folder("folder_name")

 

And then read/write to the "logs" subfolder inside this folder using the same code in both projects. In fact this sort of logging code could be part of a shared global library that all projects can use. 
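
As an illustration of such a shared helper, here is a minimal sketch that appends a line to a file in the "logs" subfolder. The function name `append_log`, the file name and the line format are all hypothetical. It works on a plain filesystem path, which in DSS could be obtained with `dataiku.Folder("folder_name").get_path()`; that call is only available for folders stored on the DSS server's local filesystem, and whether it suits your setup is an assumption to check.

```python
import os
from datetime import datetime, timezone


def append_log(folder_path, message, filename="events.log", subfolder="logs"):
    """Append one timestamped line to <folder_path>/<subfolder>/<filename>."""
    log_dir = os.path.join(folder_path, subfolder)
    # Does nothing when "logs" already exists, including as a symlink
    # to the shared directory.
    os.makedirs(log_dir, exist_ok=True)

    log_file = os.path.join(log_dir, filename)
    timestamp = datetime.now(timezone.utc).isoformat()
    with open(log_file, "a", encoding="utf-8") as f:
        f.write(f"{timestamp}\t{message}\n")
    return log_file


# Inside DSS, the same two lines would work in either project:
#   import dataiku
#   append_log(dataiku.Folder("folder_name").get_path(), "training finished")
```

Because every project resolves the folder by name, the helper needs no project key or folder ID, which is what makes it suitable for a shared library.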

 

 

cooxky
Level 3
Author

Many thanks for the info @Turribeach .
