The questions are down at the bottom. A bit of context first.
I'm working with a Python library that by default puts a library-specific model file into the current working directory. That works OK when I'm working in a Jupyter Notebook. But when I save the Jupyter Notebook as a Python recipe and try to run the same code, the working directory is different and the code fails...
So I thought to myself, Hmmmm... That's no good...
Where can I save this model file that can be accessible to both the Jupyter Notebook and the Python Code Recipe?
So I thought well maybe these files could go into the library for the project.
But where is the library stored on my DSS design node? I eventually found the absolute path to the library for this specific node and project. I then re-wrote the code, and yes, I can write the model file to the DSS library from a Jupyter Notebook, then read the same file from the library and use it in a Python recipe.
Great.
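To give a rough idea, the hard-coded version looked something like this (just a sketch: the absolute path and file names below are made up, and the real path is node- and project-specific):

import os

# example only: the project library path differs on every node and project
LIB_PATH = "/data/dataiku/dss_data/config/projects/MYPROJECT/lib/python"
model_file = os.path.join(LIB_PATH, "my_model.bin")

# notebook side: put the library-specific model file into the project library
with open(model_file, "wb") as f:
    f.write(b"model bytes produced by the third-party library")

# recipe side: read the same file back from the same absolute path
with open(model_file, "rb") as f:
    model_bytes = f.read()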
However, what happens when I duplicate the project, move the project to another DSS instance, or hit any of the other cases where the absolute path of the files will be different?
Question: Where can I store this model file so that it is accessible to both the Jupyter Notebook and the Python recipe, and so that it keeps working when the project is duplicated or moved to another DSS instance?
Hi Tom, @tgb417
import os
libpath = os.environ['PYTHONPATH'].split(':')[-1]
But I'm not sure it's bulletproof, in the sense that the library directory might not be the last entry in the PYTHONPATH variable, but it will certainly be in there, with a pattern like
$DSS_DATA_DIR/config/projects/YOURDSSPROJECT/lib/python
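If you want something a bit more robust, here is an untested sketch that searches every PYTHONPATH entry for that pattern instead of relying on its position:

import os

def find_project_lib():
    # return the entry that looks like .../config/projects/<PROJECT>/lib/python
    for entry in os.environ.get('PYTHONPATH', '').split(os.pathsep):
        if 'config/projects' in entry and entry.rstrip('/').endswith('lib/python'):
            return entry
    return None

libpath = find_project_lib()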
I hope this helps a bit!
Ignacio
I hope you are having a wonderful beginning of spring where you are.
I tried your suggestion:
In a DSS Jupyter Notebook cell I ran the following code: import os; libpath = os.environ['PYTHONPATH'][-1]; libpath. That produces a string containing just the letter 'n'.
I'll poke around with this a bit more. Right now it looks like I'm getting the last letter of the path string, not the whole last path entry.
I saw the idea of a Managed folder in some of my exploration.
However, I'm not clear about how to set this up in a way that an API call will allow me to access the path of the managed folder in a consistent manner.
The files that I need to put in this location cannot be used by DSS directly. They are not going to be in a format recognizable as a standard data type. Will that cause DSS to choke?
--Tom
The following more reliably produced the path of the top of the library. I'm on an AWS-based Red Hat Linux environment.
import os
libpath = os.environ.get('PYTHONPATH', '').split(':')[-1]
libpath
However, when I moved to my Macintosh-based DSS instance, which has gone through lots of DSS upgrades, I did not get the top of the library path. I actually ended up getting a subpath of the library; I would have had to use [-2] above to get the correct path location. So, as you suspected, maybe not so reliable a choice, though a bit better than my hard-coded version for sure.
However, I'm still interested in learning more about using the managed folder.
This is an example of using a managed folder to store some files (that are not read by DSS); you can also read from it in another script, notebook, or through the API. The files there do not affect the behaviour of DSS at all; a managed folder is not a dataset. We store configuration files and optional inputs in this kind of construct.
The 'local_folder' is where we read from and write to (it doesn't matter that it is not declared as an output of the recipe):
# only library really needed
import dataiku
# all imports next are relevant to our particular use case and libraries.
import pandas as pd, numpy as np
import os
from dsa.database import DSADatabase
from sqlalchemy import create_engine
# connect to the local folder
local_folder = dataiku.Folder("local_folder")
local_folder_info = local_folder.get_info()
# get the path within the file system of DSS
path = local_folder.get_path()
db = DSADatabase(
    config={
        'alma_archive': {'service_name': 'SERVICE.SCO.CL'}
    },
    workers_num=4,
    # here we provide the path to the folder, and a directory to store the code work
    working_dir=os.path.join(path, "dsa_db_working_dir")
)
I think this is the way you should follow.
Hi Tom,
Yes, sorry about the first code (I corrected it now): I forgot that os.environ['PYTHONPATH'] is a string and not a list.
About the use of a managed folder, you don't have to worry about the format of the files or anything: it's just plain file-system folder access, so you can access files (of any kind) from anywhere within DSS. Give me one moment and I'll post an example.
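For instance, here is a minimal sketch (assuming a managed folder named 'local_folder' stored on the local filesystem, and a made-up file name) of writing and reading an arbitrary file through the folder's path:

import os
import dataiku

folder = dataiku.Folder("local_folder")
path = folder.get_path()  # only valid for folders on the local filesystem

# write a file of any format into the managed folder
with open(os.path.join(path, "my_model.bin"), "wb") as f:
    f.write(b"opaque bytes that DSS never needs to parse")

# read it back from a notebook, a recipe, or any other code running in DSS
with open(os.path.join(path, "my_model.bin"), "rb") as f:
    data = f.read()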
This seems to work well, and it avoids needing a call to find the base directory of the library.
Might this approach also work with other kinds of data repositories like Amazon S3, or in a containerized scenario? It looks like it may be more OS-agnostic as well.
Hi @tgb417,
This approach could also work if your managed folder lives on an external server connected through SFTP, or if it lives in HDFS or S3, but the way you read and write from those is different (check https://doc.dataiku.com/dss/latest/connecting/managed_folders.html#local-vs-non-local).
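For a non-local folder, my understanding from those docs is that you go through streams instead of get_path(); a rough sketch (the folder name is made up, and the method names come from the dataiku Folder API, so double-check them against the docs):

import dataiku

folder = dataiku.Folder("remote_folder")  # e.g. a folder backed by S3

# upload a local file into the non-local managed folder
with open("/tmp/my_model.bin", "rb") as f:
    folder.upload_stream("my_model.bin", f)

# read it back as a stream
stream = folder.get_download_stream("my_model.bin")
data = stream.read()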
I've not tested it in a containerized scenario, so I can't give you a definitive answer, but I'd expect it to work.
Cheers!