Storing Model used by both Python Recipe and Jupyter Notebook
The questions are down at the bottom. A bit of context first.
I'm working with a Python library that by default puts a library-specific model file into the current working directory. That works OK when I'm working in a Jupyter Notebook. But when I save the Jupyter Notebook as a Python Recipe and try to run the same code, the working directory is different and the code fails...
So I thought to myself, Hmmmm... That's no good...
Where can I save this model file that can be accessible to both the Jupyter Notebook and the Python Code Recipe?
So I thought well maybe these files could go into the library for the project.
But where is the library stored on my DSS design node? I eventually found the absolute path to the library for this specific node and project. I then re-wrote the code and yes I can write the model file to the DSS Library from a Jupyter Notebook and read the same files from the library and use it in a Python Recipe.
Great.
However, what happens when I duplicate the project, or move the project to another DSS instance? (And so on, through the whole list of places where the absolute path of the files will be different.)
Question:
- Is there a call in the Dataiku library that will give me the file path at the top of the library directory?
- In the scenario above, where a Jupyter Notebook and its related Python recipe need to share files that are not code files, what is the best way of setting this up that is most consistent with the Dataiku DSS way of doing things?
Best Answer
-
Ignacio_Toledo Dataiku DSS Core Designer, Dataiku DSS Core Concepts, Neuron 2020, Neuron, Registered, Dataiku Frontrunner Awards 2021 Finalist, Neuron 2021, Neuron 2022, Frontrunner 2022 Finalist, Frontrunner 2022 Winner, Dataiku Frontrunner Awards 2021 Participant, Frontrunner 2022 Participant, Neuron 2023 Posts: 415 Neuron
This is an example of using a managed folder to store some files that are not read by DSS itself; you can also read from it with another script, a notebook, or the API. The files there do not affect the behaviour of DSS at all: a managed folder is not a dataset. We store configuration files and optional inputs in this kind of construct.
The 'local_folder' is where we read and write to (it doesn't matter if it is not declared as output in the recipe):
# only library really needed
import dataiku
# all imports next are relevant to our particular use case and libraries.
import pandas as pd, numpy as np
import os
from dsa.database import DSADatabase
from sqlalchemy import create_engine

# connect to the local folder
local_folder = dataiku.Folder("local_folder")
local_folder_info = local_folder.get_info()
# get the path within the file system of DSS
path = local_folder.get_path()

db = DSADatabase(
    config = {
        'alma_archive': {'service_name': 'SERVICE.SCO.CL'}
    },
    workers_num = 4,
    # here we provide the path to the folder, and a directory to store the code work
    working_dir = os.path.join(path, "dsa_db_working_dir")
)
I think this is the way you should follow.
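For the original use case (one model file written from a notebook and read back in a recipe), a minimal sketch could look like the following. The helper names save_model and load_model and the file name model.pkl are hypothetical, and pickle is only a stand-in for whatever format the library writes. In DSS the folder path would come from dataiku.Folder("local_folder").get_path(), which only applies to folders on the local filesystem; a temporary directory is used here so the snippet runs on its own.

```python
import os
import pickle
import tempfile


def save_model(folder_path, model, filename="model.pkl"):
    """Pickle `model` into the folder and return the full file path."""
    os.makedirs(folder_path, exist_ok=True)
    target = os.path.join(folder_path, filename)
    with open(target, "wb") as fh:
        pickle.dump(model, fh)
    return target


def load_model(folder_path, filename="model.pkl"):
    """Load the pickled model back from the same folder."""
    with open(os.path.join(folder_path, filename), "rb") as fh:
        return pickle.load(fh)


# Assumption: inside DSS this would be
#   folder_path = dataiku.Folder("local_folder").get_path()
# A temporary directory stands in for the managed folder here.
folder_path = tempfile.mkdtemp()

# "Notebook" side: write the model file.
saved_to = save_model(folder_path, {"weights": [0.1, 0.2]})

# "Recipe" side: read the same file, independent of the working directory.
restored = load_model(folder_path)
```

Because both sides resolve the location through the folder rather than the current working directory, the same code should behave the same in the notebook and in the recipe.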
Answers
-
Ignacio_Toledo
Hi Tom, @tgb417
- To get the file path at the top of the library directory I've used this line of code within a notebook:
import os
libpath = os.environ['PYTHONPATH'].split(':')[-1]
But I'm not sure it is bulletproof, in the sense that the library directory might not be the last entry in the PYTHONPATH variable. For sure it will be there, though, with a pattern like
$DSS_DATA_DIR/config/projects/YOURDSSPROJECT/lib/python
- What we do to deal with the use case you are describing is to create a managed folder and set the paths to write and read information from there. This will work for both the Python script and the notebook.
I hope this helps a bit!
Ignacio
-
tgb417 Dataiku DSS Core Designer, Dataiku DSS & SQL, Dataiku DSS ML Practitioner, Dataiku DSS Core Concepts, Neuron 2020, Neuron, Registered, Dataiku Frontrunner Awards 2021 Finalist, Neuron 2021, Neuron 2022, Frontrunner 2022 Finalist, Frontrunner 2022 Winner, Dataiku Frontrunner Awards 2021 Participant, Frontrunner 2022 Participant, Neuron 2023 Posts: 1,598 Neuron
I hope you are having a wonderful beginning of spring where you are.
I tried your suggestion:
I'll poke around with this a bit more. Right now it looks like I'm getting only the last letter of the path string, not the whole last path entry.
I saw the idea of a Managed folder in some of my exploration.
However, I'm not clear about how to set this up in a way that an API call will allow me to access the path of the managed folder in a consistent manner.
The files that I need to put in this location can not be used by DSS directly. They are not going to be in a format recognizable as a standard data type. Will that cause DSS to choke?
--Tom
-
tgb417
The following more reliably produced the path of the top of the library. I'm on an AWS-based Redhat Linux environment.
import os
libpath = os.environ.get('PYTHONPATH', '').split(':')[-1]
libpath
However, when I moved to my Macintosh-based DSS, which has gone through lots of DSS upgrades, I did not get the top of the library path. I actually ended up with a subpath of the library, and would have had to use [-2] above to get the correct location. So, as you suspected, maybe not so reliable a choice. Still, a bit better than my hard-coded version for sure.
However, I'm still interested in learning more about using the managed folder.
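On the PYTHONPATH lookup: rather than relying on a fixed index such as [-1] or [-2], one option is to search the entries for the config/projects/YOURDSSPROJECT/lib/python pattern mentioned earlier in the thread. This is only a sketch; the function name find_project_lib and the example paths are made up for illustration, and it assumes the pattern holds on the instance in question.

```python
def find_project_lib(pythonpath, project_key=None):
    """Return the first PYTHONPATH entry ending in
    config/projects/<PROJECT_KEY>/lib/python, or None if there is none."""
    for entry in pythonpath.split(":"):
        parts = entry.rstrip("/").split("/")
        looks_like_lib = (
            len(parts) >= 5
            and parts[-5:-3] == ["config", "projects"]
            and parts[-2:] == ["lib", "python"]
        )
        if looks_like_lib and (project_key is None or parts[-3] == project_key):
            return entry
    return None


# Hypothetical PYTHONPATH where the last entry is a subpath of the library,
# similar to the situation described on the Mac install.
example = ":".join([
    "/opt/dss/python",
    "/data/dss/config/projects/MYPROJ/lib/python",
    "/data/dss/config/projects/MYPROJ/lib/python/sub",
])

libpath = find_project_lib(example)
missing = find_project_lib("/opt/dss/python:/usr/lib/python3")
```

In a notebook the real value would be passed in as os.environ.get('PYTHONPATH', ''). A None result is a signal to fall back to something else instead of silently using the wrong directory.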
-
Ignacio_Toledo
Hi Tom,
Yes, sorry about the first code (I corrected it now): I forgot that the value returned from os.environ was a string and not a list.
About the use of a managed folder, you don't have to worry about the format of the files or anything: it is just plain file system folder access, so you can access files (of any kind) from anywhere within DSS. Give me one moment and I'll post an example.
-
tgb417
This seems to work well, and it avoids the lack of a call to find the base directory of a library.
Might this approach also work with other kinds of data repositories like Amazon S3, or in a containerized scenario? It looks like it may be more OS agnostic as well.
-
Ignacio_Toledo
Hi @tgb417,
This approach could also work if your managed folder lives on an external server connected through SFTP, or if it lives in HDFS or S3, but the way that you read and write from those is different (check https://doc.dataiku.com/dss/latest/connecting/managed_folders.html#local-vs-non-local).
I've not tested it in a containerized scenario, so I can't give you a correct answer, but I'd expect it would work.
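For the non-local case, the difference is that you go through the folder object's stream methods instead of a filesystem path. The sketch below assumes the upload_stream and get_download_stream methods of dataiku.Folder; the helpers save_to_folder and load_from_folder and the InMemoryFolder class are hypothetical, and the latter is only a stand-in so the snippet runs outside DSS.

```python
import io
import pickle


def save_to_folder(folder, name, obj):
    """Serialize `obj` and upload it to the folder under `name`."""
    folder.upload_stream(name, io.BytesIO(pickle.dumps(obj)))


def load_from_folder(folder, name):
    """Download `name` from the folder and deserialize it."""
    with folder.get_download_stream(name) as stream:
        return pickle.loads(stream.read())


class InMemoryFolder:
    """Hypothetical stand-in exposing the two dataiku.Folder methods used
    above, so the example can run without a DSS instance."""

    def __init__(self):
        self._files = {}

    def upload_stream(self, path, f):
        self._files[path] = f.read()

    def get_download_stream(self, path):
        return io.BytesIO(self._files[path])


# Assumption: inside DSS this would be folder = dataiku.Folder("local_folder")
folder = InMemoryFolder()
save_to_folder(folder, "model.pkl", {"k": 3})
loaded = load_from_folder(folder, "model.pkl")
```

Since nothing here touches get_path(), the same helpers should apply whether the folder is backed by the local filesystem, S3, or another connection.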
Cheers!