Storing Model used by both Python Recipe and Jupyter Notebook

tgb417
Neuron · Posts: 1,601

The questions are down at the bottom. A bit of context first.

I'm working with a Python library that, by default, puts a library-specific model file into the current working directory. That works OK when I'm working in a Jupyter notebook. But when I save the Jupyter notebook as a Python recipe and try to run the same code, the working directory is different and the code fails...

So I thought to myself, Hmmmm... that's no good...

Where can I save this model file so that it is accessible to both the Jupyter notebook and the Python code recipe?

So I thought, well, maybe these files could go into the project's library.

But where is the library stored on my DSS design node? I eventually found the absolute path to the library for this specific node and project. I then rewrote the code, and yes, I can write the model file to the DSS library from a Jupyter notebook, then read the same file from the library and use it in a Python recipe.

Great.

However, what happens when I duplicate the project, or move the project to another DSS instance? (And the list goes on of all the places where the absolute path of the files will be different.)

Question:

  • Is there a call in the Dataiku library that will give me the file path of the top of the library directory?
  • In the scenario above, where a Jupyter notebook and its related Python recipe need to share files that are not code files, what is the best way of setting this up that is most consistent with the Dataiku DSS way of doing things?

Best Answer

  • Ignacio_Toledo
    Neuron · Posts: 415
    edited July 17 Answer ✓

    This is an example of using a managed folder to store files that are not read by DSS; you can also read from it in another script, notebook, or via the API. The files there do not affect the behaviour of DSS at all: a managed folder is not a dataset. We store configuration files and optional inputs in this kind of construct.

    Screenshot 2021-09-08 104223.jpg

    The 'local_folder' is where we read from and write to (it doesn't matter that it is not declared as an output of the recipe):

    # only library really needed
    import dataiku

    # the imports below are specific to our particular use case and libraries
    import os
    import pandas as pd
    import numpy as np
    from dsa.database import DSADatabase
    from sqlalchemy import create_engine

    # connect to the managed folder
    local_folder = dataiku.Folder("local_folder")
    local_folder_info = local_folder.get_info()
    # get the path within the file system of DSS
    path = local_folder.get_path()

    db = DSADatabase(
        config={
            'alma_archive': {'service_name': 'SERVICE.SCO.CL'}
        },
        workers_num=4,
        # here we provide the path to the folder, plus a subdirectory for working files
        working_dir=os.path.join(path, "dsa_db_working_dir")
    )

    I think this is the approach you should follow.
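    To make the model round-trip concrete, here is a minimal sketch (my own, not from the original post) of saving and loading a model file through the folder path. The helper names `save_model` / `load_model` are hypothetical, and `folder_path` would come from `local_folder.get_path()` as above, which assumes a managed folder hosted on the local filesystem:

    ```python
    import os
    import pickle

    def save_model(folder_path, name, model):
        # folder_path would come from dataiku.Folder("local_folder").get_path();
        # this only applies when the managed folder lives on the local filesystem
        os.makedirs(folder_path, exist_ok=True)
        target = os.path.join(folder_path, name)
        with open(target, "wb") as f:
            pickle.dump(model, f)
        return target

    def load_model(folder_path, name):
        # works identically from a notebook and from a Python recipe,
        # since both resolve the folder by name rather than by absolute path
        with open(os.path.join(folder_path, name), "rb") as f:
            return pickle.load(f)
    ```

    Because both the notebook and the recipe resolve the folder by its name ("local_folder"), no absolute path needs to be hard-coded, and the setup survives project duplication or migration.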

Answers

  • Ignacio_Toledo
    Neuron · Posts: 415
    edited July 17

    Hi Tom, @tgb417

    • To get the file path at the top of the library directory, I've used this line of code within a notebook:

    import os
    libpath = os.environ['PYTHONPATH'][-1].split(':')[-1]

    But I'm not sure it is bulletproof, in the sense that the library directory might not be the last entry in the PYTHONPATH variable; but for sure it will be there, with a pattern like

    $DSS_DATA_DIR/config/projects/YOURDSSPROJECT/lib/python
    • What we do to deal with the use case you are describing is to create a managed folder and set the paths to read and write information from there; this will work for both the Python script and the notebook.

    I hope this helps a bit!

    Ignacio

  • tgb417
    Neuron · Posts: 1,601

    @Ignacio_Toledo

    I hope you are having a wonderful beginning of spring where you are.

    I tried your suggestion:

    Finding a Projects Library Path.png

    I'll poke around with this a bit more. Right now it looks like I'm getting just the last letter of the string path, not the whole last path entry.

    I saw the idea of a Managed folder in some of my exploration.

    However, I'm not clear on how to set this up so that an API call will give me the path of the managed folder in a consistent manner.

    The files that I need to put in this location cannot be used by DSS directly; they are not going to be in a format recognizable as a standard data type. Will that cause DSS to choke?

    --Tom

  • tgb417
    Neuron · Posts: 1,601
    edited July 17

    The following more reliably produces the path of the top of the library. I'm in an AWS-based Red Hat Linux environment.

    import os
    libpath = os.environ.get('PYTHONPATH', '').split(':')[-1]
    libpath

    However, when I moved to my Macintosh-based DSS instance, which has gone through lots of DSS upgrades, I did not get the top of the library path; I actually ended up getting a subpath of the library. I would have had to use [-2] above to get the correct path location. So, as you suspected, maybe not so reliable a choice, though a bit better than my hard-coded version for sure.
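    Since the library entry's position in PYTHONPATH varies between installs, a slightly more defensive sketch (my own suggestion, not tested against every DSS layout) would scan all entries for the .../lib/python pattern Ignacio mentioned, rather than trusting the last one:

    ```python
    import os

    def find_project_lib(pythonpath=None):
        """Best-effort: return the PYTHONPATH entry that looks like a DSS
        project library (ending in lib/python), regardless of its position."""
        raw = pythonpath if pythonpath is not None else os.environ.get('PYTHONPATH', '')
        for entry in raw.split(':'):
            if entry.rstrip('/').endswith(os.path.join('lib', 'python')):
                return entry
        # no entry matched the expected pattern
        return None
    ```

    This still assumes the DSS layout pattern holds, but it no longer depends on whether the library is the last, second-to-last, or any other entry.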

    However, I'm still interested in learning more about using the managed folder.

  • Ignacio_Toledo
    Neuron · Posts: 415

    Hi Tom,

    Yes, sorry about the first code snippet (I've corrected it now): I forgot that os.environ['PYTHONPATH'] is a string and not a list.

    About the use of a managed folder, you don't have to worry about the format of the files or anything: it's just plain file-system folder access, so you can access files (of any kind) from anywhere within DSS. Give me a moment and I'll post an example.

  • tgb417
    Neuron · Posts: 1,601

    This seems to work well, and it works around the lack of a call to find the base directory of a project library.

    Might this approach also work with other kinds of data repositories, like Amazon S3, or in a containerized scenario? It looks like it may be more OS-agnostic as well.

  • Ignacio_Toledo
    Neuron · Posts: 415

    Hi @tgb417,

    This approach could also work if your managed folder lives on an external server connected through SFTP, or if it lives in HDFS or S3, but the way you read and write from those is different (check https://doc.dataiku.com/dss/latest/connecting/managed_folders.html#local-vs-non-local).
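    For non-local folders the documentation linked above points to the stream-based API instead of get_path(). A hedged sketch, assuming dataiku.Folder's upload_stream() / get_download_stream() methods and using hypothetical helper names:

    ```python
    import io
    import pickle

    def save_model_stream(folder, name, model):
        # folder is a dataiku.Folder; upload_stream works whether the
        # folder is hosted locally or on a remote connection such as S3
        folder.upload_stream(name, io.BytesIO(pickle.dumps(model)))

    def load_model_stream(folder, name):
        # get_download_stream returns a file-like object usable as a context manager
        with folder.get_download_stream(name) as stream:
            return pickle.loads(stream.read())
    ```

    The trade-off is that you lose the plain filesystem path, so libraries that insist on writing to a local directory would need to write to a temporary directory first and then upload the result.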

    I haven't tested it in a containerized scenario, so I can't give you a definitive answer, but I'd expect it to work.

    Cheers!
