Reading subfolder content of a managed folder in Code Studios (Streamlit app env, LangChain)

raniachk Dataiku DSS Core Designer, Registered Posts: 3
edited July 2024 in Using Dataiku

Hello,

I want to use the langchain library to create an application that reads files in subfolders of a managed folder. We create embeddings and save them in a "subsubfolder". Loading the embeddings works in a notebook, but not from a Code Studio.

import dataiku
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import FAISS

# saving the embeddings
handle = dataiku.Folder("my_folder")
folder_path = handle.get_path()
saving_directory = folder_path + "/subfolder/subsubfolder/"

When we read the embeddings this way in a Jupyter notebook, it works well.

embeddings = OpenAIEmbeddings()
our_embeddings = FAISS.load_local(saving_directory, embeddings)

But when we call this to create an app in a Code Studio, we get a FileNotFoundError: [Errno 2] No such file or directory: 'our_path'. We are not able to use get_download_stream() here, since the library needs access to a directory.

Does anyone have an idea on how to solve this?

Thank you


Operating system used: macOS/Windows

Answers

  • Alexandru
    Alexandru Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,248 Dataiker
    edited July 2024

    Hi,
    Since Code Studio runs in a container, the local path returned by get_path() is not available there, so you must interact with the managed folder using get_download_stream/upload_stream.

    If a library needs to work with a local folder, you can use a temporary directory (tempfile) or a subfolder of the current working directory (cwd): copy the files from the managed folder into it with get_download_stream and, if needed later, use upload_stream to persist files back to the managed folder.

    import os

    import dataiku
    from langchain.embeddings import OpenAIEmbeddings
    from langchain.vectorstores import FAISS
    
    # Get the current working directory
    cwd = os.getcwd()
    
    # Create a subfolder named "embeddings" within the current working directory
    embeddings_dir = os.path.join(cwd, "embeddings")
    os.makedirs(embeddings_dir, exist_ok=True)
    
    # Copy all files from the managed folder to the "embeddings" subfolder
    # (os.path.basename drops any subfolder part of the managed-folder path)
    embed_fldr = dataiku.Folder("H5DTgHOr")
    file_names = embed_fldr.list_paths_in_partition()
    for file_name in file_names:
        with embed_fldr.get_download_stream(file_name) as f:
            dst_path = os.path.join(embeddings_dir, os.path.basename(file_name))
            with open(dst_path, 'wb') as dst_file:
                dst_file.write(f.read())
    
    # List the files in the "embeddings" subfolder for testing
    file_list = os.listdir(embeddings_dir)
    print("Files in the 'embeddings' subfolder:")
    for file in file_list:
        print(file)
    
    # Load the embeddings (load_local needs the embedding model used to build the index)
    embeddings = OpenAIEmbeddings()
    our_embeddings = FAISS.load_local(embeddings_dir, embeddings)
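If the files live in a subfolder, as in the question, a variant that keeps the sub-paths and also pushes files back may help. This is a minimal sketch, not tested in a Code Studio: download_subfolder and upload_directory are hypothetical helper names, and it assumes that list_paths_in_partition() returns paths with a leading "/" and that the folder handle exposes get_download_stream(path) and upload_stream(path, f) as used above.

```python
import os
import shutil


def _normalize_prefix(prefix):
    """Return the prefix as '/a/b/' (or '/' when the prefix is empty)."""
    clean = prefix.strip("/")
    return "/" + clean + "/" if clean else "/"


def download_subfolder(folder, prefix, local_dir):
    """Copy every managed-folder file under `prefix` into `local_dir`.

    `folder` is assumed to behave like a dataiku.Folder handle.
    Sub-paths below `prefix` are recreated under `local_dir`.
    """
    prefix = _normalize_prefix(prefix)
    copied = []
    for path in folder.list_paths_in_partition():
        path = "/" + path.lstrip("/")  # tolerate a missing leading slash
        if not path.startswith(prefix):
            continue
        relative = path[len(prefix):]
        dst = os.path.join(local_dir, *relative.split("/"))
        os.makedirs(os.path.dirname(dst), exist_ok=True)
        # Stream the remote file to disk in chunks
        with folder.get_download_stream(path) as src, open(dst, "wb") as out:
            shutil.copyfileobj(src, out)
        copied.append(dst)
    return copied


def upload_directory(folder, local_dir, prefix):
    """Upload every file under `local_dir` back to the managed folder under `prefix`."""
    prefix = _normalize_prefix(prefix)
    uploaded = []
    for root, _dirs, files in os.walk(local_dir):
        for name in files:
            src = os.path.join(root, name)
            relative = os.path.relpath(src, local_dir).replace(os.sep, "/")
            target = prefix + relative
            with open(src, "rb") as f:
                folder.upload_stream(target, f)
            uploaded.append(target)
    return uploaded
```

With folder = dataiku.Folder("my_folder") and local_dir = tempfile.mkdtemp(), calling download_subfolder(folder, "subfolder/subsubfolder", local_dir) should leave the index files in local_dir, which can then be passed to FAISS.load_local(local_dir, embeddings). Calling upload_directory(folder, local_dir, "subfolder/subsubfolder") afterwards would persist any new or changed files.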
    


    Hope this helps
