Reading subfolder content of managed folder in the code studios (streamlit app env, langchain)

raniachk

Hello, 

I want to use the langchain library to create an application that reads files in subfolders of managed folders. We create embeddings and save them in a nested subfolder ("subfolder/subsubfolder"). Loading the embeddings works in a notebook, but not from a code studio.

import dataiku
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import FAISS

# saving the embeddings
handle = dataiku.Folder("my_folder")
folder_path = handle.get_path()
saving_directory = folder_path + "/subfolder/subsubfolder/"

 

 

When we try to read the embeddings this way in a Jupyter notebook, it works well:

embeddings = OpenAIEmbeddings()
our_embeddings = FAISS.load_local(saving_directory, embeddings)

 

 

But when we call this to create an app in a code studio, we get FileNotFoundError: [Errno 2] No such file or directory: 'our_path'. We cannot simply use get_download_stream() here, since the library needs access to a directory rather than a single file stream.

Does anyone have an idea on how to solve this?

Thank you


Operating system used: macOS/Windows

AlexT
Dataiker

Hi,
Since Code Studio runs in a container, you must interact with the managed folder using get_download_stream/upload_stream.

If a library needs to work with a local folder, you can use something like tempfile/tempdir or the current working directory (cwd): copy the files from the managed folder to that local directory using get_download_stream, and later, if needed, use upload_stream to persist files back to the managed folder.

import os

import dataiku
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import FAISS

# Get the current working directory
cwd = os.getcwd()

# Create a subfolder named "embeddings" within the current working directory
embeddings_dir = os.path.join(cwd, "embeddings")
os.makedirs(embeddings_dir, exist_ok=True)

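# Copy all files from the managed folder to the "embeddings" subfolder.
# os.path.basename drops the subfolder part of each path, so the files end up
# flat in embeddings_dir (same-named files from different subfolders would overwrite each other).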
embed_fldr = dataiku.Folder("H5DTgHOr")
file_names = embed_fldr.list_paths_in_partition()
for file_name in file_names:
    with embed_fldr.get_download_stream(file_name) as f:
        dst_path = os.path.join(embeddings_dir, os.path.basename(file_name))
        with open(dst_path, 'wb') as dst_file:
            dst_file.write(f.read())

# List the files in the "embeddings" subfolder for testing
file_list = os.listdir(embeddings_dir)
print("Files in the 'embeddings' subfolder:")
for file in file_list:
    print(file)

# Load the embeddings (the embedding model has to be created first)
embeddings = OpenAIEmbeddings()
our_embeddings = FAISS.load_local(embeddings_dir, embeddings)
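
If the index lives in a nested subfolder, as in the original question, a variant that copies only that subfolder and keeps the relative paths might look like the sketch below. The helper names download_subfolder and upload_directory are hypothetical (they are not part of the dataiku API), and the code assumes that list_paths_in_partition() returns paths relative to the folder root, with or without a leading "/". The second helper illustrates the upload_stream step for persisting local files back to the managed folder.

```python
import os
import shutil


def download_subfolder(folder, prefix, local_dir):
    """Copy files stored under `prefix` in a managed folder into `local_dir`.

    `folder` is expected to behave like dataiku.Folder (hypothetical helper).
    Relative paths below `prefix` are preserved locally.
    """
    prefix = prefix.strip("/")
    prefix = prefix + "/" if prefix else ""
    copied = []
    for path in folder.list_paths_in_partition():
        rel = path.lstrip("/")
        if not rel.startswith(prefix):
            continue  # file is outside the requested subfolder
        dst = os.path.join(local_dir, *rel[len(prefix):].split("/"))
        os.makedirs(os.path.dirname(dst), exist_ok=True)
        with folder.get_download_stream(path) as src, open(dst, "wb") as out:
            shutil.copyfileobj(src, out)
        copied.append(dst)
    return copied


def upload_directory(folder, local_dir, prefix):
    """Upload every file under `local_dir` to the managed folder under `prefix`."""
    prefix = prefix.strip("/")
    for root, _dirs, files in os.walk(local_dir):
        for name in files:
            src = os.path.join(root, name)
            rel = os.path.relpath(src, local_dir).replace(os.sep, "/")
            target = prefix + "/" + rel if prefix else rel
            with open(src, "rb") as f:
                folder.upload_stream(target, f)


# Usage inside a Code Studio (commented out: needs dataiku and langchain):
# import tempfile
# import dataiku
# from langchain.embeddings import OpenAIEmbeddings
# from langchain.vectorstores import FAISS
#
# folder = dataiku.Folder("my_folder")
# local_dir = tempfile.mkdtemp(prefix="embeddings_")
# download_subfolder(folder, "subfolder/subsubfolder", local_dir)
# our_embeddings = FAISS.load_local(local_dir, OpenAIEmbeddings())
```

Depending on the installed LangChain version, FAISS.load_local may also require allow_dangerous_deserialization=True, since the index metadata is stored as a pickle file.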


Hope this helps