Reading subfolder content of managed folder in the code studios (streamlit app env, langchain)
Hello,
I want to use the langchain library to create an application that reads files in subfolders of managed folders. We are creating embeddings and saving them in the "subsubfolder". When loading the embeddings in a notebook it works but not from a code studio.
    import dataiku
    from langchain.embeddings import OpenAIEmbeddings
    from langchain.vectorstores import FAISS

    # saving the embeddings
    handle = dataiku.Folder("my_folder")
    folder_path = handle.get_path()
    saving_directory = folder_path + "/subfolder/subsubfolder/"
When we try to read the embeddings this way in a jupyter notebook, it works well.
    embeddings = OpenAIEmbeddings()
    our_embeddings = FAISS.load_local(saving_directory, embeddings)
But when we call this to create an app in a code studio, we get a FileNotFoundError: [Errno 2] No such file or directory: 'our_path'. We cannot use get_download_stream() here because the library needs access to a directory, not a single file stream.
Does anyone have an idea on how to solve this?
Thank you
Operating system used: macOS/Windows
Answers
Alexandru Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,248 Dataiker
Hi,
Since Code Studio runs in a container, the local path returned by get_path() is not accessible there, so you must interact with the managed folder using get_download_stream/upload_stream.
If a library needs to work with a local folder, you can use something like a temporary directory (tempfile) or the current working directory (cwd): copy the files from the managed folder to it with get_download_stream, and later, if needed, use upload_stream to persist files back to the managed folder.

    import os

    import dataiku
    from langchain.embeddings import OpenAIEmbeddings
    from langchain.vectorstores import FAISS

    # Get the current working directory
    cwd = os.getcwd()

    # Create a subfolder named "embeddings" within the current working directory
    embeddings_dir = os.path.join(cwd, "embeddings")
    os.makedirs(embeddings_dir, exist_ok=True)

    # Copy all files from the managed folder to the "embeddings" subfolder.
    # Note: os.path.basename drops any subfolder structure, so files with the
    # same name in different subfolders would overwrite each other.
    embed_fldr = dataiku.Folder("H5DTgHOr")
    file_names = embed_fldr.list_paths_in_partition()
    for file_name in file_names:
        with embed_fldr.get_download_stream(file_name) as f:
            dst_path = os.path.join(embeddings_dir, os.path.basename(file_name))
            with open(dst_path, 'wb') as dst_file:
                dst_file.write(f.read())

    # List the files in the "embeddings" subfolder for testing
    file_list = os.listdir(embeddings_dir)
    print("Files in the 'embeddings' subfolder:")
    for file in file_list:
        print(file)

    # Load the embeddings
    embeddings = OpenAIEmbeddings()
    our_embeddings = FAISS.load_local(embeddings_dir, embeddings)
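If the embeddings live in a nested path such as "subfolder/subsubfolder", it may be easier to copy only that part of the managed folder and keep its relative layout. The following is a minimal sketch, not an official Dataiku helper. The function names download_prefix and upload_dir are hypothetical. It assumes the folder object exposes list_paths_in_partition(), get_download_stream(path) and upload_stream(path, file), as dataiku.Folder does, and that paths are returned in the form "/subfolder/file".

```python
import os
import posixpath
import tempfile


def download_prefix(folder, prefix, local_root):
    """Copy every managed-folder file under `prefix` into `local_root`,
    keeping relative sub-paths. Returns the local files written.
    (Hypothetical helper, not part of the dataiku API.)"""
    norm = prefix.strip("/")
    norm = "/" + norm + "/" if norm else "/"
    written = []
    for path in folder.list_paths_in_partition():
        path = "/" + path.lstrip("/")  # tolerate a missing leading slash
        if not path.startswith(norm):
            continue
        rel = path[len(norm):]
        dst = os.path.join(local_root, *rel.split("/"))
        os.makedirs(os.path.dirname(dst), exist_ok=True)
        with folder.get_download_stream(path) as src, open(dst, "wb") as out:
            out.write(src.read())
        written.append(dst)
    return written


def upload_dir(folder, local_root, prefix):
    """Upload every file under `local_root` back to the managed folder
    below `prefix`. Returns the managed-folder paths written.
    (Hypothetical helper, not part of the dataiku API.)"""
    base = "/" + prefix.strip("/")
    uploaded = []
    for dirpath, _dirs, files in os.walk(local_root):
        for name in files:
            src = os.path.join(dirpath, name)
            rel = os.path.relpath(src, local_root).replace(os.sep, "/")
            target = posixpath.join(base, rel)
            with open(src, "rb") as f:
                folder.upload_stream(target, f)
            uploaded.append(target)
    return uploaded


# Usage inside a Code Studio (requires the dataiku and langchain packages):
#   import dataiku
#   from langchain.embeddings import OpenAIEmbeddings
#   from langchain.vectorstores import FAISS
#
#   folder = dataiku.Folder("my_folder")
#   local_dir = tempfile.mkdtemp()
#   download_prefix(folder, "subfolder/subsubfolder", local_dir)
#   our_embeddings = FAISS.load_local(local_dir, OpenAIEmbeddings())
```

Both helpers only rely on the stream methods, so they should behave the same in a notebook and in a container. Files in a temporary directory are lost when the container stops, so call upload_dir if anything written locally needs to persist.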
Hope this helps