How do I filter the contents of a managed folder and output them to another managed folder?

Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Registered Posts: 1 Dataiker
edited March 21 in Using Dataiku

This is a question I asked myself and solved with a little Python code, so I thought I'd share. I had a folder with several subfolders, each containing one JPEG per page of the original PDF (for context, this folder is the output of the Greyscale recipe from our Text Extraction plugin). I only want to parse data and create RAG pipelines from the first page of every file. First I ran the List Files recipe on my input folder. The dataset it produces and the input folder itself are the two inputs to my Python recipe. Here is the code I used to create that subset:

# -------------------------------------------------------------------------------- NOTEBOOK-CELL: CODE
# -*- coding: utf-8 -*-
import dataiku
import pandas as pd, numpy as np
from dataiku import pandasutils as pdu
import shutil
import os

# -------------------------------------------------------------------------------- NOTEBOOK-CELL: CODE
# Read recipe inputs
input_dataset = dataiku.Dataset("jpegs_local_files") #change this to your List Files output
df = input_dataset.get_dataframe()
input_folder = dataiku.Folder("WT1Fqq9q") #Change to your input folder ID
input_folder_path = input_folder.get_path()
print(input_folder_path)

# -------------------------------------------------------------------------------- NOTEBOOK-CELL: CODE 
#Creating a list of the files I want in my output folder
filtered_files = df[df["path"].str.endswith("1.jpg")]["path"].tolist()
print(filtered_files)
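
A side note on the filter above: `endswith("1.jpg")` also matches names such as `11.jpg` or `21.jpg`, so documents with more than ten pages may contribute extra files. The sketch below is one possible stricter check, assuming the page number sits immediately before the extension (the plugin's actual naming scheme is not confirmed here). The names `is_first_page` and `select_first_pages` are hypothetical helpers, not part of the Dataiku API.

```python
import re

# Assumption: each file name ends with the page number right before the extension,
# e.g. "page_1.jpg" or "001.jpg". Adjust the pattern if your naming differs.
_FIRST_PAGE = re.compile(r"(?<!\d)0*1\.jpe?g$", re.IGNORECASE)


def is_first_page(path):
    """Return True when the trailing page number is exactly 1 (leading zeros allowed)."""
    return _FIRST_PAGE.search(path) is not None


def select_first_pages(paths):
    """Keep only the paths whose trailing page number is 1, preserving order."""
    return [p for p in paths if is_first_page(p)]


# Possible use in the recipe:
# filtered_files = select_first_pages(df["path"].tolist())
```

The negative lookbehind `(?<!\d)` is what rejects `11.jpg`: the final `1` there is preceded by another digit, so it is not treated as page 1.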

# -------------------------------------------------------------------------------- NOTEBOOK-CELL: CODE
output_folder = dataiku.Folder("88GP6J5f")  # Change to your output folder ID
output_folder_path = output_folder.get_path()
print(output_folder_path)

# -------------------------------------------------------------------------------- NOTEBOOK-CELL: CODE
#Loop through the list of files you want and add their paths to your output folder path
#Note this code will need a different method if your output folder is not stored in the managed filesystem
for file_path in filtered_files:
    source_path = os.path.join(input_folder_path, file_path.lstrip("/"))
    destination_path = os.path.join(output_folder_path, os.path.basename(file_path))  # Destination path
    print(f"Checking file: {source_path}")
    if os.path.exists(source_path):  # Ensure the file exists
        shutil.copy(source_path, destination_path)
        print(f"Copied: {file_path} → {destination_path}")
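
If either folder is not on the local filesystem, `get_path()` is not available and the copy has to go through the folder API instead. The following is a minimal sketch of that alternative, assuming the folder objects expose `get_download_stream(path)` and `upload_stream(path, stream)` as `dataiku.Folder` does. The function name `copy_files_between_folders` is a hypothetical helper. Unlike the loop above, it keeps each file's original relative path instead of flattening to the base name, so same-named files in different subfolders do not overwrite each other.

```python
def copy_files_between_folders(src_folder, dst_folder, paths):
    """Stream each file from src_folder to dst_folder and return the paths written.

    Both arguments are expected to behave like dataiku.Folder objects:
    get_download_stream(path) on the source, upload_stream(path, stream) on the target.
    """
    copied = []
    for path in paths:
        # Stream the bytes across; nothing is written to local disk
        with src_folder.get_download_stream(path) as stream:
            dst_folder.upload_stream(path, stream)
        copied.append(path)
    return copied


# Possible use in the recipe (folder IDs are the ones from the post):
# import dataiku
# copy_files_between_folders(
#     dataiku.Folder("WT1Fqq9q"), dataiku.Folder("88GP6J5f"), filtered_files
# )
```

Because this only uses the streaming calls, the same function should work whether the folders live on the local filesystem or on remote storage.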
