
Listing and Reading all the files in a Managed Folder

Level 2
Hi All,

Can you help me read all the files in an HDFS managed folder that match certain criteria or a pattern, and write those files to a different HDFS managed folder?

I am attaching the problem statement. Please help.

Dataiku_Problem_statment.JPG

Thanks in advance.

4 Replies
Dataiker

Hi SuhasChinku,

One option is to use a Python recipe that lists the files in this HDFS managed folder, filters on the file names with a regex, and copies the matching files to the appropriate output managed folders using the read/write APIs:

import dataiku
import re

# Get handles on the input and output managed folders. Make sure to use the appropriate managed folder IDs.
input_folder = dataiku.Folder("INPUT_MANAGED_FOLDER_ID")
output_folder1 = dataiku.Folder("OUTPUT_MANAGED_FOLDER1_ID")
output_folder2 = dataiku.Folder("OUTPUT_MANAGED_FOLDER2_ID")

# List the files in the input folder. Each path is relative to the folder root and starts with "/".
paths = input_folder.list_paths_in_partition()

# Iterate through the files, check whether they match a regex, and write them to the output managed folders accordingly.
for path in paths:
    # If the path starts with "/File_" or "/file_" followed by digits, copy the file to the first output managed folder. Replace with the appropriate regex as needed.
    if re.match(r"/[Ff]ile_\d+", path):
        with input_folder.get_download_stream(path) as f:
            data = f.read()
        with output_folder1.get_writer(path) as w:
            w.write(data)
    # If the path starts with "/Input_file_" or "/input_file_" followed by digits, copy the file to the second output managed folder. Replace with the appropriate regex as needed.
    if re.match(r"/[Ii]nput_file_\d+", path):
        with input_folder.get_download_stream(path) as f:
            data = f.read()
        with output_folder2.get_writer(path) as w:
            w.write(data)
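
Before running the recipe on real data, it may help to check offline which files each pattern would pick up. The following is only a sketch: the RULES table, the helper names destination_for and plan_copies, the folder labels, and the sample file names are all made up for illustration. It uses nothing but the standard re module, so it runs outside DSS.

```python
import re

# Hypothetical routing table: (compiled pattern, label of the destination folder).
# The labels are plain strings here, not Dataiku folder objects.
RULES = [
    (re.compile(r"/[Ff]ile_\d+"), "output_folder1"),
    (re.compile(r"/[Ii]nput_file_\d+"), "output_folder2"),
]

def destination_for(path):
    """Return the label of the first rule matching the start of path, else None."""
    for pattern, label in RULES:
        if pattern.match(path):
            return label
    return None

def plan_copies(paths):
    """Group paths by destination label; paths matching no rule are left out."""
    plan = {}
    for path in paths:
        label = destination_for(path)
        if label is not None:
            plan.setdefault(label, []).append(path)
    return plan

# Made-up sample paths, written with a leading "/" like the paths in the recipe.
sample_paths = ["/File_1.csv", "/file_22.csv", "/Input_file_3.csv", "/notes.txt"]
print(plan_copies(sample_paths))
# {'output_folder1': ['/File_1.csv', '/file_22.csv'], 'output_folder2': ['/Input_file_3.csv']}
```

Note that re.match only anchors at the start of the string, so "/File_1_backup.csv" would also match the first rule. If the whole name has to match, re.fullmatch with a pattern that covers the extension is one option.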

 

I hope that this helps!

Best,

Andrew

 

Level 2
Author

Hi Andrew @ATsao,

The solution works perfectly! Thank you so much. 🙂

Dataiker

Hi SuhasChinku,

As an alternative, you can use the built-in "Files in Folder" dataset to filter your files with a regex:

Screenshot 2020-07-07 at 09.12.10.png
Screenshot 2020-07-07 at 09.12.49.png
Screenshot 2020-07-07 at 09.12.36.png

This approach would not use a managed folder as output, though. Generally speaking, it is not as flexible as Python code, but it could be useful if you prefer visual recipes over code recipes.
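
As one illustration of the flexibility a code recipe allows: the Python recipe earlier in this thread reads each file fully into memory with f.read(), which can be costly for large files. A possible variation is to copy in fixed-size chunks with shutil.copyfileobj from the standard library. This is a sketch only. The helper name copy_stream is made up, and io.BytesIO objects stand in for the managed folder streams so that the example runs anywhere.

```python
import io
import shutil

def copy_stream(src, dst, chunk_size=1024 * 1024):
    """Copy a readable binary stream to a writable one in fixed-size chunks."""
    shutil.copyfileobj(src, dst, chunk_size)

# In-memory stand-ins for the download stream and the writer (assumption for the demo).
source = io.BytesIO(b"id,value\n1,a\n2,b\n")
target = io.BytesIO()

# A tiny chunk size forces several read/write rounds on this small payload.
copy_stream(source, target, chunk_size=4)
print(target.getvalue() == source.getvalue())  # True
```

In a recipe, the same helper could be called with the objects returned by get_download_stream and get_writer in place of the BytesIO stand-ins, assuming they expose read(size) and write(bytes) like ordinary file objects.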

Level 2
Author

@dima_naboka ,

Thanks for your solution as well. :-) I was not aware of this. I will leverage this idea.