Listing and Reading all the files in a Managed Folder

SuhasChinku Registered Posts: 5 ✭✭✭✭

Hi All,

Can you help me read all the files in an HDFS managed folder that match certain criteria or a pattern, and write those files to a different HDFS managed folder?

Attaching the problem statement. Please help.

Dataiku_Problem_statment.JPG

 Thanks in advance.

Answers

  • ATsao
    ATsao Dataiker Alumni, Registered Posts: 139 ✭✭✭✭✭✭✭✭
    edited July 17

    Hi SuhasChinku,

    One option would be to use a Python recipe to list the files in the input HDFS managed folder, filter on the file names (using regex), and then copy the matching files to the appropriate output managed folders using the read/write APIs.

    import dataiku
    import re
    
    # Read inputs and managed folders. Make sure to use the appropriate managed folder IDs. 
    input_folder = dataiku.Folder("INPUT_MANAGED_FOLDER_ID")
    paths = input_folder.list_paths_in_partition()
    output_folder1 = dataiku.Folder("OUTPUT_MANAGED_FOLDER1_ID")
    output_folder2 = dataiku.Folder("OUTPUT_MANAGED_FOLDER2_ID")
    
    # Iterate through the files, check whether each path matches a regex, and write it to the corresponding output managed folder.
    for path in paths:
        # Check if the file name starts with "/File_" or "/file_" and, if so, copy the file to the first output managed folder. Replace with the appropriate regex as needed.
        if re.match(r"/[Ff]ile_\d+", path):
            with input_folder.get_download_stream(path) as f:
                data = f.read()
            with output_folder1.get_writer(path) as w:
                w.write(data)
        # Check if the file name starts with "/Input_file_" or "/input_file_" and, if so, copy the file to the second output managed folder. Replace with the appropriate regex as needed.
        if re.match(r"/[Ii]nput_file_\d+", path):
            with input_folder.get_download_stream(path) as f:
                data = f.read()
            with output_folder2.get_writer(path) as w:
                w.write(data)

    I hope that this helps!

    Best,

    Andrew
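
The same routing logic can be kept reusable by putting the patterns in a list of (regex, output folder) rules. This is a minimal sketch, assuming the folder objects expose the three methods used in the recipe above (`list_paths_in_partition`, `get_download_stream`, `get_writer`). The names `route_files` and `InMemoryFolder` are hypothetical, and `InMemoryFolder` is only a stand-in so the example runs outside DSS.

```python
import io
import re


class _Writer:
    """Hypothetical writer that stores its bytes in the folder on exit."""

    def __init__(self, folder, path):
        self.folder, self.path, self.buffer = folder, path, io.BytesIO()

    def write(self, data):
        self.buffer.write(data)

    def __enter__(self):
        return self

    def __exit__(self, *exc):
        self.folder.files[self.path] = self.buffer.getvalue()


class InMemoryFolder:
    """Hypothetical stand-in for dataiku.Folder (assumption: only these methods are needed)."""

    def __init__(self, files=None):
        self.files = dict(files or {})

    def list_paths_in_partition(self):
        return sorted(self.files)

    def get_download_stream(self, path):
        return io.BytesIO(self.files[path])

    def get_writer(self, path):
        return _Writer(self, path)


def route_files(input_folder, rules):
    """Copy each input file to the folder of the first rule whose regex matches its path."""
    copied = []
    for path in input_folder.list_paths_in_partition():
        for pattern, output_folder in rules:
            if re.match(pattern, path):
                with input_folder.get_download_stream(path) as stream:
                    data = stream.read()
                with output_folder.get_writer(path) as writer:
                    writer.write(data)
                copied.append(path)
                break  # stop at the first matching rule
    return copied


source = InMemoryFolder({"/File_1.csv": b"a", "/input_file_2.csv": b"b", "/other.txt": b"c"})
out1, out2 = InMemoryFolder(), InMemoryFolder()
copied = route_files(source, [(r"/[Ff]ile_\d+", out1), (r"/[Ii]nput_file_\d+", out2)])
```

In a recipe, the `InMemoryFolder` objects would be replaced by `dataiku.Folder(...)` objects. Note that `[F|f]` in a character class also matches a literal `|`, which is why the rules here use `[Ff]`.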

  • SuhasChinku
    SuhasChinku Registered Posts: 5 ✭✭✭✭

    Hi Andrew @ATsao,

    The solution works perfectly! Thank you so much.

  • dima_naboka
    dima_naboka Dataiker, Dataiku DSS Core Designer, Dataiku DSS & SQL, Dataiku DSS Core Concepts Posts: 28 Dataiker

    Hi SuhasChinku,

    As an alternative, you can use the built-in "Files in Folder" dataset to filter your files with a regex.

    Screenshot 2020-07-07 at 09.12.10.png, Screenshot 2020-07-07 at 09.12.49.png, Screenshot 2020-07-07 at 09.12.36.png

    This approach would not produce a managed folder as output, though. Generally speaking, it is not as flexible as Python code, but it can be useful if you prefer visual recipes over code recipes.

  • SuhasChinku
    SuhasChinku Registered Posts: 5 ✭✭✭✭

    @dima_naboka,

    Thanks for your solution as well. :-) I was not aware of this. I will make use of this idea.

  • jaalija
    jaalija Registered Posts: 1 ✭✭✭✭

    Hi @ATsao,

    I have a similar problem, but I want to zip a file to another folder instead of just copying it. Do you know how I can do this?

    I was trying this to find the zip file for a specific month, taken from a defined variable:

    x=0
    for paths[x] in paths:
        if fnmatch.fnmatch(paths[x], '*' + (dataiku.get_custom_variables()["v_mth"]) + '.zip'):
            with input_folder.get_download_stream(paths[x]) as f:
                with zipfile.ZipFile(f.read(), "r") as zip_ref:
                    zip_ref.extractall(output_folder)
    x +=1

    but I get the following error:

    Job failed: Error in Python process: At line 48: <class 'AttributeError'>: 'bytes' object has no attribute 'seek'

    Thank you very much!
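
The error above is consistent with passing raw bytes to `zipfile.ZipFile`, which expects a path or a seekable file-like object. A minimal sketch of one way around it is to wrap the bytes in `io.BytesIO` and read each member explicitly. The helper name `extract_zip_bytes` is hypothetical, and the archive built here is only demo data.

```python
import io
import zipfile


def extract_zip_bytes(zip_bytes):
    """Return {member_name: content} for every file inside a zip archive given as bytes."""
    extracted = {}
    # ZipFile needs a path or a seekable file-like object, so wrap the raw bytes in BytesIO.
    with zipfile.ZipFile(io.BytesIO(zip_bytes), "r") as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue  # skip directory entries
            extracted[info.filename] = archive.read(info)
    return extracted


# Build a small archive in memory to exercise the helper (demo data only).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("2020_07/report.csv", "a,b\n1,2\n")

members = extract_zip_bytes(buffer.getvalue())
print(sorted(members))
```

In a recipe, `zip_bytes` would presumably come from `f.read()` on the download stream, and each extracted member could then be written with `output_folder.get_writer(...)` as in the copy example earlier in the thread. This assumes `output_folder` is a `dataiku.Folder` object, whereas `extractall` expects a local file-system path.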

  • akshay
    akshay Partner, Registered Posts: 7 Partner

    Hi @ATsao

    I have created a Python API endpoint and, within it, I am trying to access a folder created on the local file system.
    I am getting the following error:
    dataiku_error_VD.JPG

    import dataiku
    folder = dataiku.Folder('xyz')

    path = folder.list_paths_in_partition()

    I am unable to get the list of files present in the folder.

  • ATsao
    ATsao Dataiker Alumni, Registered Posts: 139 ✭✭✭✭✭✭✭✭

    Hi,

    Please note that endpoints are meant to be deployed to API nodes, so you should generally not use internal dataiku APIs from your endpoint, as endpoints are independent of your design node. If you wish to use a local managed folder, please follow the steps described here to "package" your managed folder so that it can be referenced from your Python API endpoint:

    https://doc.dataiku.com/dss/latest/apinode/endpoint-python-function.html#using-managed-folders

    Thanks,

    Andrew
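
If the packaged folder reaches the endpoint as a local directory, plain file-system calls can stand in for `list_paths_in_partition()`. This is a minimal sketch, assuming the endpoint code already has that directory path (how the path is provided is described in the linked documentation and is not shown here). `list_local_paths` is a hypothetical helper, and the temporary directory is only demo data.

```python
import os
import tempfile


def list_local_paths(folder_path):
    """List files under folder_path as '/'-prefixed relative paths, sorted."""
    paths = []
    for root, _dirs, files in os.walk(folder_path):
        for name in files:
            relative = os.path.relpath(os.path.join(root, name), folder_path)
            paths.append("/" + relative.replace(os.sep, "/"))
    return sorted(paths)


# Demo with a temporary directory standing in for the packaged folder.
with tempfile.TemporaryDirectory() as demo_dir:
    os.makedirs(os.path.join(demo_dir, "sub"))
    for rel in ("a.csv", os.path.join("sub", "b.csv")):
        with open(os.path.join(demo_dir, rel), "w") as handle:
            handle.write("x")
    demo_paths = list_local_paths(demo_dir)

print(demo_paths)
```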
