Invoking an external utility that needs HDFS folders for input and output from within a Recipe

navraj28 Registered Posts: 2 ✭✭✭

Hello, I am very new to Dataiku, and my use-case might not be typical. I want to build an NLP pipeline where each stage reads one file (say a PDF) and produces another, say an XML file. I want to invoke a 3rd-party program that requires two parameters: an HDFS input folder and an HDFS output folder. I see that a Recipe also requires an input and an output folder, for which I can define managed folders. In my case, the actual reading and writing of the HDFS folders will be performed by the 3rd-party program, without using any DSS APIs. In that case, will I be hard-coding the folder names within my Recipe?

Here is the pseudo code for the Recipe:

HDFS_Input_Folder = "/input"
HDFS_Output_Folder = "/output"
callThirdPartyAPI(HDFS_Input_Folder, HDFS_Output_Folder)

# Where am I actually using the managed folders associated with the Recipe?
# The 3rd-party service runs on another server and reads/writes the HDFS folders directly.
# I am using Dataiku only to build a Flow.
# Can I use the information from the managed folders to build the "HDFS file path" required by the 3rd-party app? How?


  • Liev Dataiker Alumni Posts: 176 ✭✭✭✭✭✭✭✭

    Hi @navraj28

    There is limited information in the question but we can try a few things:

    - Using the DSS Python API you could indeed obtain information about your folders: the one containing the PDFs and the one that will receive the outputs. The docs for managed folders are here

    - Both folders would have to be on an HDFS connection (unless your tool also accepts S3 or similar)

    - You could then pass those resolved paths to the 3rd-party tool.
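    Putting those pointers together, here is a minimal sketch of how a recipe could derive full HDFS paths from its managed folders and hand them to the external tool. The helper function is plain string handling; the commented `dataiku` calls are assumptions (the folder names, the `NAMENODE` URI, and the exact layout of the `get_info()` result vary by instance and DSS version, so inspect the dict on your own instance first):

    ```python
    import posixpath

    def build_hdfs_uri(namenode, folder_root):
        """Join a namenode URI and a folder path into a full HDFS URI.

        e.g. ("hdfs://namenode:8020", "/data/input") -> "hdfs://namenode:8020/data/input"
        """
        return namenode.rstrip("/") + posixpath.join("/", folder_root)

    # Inside a DSS Python recipe this could be wired up roughly as follows
    # (sketch only -- "input_pdfs", "output_xml", NAMENODE and the accessInfo
    # key layout are placeholders/assumptions, not confirmed API details):
    #
    # import dataiku
    # in_info = dataiku.Folder("input_pdfs").get_info()
    # out_info = dataiku.Folder("output_xml").get_info()
    # NAMENODE = "hdfs://namenode:8020"
    # in_path = build_hdfs_uri(NAMENODE, in_info["accessInfo"]["root"])
    # out_path = build_hdfs_uri(NAMENODE, out_info["accessInfo"]["root"])
    # callThirdPartyAPI(in_path, out_path)  # the external tool from the question
    ```

    That way the recipe's declared managed folders remain the single source of truth for the paths, instead of hard-coding "/input" and "/output" in the recipe body.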

    Good luck with your project!
