Add S3 path name parts as columns in dataset?

MarkPundurs
MarkPundurs Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Registered Posts: 27 ✭✭✭✭

I have source S3 files whose paths are of the form <engine>_<yyyymmdd>/<tablename>.csv. I want to take all files named mytable.csv and create a dataset whose fields are those in the files - PLUS the fields "engine" and "date", with values for each record derived from that record's source file path. How can I accomplish this in DSS, with visual and/or code elements as needed? (Partitioning seems to do a lot of what I want, but I can't find how to turn partitions into dataset fields.)


Operating system used: Linux

Best Answer

  • Alexandru
    Alexandru Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,285 Dataiker
    Answer ✓

    Hi @MarkPundurs,

    You can turn the partition into a field in the dataset using the processor described in (2) below.

    1) You can also use a "Files from folder" dataset and filter the files you want to include.

    Screenshot 2022-02-25 at 08.43.43.png

    Screenshot 2022-02-25 at 08.44.21.png

    2) You can use "Enrich records with files info" in a Prepare recipe to add the source file path to the output that the Prepare recipe creates.

    Screenshot 2022-02-25 at 08.40.56.png

    Let me know if this would work for you.
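
Once the source file path is available as a column, the "engine" and "date" values still have to be split out of it. A minimal sketch of that parsing step in plain Python follows, assuming paths end in <engine>_<yyyymmdd>/<tablename>.csv as described in the question. The names parse_source_path and PathParts are hypothetical, and nothing here calls a DSS API; the function could be applied to the path column from a code step.

```python
import datetime as dt
import re
from typing import NamedTuple, Optional

# Matches "<engine>_<yyyymmdd>/<tablename>.csv" at the end of a path.
# Any leading bucket or folder prefix is ignored.
PATH_PATTERN = re.compile(
    r"(?:^|/)(?P<engine>[^/]+)_(?P<date>\d{8})/(?P<table>[^/]+)\.csv$"
)


class PathParts(NamedTuple):
    """Values derived from one source file path (hypothetical helper type)."""
    engine: str
    date: dt.date
    table: str


def parse_source_path(path: str) -> Optional[PathParts]:
    """Return engine, date and table name, or None if the path does not fit."""
    match = PATH_PATTERN.search(path)
    if match is None:
        return None
    try:
        parsed_date = dt.datetime.strptime(match.group("date"), "%Y%m%d").date()
    except ValueError:
        # Eight digits that do not form a real calendar date.
        return None
    return PathParts(match.group("engine"), parsed_date, match.group("table"))


if __name__ == "__main__":
    # Made-up example paths, for illustration only.
    for example in ["/spark_20220225/mytable.csv", "notes/readme.txt"]:
        print(example, "->", parse_source_path(example))
```

To keep only the rows that came from mytable.csv, check that the returned table value equals "mytable" before using the engine and date values.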
