Add S3 path name parts as columns in dataset?
I have source S3 files whose paths are of the form <engine>_<yyyymmdd>/<tablename>.csv. I want to take all files named mytable.csv and create a dataset whose fields are those in the files - PLUS the fields "engine" and "date", with values for each record derived from that record's source file path. How can I accomplish this in DSS, with visual and/or code elements as needed? (Partitioning seems to do a lot of what I want, but I can't find how to turn partitions into dataset fields.)
Operating system used: Linux
Best Answer
Alexandru Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,225 Dataiker
Hi @MarkPundurs,
You can turn the partition into a field in the dataset using the processor described in (2) below.
1) You can also use a "Files from folder" dataset and filter the files you want to include.
2) You can use the "Enrich records with files info" processor in a Prepare recipe to add the source file path as a column in the output dataset that the recipe creates.
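Once the file path is available as a column, the "engine" and "date" values still have to be split out of it. As a rough sketch only, assuming the path follows the <engine>_<yyyymmdd>/<tablename>.csv layout from the question, something like the plain Python below could do the parsing, for example inside a Python recipe or a custom step. The function name parse_source_path and the regular expression are hypothetical and not part of DSS.

```python
import re
from datetime import datetime

# Assumed layout (from the question): <engine>_<yyyymmdd>/<tablename>.csv
# The engine name may itself contain underscores; the date is the last
# underscore-separated chunk of the folder name and must be 8 digits.
PATH_PATTERN = re.compile(
    r"(?:^|/)(?P<engine>[^/]+)_(?P<date>\d{8})/(?P<table>[^/]+)\.csv$"
)


def parse_source_path(path):
    """Return a dict with engine, date and table, or None if the path does not fit.

    Hypothetical helper for illustration; it only uses the standard library.
    """
    match = PATH_PATTERN.search(path)
    if match is None:
        return None
    try:
        # Reject folder names whose 8 digits are not a real calendar date.
        parsed_date = datetime.strptime(match.group("date"), "%Y%m%d").date()
    except ValueError:
        return None
    return {
        "engine": match.group("engine"),
        "date": parsed_date,
        "table": match.group("table"),
    }


if __name__ == "__main__":
    info = parse_source_path("/postgres_20230115/mytable.csv")
    print(info["engine"], info["date"], info["table"])
```

Filtering on the "table" value (for example keeping only rows where it equals "mytable") would then restrict the result to the files you care about. The same split could probably also be done with visual Prepare steps on the path column if you prefer to avoid code.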
Let me know if this would work for you.