How do I extract filename of file uploaded using Dataset -> Upload your files

abhayt · July 2021

I have a CSV file that contains 2 datasets stacked vertically (one below the other):

1. Header

2. Body

After parsing these 2 datasets using prepare recipe, they need to be joined together.

However, there is no common key between these 2 datasets.

One way is to enrich both datasets with the CSV filename during the prepare recipe step, and then join them using this filename as the key.

I am unable to find any option in DSS that can identify or extract the uploaded file's name.

Please help.

Alexandru · July 2021

Hi,

In a prepare recipe you should be able to use Misc > Enrich record with context information. This lets you add the filename as a column, which you can then use as the join key.

https://doc.dataiku.com/dss/9.0/preparation/processors/enrich-with-record-context.html

Please note there could be some limitations for file types other than txt or csv.

See:

https://community.dataiku.com/t5/Using-Dataiku-DSS/quot-Enrich-records-with-files-info-quot-in-prepare-recipes/m-p/11092#M5168

Let me know if this would work for you.

Thanks,

abhayt · July 2021

Unfortunately, I don't see this option in my version of DSS. Any other suggestions, please?

Dataiku DSS version 6.0.1

Alexandru · July 2021

If you are unable to upgrade, one possible suggestion would be to upload all your files to a managed folder. Then use a Python recipe to add the file name as a column and write the result to another managed folder, from which you can create your datasets.

import dataiku
import pandas as pd

# Replace these with the IDs of your own input and output managed folders.
input_folder = dataiku.Folder("PAcVjikK")
output_folder = dataiku.Folder("MLpqB40C")

# Iterate over every file in the input folder, add its name as a new column,
# and write the enriched CSV to the output folder under the same name.
for path in input_folder.list_paths_in_partition():
    with input_folder.get_download_stream(path) as f:
        data = pd.read_csv(f)
    filename = path[1:]  # paths start with "/", so drop the leading slash
    print(filename)
    data['filename_column'] = filename
    output_folder.upload_stream(filename, data.to_csv(index=False).encode("utf-8"))
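
Once both the header and the body carry the filename column, the join itself could look roughly like the sketch below. This is plain pandas that runs outside DSS and only illustrates the idea. The sample data, the file name sales_july.csv and the helper tag_with_filename are made up for the example; inside DSS you would more likely use a join recipe on filename_column.

```python
import io
import pandas as pd


def tag_with_filename(csv_text, filename):
    """Parse CSV text and add the source file name as a column (hypothetical helper)."""
    df = pd.read_csv(io.StringIO(csv_text))
    df["filename_column"] = filename
    return df


# Made-up header and body sections that came from the same uploaded file.
header = tag_with_filename("report_date,region\n2021-07-01,EMEA\n", "sales_july.csv")
body = tag_with_filename("item,amount\nwidget,10\ngadget,25\n", "sales_july.csv")

# Every body row picks up the header values of the file it came from.
joined = body.merge(header, on="filename_column", how="left")
print(joined)
```

A left join from the body keeps every body row even if a file has no header section; those rows would simply have empty header columns.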
