
How do I extract filename of file uploaded using Dataset -> Upload your files

abhayt
Level 1

I have a CSV file that contains 2 datasets stacked vertically (one below the other):

1. Header

2. Body

After parsing these 2 datasets using a prepare recipe, they need to be joined together.

However, there is no common key between these 2 datasets.

One way is to enrich both datasets with the CSV filename during the prepare recipe step, and then join them using this filename as the key.

I am unable to find any option in DSS that can help identify/extract the uploaded file's name.

Please help.

3 Replies
AlexT
Dataiker

Hi,

In a prepare recipe you should be able to use Misc > Enrich record with context information, which lets you add the filename as a column and then join on it.

https://doc.dataiku.com/dss/9.0/preparation/processors/enrich-with-record-context.html

Please note there could be some limitations for file types other than txt or csv.

See : 

https://community.dataiku.com/t5/Using-Dataiku-DSS/quot-Enrich-records-with-files-info-quot-in-prepa...

Let me know if this would work for you. 

 

Thanks,

 

abhayt
Level 1
Author

Unfortunately, I don't see this option in my version of DSS. Any other suggestions, please?

Dataiku DSS

Version 6.0.1

AlexT
Dataiker

If you are unable to upgrade, one possible suggestion would be to upload all your files to a managed folder, then use a Python recipe to add the file name as a column and write the results to another managed folder, from which you can create your datasets.

import dataiku
import pandas as pd

# Folder IDs are specific to each project; replace them with your own.
input_folder = dataiku.Folder("PAcVjikK")
output_folder = dataiku.Folder("MLpqB40C")
paths = input_folder.list_paths_in_partition()

# Read each file, add its name as a new column, and write the result to the output managed folder.
for path in paths:
    with input_folder.get_download_stream(path) as f:
        data = pd.read_csv(f)
    # Paths returned by the folder start with "/"; drop it to get the file name.
    filename = path[1:]
    print(filename)
    data['filename_column'] = filename
    print(data)
    output_folder.upload_stream(filename, data.to_csv(index=False).encode("utf-8"))
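
Once both parts of a file carry the same filename column, the join asked about in the original question could look roughly like the sketch below. It uses plain pandas outside of DSS. The helper tag_with_filename, the file name sales_jan.csv, and the column names other than filename_column are all made up for illustration; the real header and body layouts will differ.

```python
import io

import pandas as pd


def tag_with_filename(csv_text: str, filename: str) -> pd.DataFrame:
    """Parse CSV text and add the source file name as a column (hypothetical helper)."""
    df = pd.read_csv(io.StringIO(csv_text))
    df["filename_column"] = filename
    return df


# Hypothetical header and body sections that came from the same uploaded file.
header = tag_with_filename("report_date,region\n2021-01-31,EMEA\n", "sales_jan.csv")
body = tag_with_filename("item,amount\nA,10\nB,20\n", "sales_jan.csv")

# Join on the file name, since the two sections share no other key.
joined = body.merge(header, on="filename_column", how="left")
print(joined)
```

Each body row picks up the header values from its own file, so the same approach should extend to many files as long as every file name is unique.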

 

 

 
