How do I extract filename of file uploaded using Dataset -> Upload your files

abhayt
abhayt Registered Posts: 3 ✭✭✭

I have a csv that contains 2 datasets arranged vertically (one below the other) in it -

1. Header

2. Body

After parsing these 2 datasets using prepare recipe, they need to be joined together.

However, there is no common key between these 2 datasets.

One way is to enrich these 2 datasets during prepare recipe step with the csv filename and then join the 2 datasets using this filename as the key.

I am unable to find any option in DSS that can help identify/ extract the uploaded file's name.

Please help.

Answers

  • Alexandru
    Alexandru Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,226 Dataiker

    Hi,

    In a Prepare recipe you should be able to use Misc > Enrich record with context information, which lets you add the filename as a column and then join on it.

    https://doc.dataiku.com/dss/9.0/preparation/processors/enrich-with-record-context.html

    Please note there could be some limitations for file types other than txt or csv.

    See :

    https://community.dataiku.com/t5/Using-Dataiku-DSS/quot-Enrich-records-with-files-info-quot-in-prepare-recipes/m-p/11092#M5168

    Let me know if this would work for you.

    Thanks,

  • abhayt
    abhayt Registered Posts: 3 ✭✭✭

    Unfortunately, I don't see this option in my version of DSS. Any other suggestions, please?

    Dataiku DSS

    Version 6.0.1

  • Alexandru
    Alexandru Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,226 Dataiker
    edited July 17

    If you are unable to upgrade, one possible approach is to upload all your files to a managed folder, then use a Python recipe to add the file name to each file and write the results to another managed folder, from which you can create your datasets.

    import dataiku
    import pandas as pd
    
    input_folder = dataiku.Folder("PAcVjikK")
    paths = input_folder.list_paths_in_partition()
    output_folder = dataiku.Folder("MLpqB40C")
    
    # Read each CSV in the input folder, add a column holding its file name,
    # and write the result to the output managed folder under the same name.
    for path in paths:
        with input_folder.get_download_stream(path) as f:
            data = pd.read_csv(f)
        filename = path[1:]  # folder paths start with "/", so drop the leading slash
        print(filename)
        data['filename_column'] = filename
        output_folder.upload_stream(filename, data.to_csv(index=False).encode("utf-8"))
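
Once both the header and the body rows carry the file name, the join from the original question could look roughly like the sketch below. The two DataFrames, the `report_date` and `amount` columns, and the sample file names are made up for illustration; only `filename_column` comes from the recipe above. In DSS itself the same thing can be done with a Join recipe keyed on that column.

```python
import pandas as pd

# Hypothetical parsed outputs: one header row per file and several body rows
# per file, each already tagged with the name of the file it came from.
header = pd.DataFrame({
    "filename_column": ["a.csv", "b.csv"],
    "report_date": ["2021-01-01", "2021-01-02"],
})
body = pd.DataFrame({
    "filename_column": ["a.csv", "a.csv", "b.csv"],
    "amount": [10, 20, 30],
})

# Left join: every body row picks up the header fields of its own file.
# validate="many_to_one" raises an error if a file has more than one header row.
joined = body.merge(header, on="filename_column", how="left", validate="many_to_one")
print(joined)
```

This assumes file names are unique across uploads; if the same name can be uploaded twice, the key would need something extra, such as an upload timestamp.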
    
