How to manage your input files and flow in general

Mateusz

Hi All,

So I know how to create my flow and which files to use.

I know that for this project I will have to manually upload 2 files separately.

The thing is, I will have to upload these files every month and run a scenario to update everything.

How do I manage this updating? Right now, if I want to upload the same file with new data and try to give it the same name as before, DSS doesn't allow me to do that.

Also, do you have any tips and tricks for performance (or good practices) and for managing this uploading/refreshing process in general?

Thanks

eMate

Best Answers

  • ATsao
    Answer ✓

    Hi eMate,

    Unfortunately, there is no good way to "automatically" update an Upload dataset, as it consists precisely of files that have been manually uploaded by the user. Instead, if you expect new files/data to come in every month, you may want to consider storing them in some kind of datastore (by building a data ingestion procedure further upstream). For example, you could read the data into a database, or you could store the files on a filesystem-like datastore (S3, FTP, HDFS, local filesystem, etc.). DSS would then connect to that datastore, which would allow you to build a workflow that automatically rebuilds your Flow so that your datasets use the updated data.

    To elaborate, very simply, you could have these files automatically added to a directory that the DSS server has access to (perhaps one that is even mounted). This is something you would need to handle yourself, as it occurs "before" the data is read into DSS. Then, you could create a filesystem connection that points to the folder where the files are stored and use it as the initial input dataset in your Flow (assuming the schema/format of the files remains the same). Finally, you could create a scenario that rebuilds your Flow, using either a time-based trigger or a dataset modification trigger, to handle ingesting the new data at a periodic interval.
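    As an illustration, a custom Python step inside that scenario could be as small as the sketch below (the dataset name "my_output_dataset" is a placeholder for your Flow's final dataset; a plain step-based "Build" step works just as well):

        # Custom Python step in a DSS scenario: rebuild the Flow so the
        # datasets pick up the files newly dropped into the input folder.
        from dataiku.scenario import Scenario

        scenario = Scenario()

        # "my_output_dataset" is a placeholder - use the final dataset(s)
        # of your Flow; the build mode controls how far upstream DSS goes.
        scenario.build_dataset("my_output_dataset")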

    I hope that this helps or at least gives you a better idea of what you'll need to do!

    Thanks,

    Andrew

  • Marlan
    Answer ✓

    Hi @emate,

    In addition to @ATsao's suggestions, which could allow you to more fully automate your process, we have also used the approach of creating a folder and then uploading files to it. You can upload multiple files there and re-upload new files under the same name, and you will have the option to overwrite the existing files.
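    If the monthly re-upload itself should eventually be scripted rather than done by hand, the same folder can also be filled from outside DSS through the public Python API. A rough sketch, where the host, API key, project key, folder id and file names are all placeholders:

        # Push this month's files into the managed folder from outside DSS;
        # re-using the same names overwrites the previous versions.
        import dataikuapi

        client = dataikuapi.DSSClient("https://dss.example.com", "YOUR_API_KEY")   # placeholders
        folder = client.get_project("MY_PROJECT").get_managed_folder("FOLDER_ID")  # placeholders

        for name in ["file1.xlsx", "file2.xlsx"]:  # placeholder file names
            with open(name, "rb") as f:
                folder.put_file(name, f)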

    To create the folder from the flow, click + DATASET and select the "Folder" option. For the Store into parameter, specify filesystem_folders.

    Once the files are uploaded, you can either read them directly in a Python recipe or turn them into datasets for subsequent use in your flow. To do the latter, right-click on a file name and choose Create a dataset. Then click TEST, PREVIEW and specify the needed settings. For example, for Excel files you can choose which sheets you want. Finally, specify a name for the dataset and create it.
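    For the "read them directly in a Python recipe" route, a minimal sketch could look like this (the folder name, file name, sheet name and output dataset are placeholders, and pandas needs an Excel engine such as openpyxl in the recipe's code environment):

        # Python recipe: read an Excel file from the managed folder
        # and write it out as a regular DSS dataset.
        import dataiku
        import pandas as pd

        folder = dataiku.Folder("monthly_uploads")           # placeholder folder name

        # Stream the uploaded file out of the folder and parse it with pandas
        with folder.get_download_stream("file1.xlsx") as stream:
            df = pd.read_excel(stream, sheet_name="Sheet1")  # placeholder sheet

        # Placeholder output dataset, declared as an output of the recipe
        dataiku.Dataset("file1_parsed").write_with_schema(df)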

    Marlan

  • CoreyS
    Answer ✓

    I may be misunderstanding, so forgive me, but have you utilized the Flow Actions menu? The Build All option in particular would update the entire flow once you input the new dataset.

    [Screenshot: Screen Shot 2020-04-07 at 9.02.46 AM.png - the Flow Actions menu showing the Build All option]
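    If you later want that same full rebuild to happen without clicking, one option (hedged, since it assumes you wrap the build in a scenario) is to launch that scenario through the public API, roughly:

        # Kick off the scenario that rebuilds the Flow, from an external script
        import dataikuapi

        client = dataikuapi.DSSClient("https://dss.example.com", "YOUR_API_KEY")   # placeholders
        scenario = client.get_project("MY_PROJECT").get_scenario("REBUILD_ALL")    # placeholder ids

        run = scenario.run_and_wait()  # blocks until the scenario run finishes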

    I hope this helps!

Answers

  • Mateusz

    Hi All,

    Thanks to both of you for your input.

    @Marlan So basically, the idea is to create this folder with the 2 files (in my case), then create datasets out of those 2, and I can easily overwrite them every month. Can I also trigger the scenario automatically every time the files are overwritten?

  • Mateusz

    Hi again,

    I did check it, but I'm not sure how to overwrite a file so that everything updates.

    I mean, when I overwrite a file in the folder it is OK, but the dataset created in the flow didn't update. Do you have an idea why that might be?

    Thanks

  • Mateusz

    Hi again,

    OK, it is working after all, my bad. What confused me is that I uploaded 2 files into this folder and followed this step: "or turn them into datasets for subsequent use in your flow. To do the latter, right-click on a file name and choose Create a dataset. Then click TEST, PREVIEW and specify the needed settings. For example, for Excel files you can choose which sheets you want. Finally, specify a name for the dataset and create it." Then I made a simple test Prepare recipe that removes rows based on a column value.

    This created the dataset Test_flexview_1, and I set the scenario to run when the data in the "test" folder changes. The Excel file I used contains only 10 rows, so I could check whether this was working. After the job was done, I looked at the Test_flexview_1 dataset, and when inspecting/exploring it, it was not updated in terms of number of rows, so I thought it was not working. But the final dataset after this Prepare recipe does contain all the new rows after all... I just thought that since the scenario runs as a whole, it should also refresh the view of the dataset created out of the folder.

  • vic

    Hi,

    I have the same problem. The "files in folder" dataset does not refresh after a change in the folder's files.

    I tried to run a Python recipe to create a dataset from the most recent file, but since the folder is an HDFS folder I do not have access (or couldn't find the documentation) to the creation or last modification time.
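    One possible workaround, assuming get_path_details() is available for your DSS version and the HDFS backend, is to ask DSS itself for the files' metadata rather than going through HDFS directly; a rough sketch:

        # Python recipe: pick the most recently modified file in the folder
        # using DSS's folder API instead of direct HDFS calls.
        import dataiku
        import pandas as pd

        folder = dataiku.Folder("my_hdfs_folder")  # placeholder folder name

        paths = folder.list_paths_in_partition()

        # get_path_details() should expose metadata such as "lastModified"
        # (epoch milliseconds); availability may depend on the DSS version
        # and the folder's storage backend.
        latest = max(paths, key=lambda p: folder.get_path_details(p)["lastModified"])

        with folder.get_download_stream(latest) as stream:
            df = pd.read_csv(stream)  # adjust the reader to your file format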

  • Mateusz

    Hi @vic

    It is working for me after all - I just created a folder, put the files in it -> Create a dataset, and that's all.

    After this step, I just upload (overwrite) the file manually, and after I re-run the flow everything is updated.

    Is this not the case in your project?

    Mateusz

  • Mateusz

    This screenshot shows the context menu used to "Create a dataset":
    [Screenshot: dss.png - the right-click context menu with the "Create a dataset" option]

  • vic

    It doesn't get refreshed after a run.

    My problem is that files do not get overwritten but added. Each new file has a new name (because they're timestamped). My dataset is set to use "All files" from the folder.
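    For what it's worth, if only the newest timestamped file should feed the dataset, another hedged option is a small Python recipe that picks the latest file by name, assuming the timestamps in the names sort chronologically (e.g. 2020-04-07_data.csv):

        # Python recipe: keep only the newest timestamped file from the folder
        import dataiku
        import pandas as pd

        folder = dataiku.Folder("timestamped_files")  # placeholder folder name

        # Works when the timestamp in the file name makes names sort chronologically
        latest = max(folder.list_paths_in_partition())

        with folder.get_download_stream(latest) as stream:
            df = pd.read_csv(stream)  # adjust the reader to your file format

        dataiku.Dataset("latest_snapshot").write_with_schema(df)  # placeholder output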

  • Mateusz

    Hi @vic

    I think I saw a workaround somewhere; try this - upload the first file, create the dataset, and then upload the second file with a different name (without creating a dataset). Then, before re-running the flow, just find the recipe (1), go into "Files" and add the files manually (2).

    1) [Screenshot: dss2.png]

    2) [Screenshot: dss.png]

    On the other hand, I don't know if this is optimal, but you could also just use a Stack recipe.

    Mateusz

  • ColleenSpence

    Hi eMate,

    I am facing a similar task soon, because my data source is an email attachment in Outlook.

    I don't have my DSS instance yet, but I am currently working with data flows in MS Power Automate. If your company has Office 365, you could use the online Power Automate to grab attachments from the Outlook email with query parameters (using From / Subject Line / email body content). Once the flow has the attachment, it could be placed in a specific OneDrive folder, which could then trigger your DSS flow (using the OneDrive plugin as a connector).

    If you don't have Office 365, you could use Power Automate for Desktop to place your files in a location that DSS can access, as others in this thread have mentioned.
