
How to manage your input files and flow in general

Level 4

Hi All,

So I know how to create my flow and which files to use.

I know that for this project I will have to manually upload 2 files separately.

The thing is, I will have to upload these files every month and run a scenario to update everything.

How do I manage this updating? Right now, if I want to upload the same file with new data and try to give it the same name as before, DSS doesn't allow me to do that.

Also, do you have any tips and tricks for performance (or good practices in general) for managing this uploading/refreshing process?

 

Thanks

eMate

6 Replies
Dataiker

Hi eMate,

Unfortunately, there is no good way to "automatically" update an Upload dataset, precisely because it involves files that have been manually uploaded by the user. Instead, if you expect new files/data to come in every month, you may want to consider storing them in a datastore by building a data ingestion procedure further upstream. For example, you could load the data into a database, or store the files on a filesystem-type datastore (S3, FTP, HDFS, a local filesystem, etc.). DSS would then connect to that store, which would allow you to build a workflow that automatically rebuilds your Flow so that your datasets use the updated data.
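As a rough illustration of that upstream ingestion step (just a sketch, and every path and file name in it is a hypothetical placeholder), it could be as simple as a small script that copies each month's exports into the directory your DSS connection will read from:

import shutil
from pathlib import Path

# Hypothetical locations: where the monthly exports land, and the
# directory that the DSS filesystem connection points at.
SOURCE_DIR = Path("/mnt/monthly_exports")
TARGET_DIR = Path("/data/dss_input")

# Copy both files, overwriting last month's versions in place so the
# file names (and therefore the dataset schema) stay stable.
for name in ["file_a.xlsx", "file_b.xlsx"]:
    shutil.copy2(SOURCE_DIR / name, TARGET_DIR / name)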

To elaborate: very simply, you could have these files automatically added to a directory that the DSS server has access to (perhaps even one that is mounted). You would need to handle this part yourself, since it happens "before" the data is read into DSS. Then, you could create a filesystem connection that points to the folder where the files are stored and use it as the initial input dataset in your Flow (assuming the schema/format of the files remains the same). Finally, you could create a scenario that rebuilds your Flow, using either a time-based trigger or a dataset modification trigger, to ingest the new data at a periodic interval.
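And if you ever want to kick off that rebuild from outside DSS (for example, from the same job that drops the files), scenarios can also be started through the public API. Here is a minimal sketch using the dataikuapi package, where the host URL, API key, project key, and scenario id are all placeholders you would replace with your own:

import dataikuapi

# Placeholders: your DSS URL, a personal API key, and your project key.
client = dataikuapi.DSSClient("https://dss.example.com:11200", "YOUR_API_KEY")
project = client.get_project("MYPROJECT")

# Start the (hypothetical) rebuild scenario and block until it finishes.
scenario = project.get_scenario("monthly_refresh")
scenario.run_and_wait()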

I hope that this helps or at least gives you a better idea of what you'll need to do!

Thanks,

Andrew

Level 4

Hi @emate,

In addition to @ATsao's suggestions, which could allow you to more fully automate your process, we have also used the approach of creating a folder and then uploading files to it. You can upload multiple files there and re-upload new files under the same name; you will get the option to overwrite the existing files.

To create the folder from the Flow, click + DATASET and select the "Folder" option. For the Store into parameter, specify filesystem_folders.

Once the files are uploaded, you can either read them directly in a Python recipe or turn them into datasets for subsequent use in your Flow. To do the latter, right-click on a file name and choose Create a dataset. Then click TEST, PREVIEW, and specify any needed settings; for example, for Excel files you can choose which sheets you want. Finally, specify a name for the dataset and create it.
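If you go the Python recipe route, reading one of the uploaded files straight out of the managed folder looks roughly like this (a sketch; the folder, file, and dataset names are all hypothetical):

import dataiku
import pandas as pd

# "monthly_inputs" is the managed folder and "file_a.xlsx" one of the
# uploaded files; both names are made up for this example.
folder = dataiku.Folder("monthly_inputs")
with folder.get_download_stream("file_a.xlsx") as stream:
    df = pd.read_excel(stream)

# Write the result to one of the recipe's output datasets.
dataiku.Dataset("file_a_raw").write_with_schema(df)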

Marlan

Level 4
Author

Hi All,

Thanks to you both for your input.

@Marlan So basically, the idea is to create this folder (with 2 files in my case), then create datasets out of those 2 files, and I can easily overwrite them every month. Can I trigger a scenario run automatically every time the files are overwritten?

Level 4
Author

Hi again,

I did check it, but I'm not sure how to overwrite a file so that everything updates.

I mean, when I overwrite a file in the folder it is OK, but the dataset created in the Flow didn't update. Do you have an idea why that might be?

 

Thanks

Community Manager

I may be misunderstanding, so forgive me, but have you used the Flow Actions menu? The Build All option in particular will rebuild the entire Flow once you have input the new data.

 

[Screenshot: the Flow Actions menu, showing the Build All option]
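The same rebuild can also be done programmatically if you prefer. A rough sketch with the dataikuapi package (assuming a recent enough version; the URL, API key, and project/dataset names below are placeholders), which should be equivalent to a forced recursive build from the Flow:

import dataikuapi

# Placeholders for the DSS URL, an API key, and the project/dataset names.
client = dataikuapi.DSSClient("https://dss.example.com:11200", "YOUR_API_KEY")
dataset = client.get_project("MYPROJECT").get_dataset("final_output")

# Force a recursive rebuild of this dataset and everything upstream of it.
dataset.build(job_type="RECURSIVE_FORCED_BUILD")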

 

I hope this helps!

Level 4
Author

Hi again,

OK, it is working after all, my bad. What confused me is that I uploaded 2 files into this folder and followed this step: "or turn them into datasets for subsequent use in your Flow. To do the latter, right-click on a file name and choose Create a dataset. Then click TEST, PREVIEW, and specify any needed settings; for example, for Excel files you can choose which sheets you want. Finally, specify a name for the dataset and create it." Then I made a simple test Prep recipe removing rows based on a column value.

This created the dataset Test_flexview_1, and I set the scenario to run when the data in the "test" folder changes. The Excel file I used contains only 10 rows, to check whether this was working. After the job was done, I looked at the Test_flexview_1 dataset, and when I was inspecting/exploring it, it was not updated in terms of number of rows, so I thought this was not working. But the final dataset after the 'prep' icon contains all the new rows after all... I just thought that since the scenario runs as a whole, it should also refresh the view of the dataset created from the folder.