So I know how to create my flow and which files to use.
I know that for this project I will have to manually upload 2 files separately.
The thing is, I will have to upload these files every month and run a scenario to update everything.
How should I manage this updating? Right now, if I want to upload the same file with new data and try to give it the same name as before, it doesn't allow me to do that.
Also, do you have any tips and tricks for performance (or good practices) and for generally managing this uploading/refreshing process?
Unfortunately, there is no good way to "automatically" update an Uploaded dataset, since it by definition consists of files that have been manually uploaded by the user. Instead, if you expect new files/data to come in every month, you may want to consider storing the data/files in some kind of datastore (by building a data ingestion procedure further upstream). For example, you could read the data into a database, or store the files on some kind of filesystem datastore (whether that's S3, FTP, HDFS, local filesystem, etc.). DSS would then connect to that store, which would allow you to build a workflow that automatically rebuilds your Flow so that your datasets pick up the updated data accordingly.
To elaborate, very simply: you could have these files automatically added to some directory that the DSS server has access to (perhaps it is even mounted). This part you would need to handle yourself, since it occurs "before" the data is read into DSS. Then, you could create a filesystem connection that points to the folder where the files are stored and use it as the initial input dataset in your Flow (assuming the schema/format of the files remains the same). Finally, you could create a scenario that rebuilds your Flow, using either a time-based trigger or a dataset-modification trigger, to ingest the new data at a periodic interval automatically.
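To sketch the upstream part of this (the step that happens "before" DSS reads the data), here is a minimal example of staging the monthly exports into a landing directory that a DSS filesystem connection could point at. All paths, filenames, and the CSV extension are assumptions for illustration, not anything DSS-specific:

```python
import shutil
from pathlib import Path


def stage_monthly_files(source_dir: str, landing_dir: str) -> list:
    """Copy this month's export files into the landing directory that the
    DSS filesystem connection reads from, overwriting previous versions.

    Paths and the *.csv pattern are hypothetical; adjust to your exports.
    """
    landing = Path(landing_dir)
    landing.mkdir(parents=True, exist_ok=True)
    copied = []
    for src in sorted(Path(source_dir).glob("*.csv")):
        dest = landing / src.name
        shutil.copy2(src, dest)  # copy2 preserves the file's timestamps
        copied.append(dest.name)
    return copied
```

With the landing directory as the Flow's input, a dataset-modification trigger (or a monthly time-based trigger) on the scenario would then pick up the new files without any manual upload.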
I hope that this helps or at least gives you a better idea of what you'll need to do!
Hi @emate ,
In addition to @ATsao's suggestions, which could allow you to more fully automate your process, we have also used the approach of creating a folder and then uploading files to it. You can upload multiple files there and re-upload new files under the same name; you will be given the option to overwrite the existing files.
To create the folder from the flow, click + DATASET and select the "Folder" option. For the Store into parameter, specify filesystem_folders.
Once the files are uploaded, you can either read them directly in a Python recipe or turn them into datasets for subsequent use in your Flow. To do this, right-click on a file name and choose Create a dataset. Then click TEST, PREVIEW, and specify the needed settings. For example, for Excel files you can choose which sheets you want. Finally, specify a name for the dataset and create it.
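As a rough illustration of the "read them directly in a Python recipe" option: inside DSS you would normally go through the folder API to get at the files, but the reading itself boils down to something like the sketch below, where the folder path, filename, and columns are all made-up stand-ins:

```python
import csv
from pathlib import Path


def read_uploaded_csv(folder_path: str, filename: str) -> list:
    """Read one uploaded CSV from the folder into a list of row dicts.

    folder_path stands in for wherever the managed folder's files live;
    in a real recipe you would obtain the file from the folder API instead.
    """
    with open(Path(folder_path) / filename, newline="") as f:
        return list(csv.DictReader(f))
```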
Thanks both for your input.
@Marlan So basically, the idea is to create this folder with 2 files in my case, then create datasets out of those 2, and I can easily overwrite them every month. Can I also trigger the scenario to run automatically every time the files are overwritten?
I did check it, but I'm not sure how to overwrite a file so that everything updates.
I mean, when I overwrite a file in the folder it is OK, but the created dataset in the Flow didn't update. Do you have an idea why that might be?
I may be misunderstanding, so forgive me, but have you used the Flow Actions section? The Build All option in particular would update the entire Flow once you input the new data.
I hope this helps!
OK, it is working after all, my bad. What confused me is that I uploaded 2 files into this folder and followed this step: "or you can turn them into Datasets for subsequent use in your flow. To do this, right click on a file name and choose Create a dataset. Then click TEST, PREVIEW and then specify needed settings. For example, for Excel files you can choose which sheets you want. Then specify a name for the dataset and create it." Then I made a simple test Prep recipe removing rows based on a column value.
This created the dataset Test_flexview_1, and I set a scenario to run when the data in the "test" folder changes. The Excel file I used contains only 10 rows. To check whether this was working, I looked at the Test_flexview_1 dataset after the job was done, and when inspecting/exploring it, the number of rows was not updated, so I thought this was not working. But the final dataset after this Prep icon does contain all the new rows after all... I just thought that, since the scenario runs as a whole, it should also refresh the view of the dataset created out of the folder.
I have the same problem: the dataset view "files in folder" does not refresh after a change in the folder's files.
I tried to run a Python recipe to create a dataset from the most recent file, but since the folder is an HDFS folder I do not have access (or couldn't find the doc) to the creation or last-modification time.
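For what it's worth, when modification times are available (e.g. on a local filesystem folder; HDFS folders may not expose them the same way), picking the most recent file can be sketched like this. The paths and the `*.csv` pattern are assumptions:

```python
from pathlib import Path


def most_recent_file(folder_path: str, pattern: str = "*.csv") -> Path:
    """Return the file in the folder with the latest modification time.

    Works where st_mtime is available (local filesystem); for folders
    that don't expose mtimes, see the name-based variant below.
    """
    files = list(Path(folder_path).glob(pattern))
    if not files:
        raise FileNotFoundError("no files matching %s in %s" % (pattern, folder_path))
    return max(files, key=lambda p: p.stat().st_mtime)
```

If the files are timestamped in their names (as in this thread), a workaround that needs no filesystem metadata at all is to sort by name instead: `max(files, key=lambda p: p.name)` picks the latest file as long as the timestamps sort lexicographically (e.g. `YYYY-MM-DD` prefixes).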
It is working for me after all. I just created a folder, put the files in it, chose Create a dataset, and that's all.
After this step, I just upload (overwrite) the file manually, and after I re-run the Flow everything is updated.
Is this not the case in your project?
It doesn't get refreshed after a run.
My problem is that files do not get overwritten but added: each new file has a new name (because they're timestamped). My dataset uses the "All files" option for the folder.
I think I saw a workaround somewhere; try this: upload the first file, create the dataset, and then upload the second file with a different name (without creating a dataset for it). Then, before re-running the Flow, just find the recipe (1), go into "Files" and add these files manually (2).
On the other hand, I don't know if this is optimal; you could also just use a Stack recipe.
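For the timestamped-files case, the effect of a Stack recipe over an "All files" dataset can also be sketched in plain Python: concatenate every file in the folder into one set of rows. Folder path, filenames, and the single `x` column below are made up for illustration; this assumes all files share the same schema, just as Stack does:

```python
import csv
from pathlib import Path


def stack_folder_csvs(folder_path: str) -> list:
    """Concatenate the rows of every CSV in the folder, mimicking a
    Stack recipe over an 'All files' folder dataset.

    Assumes every file has the same header (same schema).
    """
    rows = []
    for path in sorted(Path(folder_path).glob("*.csv")):
        with open(path, newline="") as f:
            rows.extend(csv.DictReader(f))
    return rows
```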