How to manage your input files and flow in general

Solved!
emate
Neuron

Hi All,

So I know how to create my flow and which files to use.

I know that for this project I will have to manually upload 2 files separately.

The thing is, I will have to upload these files every month and run a scenario to update everything.

How should I manage this updating? Right now, if I try to upload the same file with new data under the name it had previously, DSS doesn't allow me to do that.

Also, do you have any tips and tricks for performance (or good practices) and for managing this uploading/refreshing process in general?

 

Thanks

eMate

3 Solutions
ATsao
Dataiker

Hi eMate,

Unfortunately, there is no good way to "automatically" update an Upload dataset, because by definition it consists of files that a user has uploaded manually. Instead, if you expect new files/data to come in every month, you may want to consider storing them in some kind of datastore (by building a data ingestion procedure further upstream). For example, you could load the data into a database, or you could store the files on a file-based datastore (S3, FTP, HDFS, a server filesystem, etc.) that DSS connects to. You could then build a workflow that automatically rebuilds your Flow so that your datasets use the updated data.

To elaborate, the simplest option is to have these files automatically added to a directory that the DSS server has access to (it could even be a mounted directory). You would need to handle this part yourself, since it happens "before" the data is read into DSS. Then you could create a filesystem connection that points to the folder where the files are stored, and use a dataset on that connection as the initial input dataset in your Flow (assuming the schema/format of the files remains the same). Finally, you could create a scenario that rebuilds your Flow, using either a time-based trigger or the dataset modification trigger, so that the new data is ingested at a regular interval.
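As a rough illustration of the upstream step described above (the part that happens before DSS reads anything), here is a minimal sketch in plain Python. It is not a DSS feature: the function name, the landing directory and the fixed file name are all hypothetical, and it assumes the DSS dataset selects the file by its stable name.

```python
import os
import shutil
import tempfile
from pathlib import Path


def publish_monthly_file(source: Path, landing_dir: Path, fixed_name: str) -> Path:
    """Copy `source` into `landing_dir` under a stable file name.

    The copy goes to a temporary file first and is then moved into place
    with os.replace, so a reader never sees a half-written target file.
    """
    landing_dir.mkdir(parents=True, exist_ok=True)
    target = landing_dir / fixed_name
    # Create the temporary file in the same directory so the final
    # rename stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=landing_dir, suffix=".tmp")
    os.close(fd)
    try:
        shutil.copyfile(source, tmp_name)
        os.replace(tmp_name, target)  # overwrites last month's file
    finally:
        # Only still present if the copy or the rename failed.
        if os.path.exists(tmp_name):
            os.remove(tmp_name)
    return target
```

Because the file always lands under the same name, the input dataset definition would not need to change from month to month; only the scenario has to rebuild the Flow.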

I hope that this helps or at least gives you a better idea of what you'll need to do!

Thanks,

Andrew

Marlan
Neuron

Hi @emate ,

In addition to @ATsao's suggestions, which could allow you to more fully automate your process, we have also used the approach of creating a folder and then uploading files to it. You can upload multiple files there, and when you re-upload a new file under the same name you will have the option to overwrite the existing file.

To create the folder from the flow, click + DATASET and select the "Folder" option. For the Store into parameter, specify filesystem_folders.

Once the files are uploaded, you can either read them directly in a Python recipe or turn them into Datasets for subsequent use in your flow. To do the latter, right click on a file name and choose Create a dataset. Then click TEST, then PREVIEW, and specify the needed settings. For example, for Excel files you can choose which sheets you want. Then specify a name for the dataset and create it.
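For the "read them directly in a Python recipe" option, a minimal sketch could look like the following. It assumes a CSV file and takes any folder-like object with a `get_download_stream(path)` method; in a DSS recipe that object would be a `dataiku.Folder`. The folder name and file path in the comments are hypothetical.

```python
import csv
import io


def read_csv_rows(folder, path, encoding="utf-8"):
    """Return the rows of a CSV file stored in a folder-like object.

    `folder` only needs a get_download_stream(path) method that returns a
    binary stream usable as a context manager.
    """
    with folder.get_download_stream(path) as stream:
        text = stream.read().decode(encoding)
    # DictReader uses the first line as the column names.
    return list(csv.DictReader(io.StringIO(text, newline="")))


# Inside a DSS Python recipe (assumed folder name and path):
# import dataiku
# folder = dataiku.Folder("monthly_uploads")
# rows = read_csv_rows(folder, "/sales.csv")
```

Since the file is read each time the recipe runs, overwriting it in the folder and rebuilding the recipe's output should pick up the new content.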

Marlan

CoreyS
Community Manager

I may be misunderstanding, so forgive me, but have you used the Flow Actions section? The Build All option in particular would update the entire flow once you input the new dataset.

 

[Attached screenshot: Screen Shot 2020-04-07 at 9.02.46 AM.png]

 

I hope this helps!

11 Replies
emate
Neuron
Author

Hi All,

Thanks to both of you for your input.

@Marlan So basically, the idea is to create this folder with the 2 files in my case, then create datasets out of those 2, and I can easily overwrite them every month? And can I trigger a scenario run automatically every time the files are overwritten?

emate
Neuron
Author

Hi again,

I did check it, but I'm not sure how to overwrite a file so that everything updates.

I mean, when I overwrite a file in the folder it works fine, but the dataset created from it in the flow didn't update. Do you have an idea why that might be?

 

Thanks

emate
Neuron
Author

Hi again,

OK, it is working after all, my bad. What confused me is this: I uploaded 2 files into the folder and followed this step: "or you can turn them into Datasets for subsequent use in your flow. To do this, right click on a file name and choose Create a dataset. Then click TEST, PREVIEW and then specify needed settings. For example, for Excel files you can choose which sheets you want. Then specify a name for the dataset and create it." Then I made a simple test prep that removes rows based on a column value.

This created the dataset Test_flexview_1, and I set a scenario to run when the data in the "test" folder changes. The Excel file I used contains only 10 rows, to check whether this was working. After the job was done, I explored the Test_flexview_1 dataset and its number of rows had not been updated, so I thought it was not working. But the final dataset after the 'prep' icon does contain all the new rows after all. I just thought that since the scenario runs as a whole, it should also refresh the view of the dataset created from the folder.

 

 

vic
Level 2

Hi,

I have the same problem. The "files in folder" dataset view does not refresh after a change in the folder's files.

I tried to run a Python recipe to create a dataset from the most recent file, but since the folder is an HDFS folder I do not have access to (or couldn't find the doc for) the creation or last modification time.

 

emate
Neuron
Author

Hi @vic 

It is working for me after all. I just created a folder, put the files in it, used Create a dataset, and that's all.

After this step, I just upload (overwrite) the file manually, and after I re-run the flow everything is updated.

Is this not the case in your project?

Mateusz

emate
Neuron
Author

[Attached image: dss.png]

vic
Level 2

It doesn't get refreshed after a run.

My problem is that files do not get overwritten but added. Each new file has a new name (because they're timestamped). My dataset has the "All files" option for the folder.
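When the modification time is not available but the file names are timestamped, one possible workaround is to pick the most recent file from the names alone. The sketch below is only an illustration: the naming pattern (`YYYYMMDD`, optionally followed by `_HHMMSS`) and the example paths are assumptions, not something taken from this project.

```python
import re
from datetime import datetime

# Hypothetical naming scheme: any file name containing YYYYMMDD or
# YYYYMMDD_HHMMSS, e.g. "/export_20200407_090246.csv".
_STAMP = re.compile(r"(\d{8})(?:_(\d{6}))?")


def latest_by_name(paths):
    """Return the path whose name carries the most recent timestamp, or None."""
    best_path, best_time = None, None
    for path in paths:
        name = path.rsplit("/", 1)[-1]
        match = _STAMP.search(name)
        if not match:
            continue  # no timestamp in this file name
        date_part = match.group(1)
        time_part = match.group(2) or "000000"
        try:
            stamp = datetime.strptime(date_part + time_part, "%Y%m%d%H%M%S")
        except ValueError:
            continue  # digits that do not form a valid date
        if best_time is None or stamp > best_time:
            best_path, best_time = path, stamp
    return best_path


# Inside a DSS Python recipe, the list of paths could come from
# dataiku.Folder("<your folder>").list_paths_in_partition().
```

The selected path can then be read in the recipe and written to an output dataset, so the rest of the flow always sees the newest file.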

emate
Neuron
Author

Hi @vic 

I think I saw a workaround somewhere, try this: upload the first file, create the dataset, and then upload the second file with a different name (without creating a dataset). Then, before re-running the flow, find the recipe (1), go into "files" and add these files manually (2).

(1) dss2.png

(2) dss.png

On the other hand, I don't know if this is optimal; you could also just use a stack recipe.

 

Mateusz
