Question on automating folder with multiple files

aw30 · ‎06-24-2020

I am new to Dataiku and feel I am missing something as I cannot figure out how to do a simple automation when I have a number of files within a single S3 bucket, we are using Hadoop & Hive. Every day 6 new files are added and I want to grab the latest one and use it instead of the prior day's version. The naming convention is gid_date_filename.

For example:

xxxx-06222020-users_export

xxxx-06222020-incidents_export

I thought I could set up a flow using one of each file that includes one step which is the sync recipe to adjust the name and sync to our hive metastore and then use scenarios to automate this. A trigger would be set up once a day to grab the most current file for each of the 6 distinct files.

I grabbed the latest file just to set up the flow so I have 6 datasets each having a step that syncs to our hive metastore creating 6 tables.

To automate this I thought I needed to create a dataset pointing to the entire folder instead of a single file. When I go to create a new dataset for the folder, for the S3 connection I choose HDFS_S3General and for Path in bucket I leave it as / . When I do this I get a warning that dataset at root of connection. This is atypical do you want to create a managed dataset?

How do I get started on getting this created as I know it can be done but I feel like I am just missing something basic. Thank you for your time and help in advance!!!

Liev · ‎06-29-2020

Hi @aw30

The setup you're proposing is possible in DSS, but it will involve using several concepts simultaneously. However, you can also break it down into different flow elements and consolidate once you're comfortable with the logic and how everything fits together.

1 - Connecting to your bucket. Indeed if you want DSS to connect to the top level of the bucket you will receive a warning, but if this is where the files are, then you should be ok to proceed.

2 - Selecting which files you want and is everything needed? In your flow, go to + Dataset > Cloud Storage and Social > Amazon S3, select connection and bucket. Click on Show Advanced options > File selection > Only included. This will allow you to specify which of the files (given that each day several are written into the bucket) using a regex. This would specify the text part of the file name and use digits to represent the different timestamps.

3 - Is it one dataset or several? Since you have files such as

xxxx-06222020-users_export

xxxx-06222020-incidents_export

does this mean that you need one dataset for users_export and another for incidents_export? If so, then repeat and adjust step 2 to create multiple datasets.

4 - Enable partitioning on your dataset. Open the dataset and Go to Settings > Partitioning > Add Time dimension and select granularity from Year, Month, Day or Hour. In the pattern section, you need to put the pattern that will allow DSS to recognise the different files and assign to the correct partition.

using your examples the pattern might look something like this

xxxx-%M%D%Y-users_export

Verify that the partitions are being picked up correctly by pressing the List Partitions button. Repeat for any other dataset where the same is required.

5 - Now we're ready to sync. Click on the dataset and select the Sync recipe, create a new dataset as an output. Since you want to constantly only have the last results, this output dataset doesn't need to be partitioned. Make sure you're selecting Non Partitioned in the recipe creation popup.

Create the recipe. Under Configuration > Input select Latest available from the mapping function dropdown under the Input dataset name.

You're done, run it!

I hope this helps!

View solution in original post

Liev · ‎06-29-2020

yes, the described method would work on any kind of filesystem, so HDFS should be fine once you can see the files.

View solution in original post

Liev · ‎06-29-2020

Hi @aw30

The setup you're proposing is possible in DSS, but it will involve using several concepts simultaneously. However, you can also break it down into different flow elements and consolidate once you're comfortable with the logic and how everything fits together.

1 - Connecting to your bucket. Indeed if you want DSS to connect to the top level of the bucket you will receive a warning, but if this is where the files are, then you should be ok to proceed.

2 - Selecting which files you want and is everything needed? In your flow, go to + Dataset > Cloud Storage and Social > Amazon S3, select connection and bucket. Click on Show Advanced options > File selection > Only included. This will allow you to specify which of the files (given that each day several are written into the bucket) using a regex. This would specify the text part of the file name and use digits to represent the different timestamps.

3 - Is it one dataset or several? Since you have files such as

xxxx-06222020-users_export

xxxx-06222020-incidents_export

does this mean that you need one dataset for users_export and another for incidents_export? If so, then repeat and adjust step 2 to create multiple datasets.

4 - Enable partitioning on your dataset. Open the dataset and Go to Settings > Partitioning > Add Time dimension and select granularity from Year, Month, Day or Hour. In the pattern section, you need to put the pattern that will allow DSS to recognise the different files and assign to the correct partition.

using your examples the pattern might look something like this

xxxx-%M%D%Y-users_export

Verify that the partitions are being picked up correctly by pressing the List Partitions button. Repeat for any other dataset where the same is required.

5 - Now we're ready to sync. Click on the dataset and select the Sync recipe, create a new dataset as an output. Since you want to constantly only have the last results, this output dataset doesn't need to be partitioned. Make sure you're selecting Non Partitioned in the recipe creation popup.

Create the recipe. Under Configuration > Input select Latest available from the mapping function dropdown under the Input dataset name.

You're done, run it!

I hope this helps!

aw30 · ‎06-29-2020

HI Liev,

Thank you so much for responding! Unfortunately when I choose S3 there is nothing in the connection drop down to choose from so I will need to talk to my administrator. Additionally, the files are stored within HDFS so would I still be able to use your solution if this is the case? I read about a managed folder and am wondering if for files stored within HDFS if I need to go that route instead?

It may take a little while for our admin to adjust the connections and settings for me so I can mark this as solution works if you could answer this question as well.

Again, thank you for all the help on this!

Liev · ‎06-29-2020

yes, the described method would work on any kind of filesystem, so HDFS should be fine once you can see the files.

aw30 · ‎06-29-2020

Thank you so much!

Sign up to take part

Question on automating folder with multiple files

Question on automating folder with multiple files