Loading a dataset with a changing name from an FTP server

bored_panda · December 2016

Hello

I would like to have a dataset loaded each day (with a scenario, I guess) from an FTP server.

However, I want my DSS dataset to load a different file from the FTP server each day: dataset_2016-12-01.txt today, dataset_2016-12-02.txt tomorrow, and so on.

How can I do that? Is there some kind of global variable for the date that I can insert in the name of the targeted source file? (If not, can I build one?)

Thanks

AdrienL · December 2016

You can use partitioning (by date) on an Uncached FTP dataset: specify just / as the path, then in the Partitioning tab, activate partitioning, add a time dimension and set the partitioning pattern to dataset_%Y-%M-%D.txt.
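To illustrate how such a pattern maps a day to a file name, here is a minimal Python sketch. It is not DSS code: the function name expand_partition_pattern is hypothetical, and it only assumes that %Y, %M and %D stand for the zero-padded year, month and day, as in the file names from the question.

```python
from datetime import date


def expand_partition_pattern(pattern: str, day: date) -> str:
    """Expand DSS-style time placeholders into a concrete file name.

    Assumption: %Y = 4-digit year, %M = 2-digit month, %D = 2-digit day.
    Note that this differs from strftime, where %M means minutes.
    """
    return (
        pattern.replace("%Y", f"{day.year:04d}")
        .replace("%M", f"{day.month:02d}")
        .replace("%D", f"{day.day:02d}")
    )


print(expand_partition_pattern("dataset_%Y-%M-%D.txt", date(2016, 12, 1)))
# prints dataset_2016-12-01.txt
```

Each day partition therefore points at exactly one file on the FTP server.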

You can then create a Sync recipe from this dataset to another dataset on a local storage connection (e.g. File System) with the same partitioning (the default is to keep the same partitioning and an "equals" partition dependency).

Finally, you can build this dataset (or other downstream datasets with the same partitioning) from a Scenario: just add a build step and specify CURRENT_DAY in the partition (note: you have other placeholders, like PREVIOUS_DAY, should you need them).
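As a rough picture of what these placeholders stand for, the sketch below resolves them to a day partition identifier. It is an illustration only, not how DSS implements it: the helper resolve_day_keyword is hypothetical, and it assumes day partitions are identified as YYYY-MM-DD.

```python
from datetime import date, timedelta

# Assumption: offsets in days relative to the day the scenario runs.
DAY_OFFSETS = {"CURRENT_DAY": 0, "PREVIOUS_DAY": -1}


def resolve_day_keyword(keyword: str, today: date) -> str:
    """Turn a day placeholder into a YYYY-MM-DD partition identifier."""
    if keyword not in DAY_OFFSETS:
        raise ValueError(f"Unsupported keyword: {keyword}")
    return (today + timedelta(days=DAY_OFFSETS[keyword])).isoformat()


print(resolve_day_keyword("CURRENT_DAY", date(2016, 12, 2)))
# prints 2016-12-02
print(resolve_day_keyword("PREVIOUS_DAY", date(2016, 12, 2)))
# prints 2016-12-01
```

Combined with the partitioning pattern, CURRENT_DAY on 2 December 2016 would select dataset_2016-12-02.txt.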

Hope this helps.

ridwanoabdulaze · April 2019

This is helpful. @andrian Laoillote, is there a way to store the output dataset in the format dataset_%Y-%M-%D.txt after running the scenario?

AdrienL · April 2019

That is not easily doable in DSS. You can set up an output dataset with partitioning dataset_%Y-%M-%D/.* and, in advanced options, click "Force single output file" to get only one file, but it will still be "dataset_2019-04-18/out-s0.csv.gz". If you want to change that, you'll have to write a Python or shell recipe, for instance.
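As a starting point for such a recipe, here is a minimal sketch of the renaming step in plain Python. It assumes the output files are reachable as ordinary paths on a local file system; the function name flatten_partition_files and the directory layout are illustrative, and inside DSS you would need to adapt it to however your recipe accesses the files.

```python
import shutil
import tempfile
from pathlib import Path


def flatten_partition_files(root: Path, part_name: str = "out-s0.csv.gz") -> list:
    """Copy <root>/dataset_YYYY-MM-DD/<part_name> to <root>/dataset_YYYY-MM-DD.csv.gz.

    Assumption: each partition folder holds a single file named part_name.
    """
    suffix = "".join(Path(part_name).suffixes)  # ".csv.gz" for the default name
    created = []
    for folder in sorted(root.glob("dataset_*")):
        source = folder / part_name
        # Skip flat files created by a previous run and incomplete folders.
        if folder.is_dir() and source.is_file():
            target = root / (folder.name + suffix)
            shutil.copyfile(source, target)
            created.append(target)
    return created


# Demo on a throwaway directory that mimics the layout described above.
with tempfile.TemporaryDirectory() as tmp:
    demo_root = Path(tmp)
    (demo_root / "dataset_2019-04-18").mkdir()
    (demo_root / "dataset_2019-04-18" / "out-s0.csv.gz").write_bytes(b"demo")
    print([p.name for p in flatten_partition_files(demo_root)])
    # prints ['dataset_2019-04-18.csv.gz']
```

The copy leaves the original partition folder untouched, so the DSS dataset itself keeps working.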

ridwanoabdulaze · April 2019

Thanks for your prompt response, I find it helpful. The output dataset that was created by the Sync recipe (as per your first comment) keeps appending to the existing data instead of overwriting it with the contents of the dataset_2019-04-18.csv file. As a result, after running the scenario, the file "dataset_2019-04-18/out-s0.csv" contains the data for both "dataset_2019-04-18" and "dataset_2019-03-18". Note: it works fine with dataset_2019-04-18.csv because it was the first run.

AdrienL · April 2019

Check whether the parent recipe for that dataset is in append mode: in the Input/Output tab of the recipe, there is a checkbox labelled "Append instead of overwrite". If it is not checked, I suggest opening a support ticket with a job diagnostic.

ridwanoabdulaze · April 2019

Hi @Adrian lavoillotte, thanks for your response. It is not in append mode. How can I open a ticket? Also note that I am reading the data from HDFS, not FTP.

AdrienL · April 2019

Click the "?" icon in the top-right corner, then choose "Get help".
