Loading a dataset with a changing name from an FTP server
bored_panda
Registered Posts: 11 ✭✭✭✭
Hello
I would like to have a dataset loaded each day (with a scenario, I guess) from an FTP server.
However, I want my DSS dataset to load a different file from the FTP server each day: dataset_2016-12-01.txt today, dataset_2016-12-02.txt tomorrow, and so on.
How can I do that? Is there some kind of global variable for the date that I can insert in the name of the targeted source file? (If not, can I build one?)
Thanks
Answers
-
You can use partitioning (by date) on an Uncached FTP dataset: specify just / as the path, then, in the Partitioning tab, activate partitioning, add a time dimension and set the partitioning pattern to dataset_%Y-%M-%D.txt.
You can then create a Sync recipe from this dataset to another dataset on a local storage connection (e.g. File System) with the same partitioning (the default is to keep the same partitioning and an "equals" partition dependency).
Finally, you can build this dataset (or other downstream datasets with the same partitioning) from a Scenario: just add a build step and specify CURRENT_DAY in the partition (note: you have other placeholders, like PREVIOUS_DAY, should you need them).
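To make the pattern concrete, here is a minimal Python sketch of how dataset_%Y-%M-%D.txt maps to a file name for a given day. It only illustrates the naming logic and is not how DSS implements it; the helper names file_name_for and resolve_day are made up for this example.

```python
from datetime import date, timedelta

# Tokens of the partitioning pattern mapped to Python strftime directives.
# In the pattern, %M is the month and %D is the day (unlike strftime).
TOKEN_TO_STRFTIME = {"%Y": "%Y", "%M": "%m", "%D": "%d"}


def file_name_for(pattern: str, day: date) -> str:
    """Expand a pattern such as dataset_%Y-%M-%D.txt for the given day."""
    name = pattern
    for token, directive in TOKEN_TO_STRFTIME.items():
        name = name.replace(token, day.strftime(directive))
    return name


def resolve_day(keyword: str, today: date) -> date:
    """Hypothetical helper mimicking the CURRENT_DAY / PREVIOUS_DAY placeholders."""
    if keyword == "CURRENT_DAY":
        return today
    if keyword == "PREVIOUS_DAY":
        return today - timedelta(days=1)
    raise ValueError(f"Unsupported keyword: {keyword}")
```

For example, file_name_for("dataset_%Y-%M-%D.txt", date(2016, 12, 1)) gives "dataset_2016-12-01.txt", and resolve_day("PREVIOUS_DAY", date(2016, 12, 2)) gives date(2016, 12, 1).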
Hope this helps.
-
This is helpful. @andrian Laoillote, is there a way to store the output dataset in the format dataset_%Y-%M-%D.txt after running the scenario?
-
That is not easily doable in DSS. You can set up an output dataset with the partitioning pattern dataset_%Y-%M-%D/.* and, in the advanced options, click "Force single output file" to get only one file, but it will still be named "dataset_2019-04-18/out-s0.csv.gz". If you want to change that, you'll have to write a Python or shell recipe, for instance.
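As a rough idea of what such a script could do, here is a sketch in plain Python (standard library only, no DSS API). It assumes the output files sit on a file system the script can read and write; the function name flatten_partition_file and its arguments are hypothetical. It copies the single file of a partition directory to a flat name such as dataset_2019-04-18.csv.gz.

```python
import shutil
from pathlib import Path


def flatten_partition_file(output_root: Path, partition_id: str, target_dir: Path) -> Path:
    """Copy <output_root>/dataset_<partition_id>/<single file> to
    <target_dir>/dataset_<partition_id><extension> and return the new path."""
    partition_dir = output_root / f"dataset_{partition_id}"
    files = sorted(p for p in partition_dir.iterdir() if p.is_file())
    # This sketch relies on "Force single output file" having been enabled.
    if len(files) != 1:
        raise RuntimeError(f"Expected exactly one file in {partition_dir}, found {len(files)}")

    source = files[0]
    # Keep the full extension, e.g. out-s0.csv.gz -> .csv.gz
    extension = "".join(source.suffixes)
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / f"dataset_{partition_id}{extension}"
    # copyfile overwrites an existing target, so reruns replace the old file.
    shutil.copyfile(source, target)
    return target
```

You would call it with the partition being built, e.g. flatten_partition_file(Path("/data/out"), "2019-04-18", Path("/data/flat")), where both paths are placeholders for your own locations.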
-
Thanks for your prompt response, I find it helpful. The output dataset created by the Sync recipe (as per your first comment) keeps appending to the existing data instead of overwriting it with the contents of the dataset_2019-04-18.csv file. As a result, after running the Scenario, the file "dataset_2019-04-18/out-s0.csv" contains the data for both "dataset_2019-04-18" and "dataset_2019-03-18". Note: it worked fine with dataset_2019-04-18.csv because that was the first run.
-
Check whether the parent recipe for that dataset is in append mode. You can see that in the Input/Output tab of the recipe, where there is a checkbox labelled "Append instead of overwrite". If the recipe is not in append mode, I suggest opening a support ticket with a job diagnostic.
-
Hi @Adrian lavoillotte, thanks for your response. It is not in append mode. How can I open a ticket? Also note that I am reading the data from HDFS, not FTP.
-
Click the "?" icon in the top-right corner, then choose "Get help".