Loading a dataset with a changing name from an FTP server

bored_panda
bored_panda Registered Posts: 11 ✭✭✭✭
Hello

I would like to have a dataset loaded each day (with a scenario, I guess) from an FTP server.

However, I want my DSS dataset to load a different file from the FTP server each day: dataset_2016-12-01.txt today, dataset_2016-12-02.txt tomorrow, and so on.

How can I do that? Is there some kind of global variable for the date that I can insert in the name of the targeted source file? (If not, can I build one?)

Thanks

Answers

  • AdrienL
    AdrienL Dataiker, Alpha Tester Posts: 196 Dataiker

    You can use partitioning (by date) on an Uncached FTP dataset: specify just / as the path, then in the Partitioning tab, activate partitioning, add a time dimension and set the partitioning pattern to dataset_%Y-%M-%D.txt.

    You can then create a Sync recipe from this dataset to another dataset on a local storage connection (e.g. File System) with the same partitioning (the default is to keep the same partitioning and an "equals" partition dependency).

    Finally, you can build this dataset (or other downstream datasets with the same partitioning) from a Scenario: just add a build step and specify CURRENT_DAY in the partition (note: you have other placeholders, like PREVIOUS_DAY, should you need them).
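To make the pattern concrete, here is a minimal sketch in plain Python (it does not call DSS) of how a pattern such as dataset_%Y-%M-%D.txt maps to a file name for a given day. The helper names `expand_pattern` and `previous_day` are hypothetical and only illustrate the idea. It assumes that in the partitioning pattern %M stands for the month and %D for the day, which differs from Python's strftime codes.

```python
from datetime import date, timedelta

# Assumed mapping from the partitioning-pattern placeholders to strftime codes.
# In the pattern, %M is the month and %D is the day (unlike strftime).
_PLACEHOLDERS = {"%Y": "%Y", "%M": "%m", "%D": "%d"}


def expand_pattern(pattern: str, day: date) -> str:
    """Return the file name that `pattern` designates for the given day."""
    result = pattern
    for pattern_token, strftime_code in _PLACEHOLDERS.items():
        result = result.replace(pattern_token, day.strftime(strftime_code))
    return result


def previous_day(day: date) -> date:
    """Return the day before `day` (the idea behind PREVIOUS_DAY)."""
    return day - timedelta(days=1)


if __name__ == "__main__":
    today = date(2016, 12, 2)
    print(expand_pattern("dataset_%Y-%M-%D.txt", today))
    print(expand_pattern("dataset_%Y-%M-%D.txt", previous_day(today)))
```

With the dates from the question, the partition for 2016-12-01 corresponds to dataset_2016-12-01.txt, and the one for 2016-12-02 to dataset_2016-12-02.txt.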

    Hope this helps.

  • ridwanoabdulaze
    ridwanoabdulaze Registered Posts: 3 ✭✭✭✭
    This is helpful. @andrian Laoillote, is there a way to store the output dataset in the format dataset_%Y-%M-%D.txt after running the scenario?
  • AdrienL
    AdrienL Dataiker, Alpha Tester Posts: 196 Dataiker
    That is not easily doable in DSS. You can set up an output dataset with the partitioning pattern dataset_%Y-%M-%D/.* and, in the advanced options, click "Force single output file" to get only one file, but it will still be named "dataset_2019-04-18/out-s0.csv.gz". If you want to change that, you'll have to write a Python or shell recipe, for instance.
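As a rough illustration of the Python route, the sketch below decompresses the single output file of one partition into a flat file named after the partition. It uses only the standard library and plain filesystem paths. The function name `flatten_partition_output`, the `root` directory, and the default `.txt` suffix are assumptions for the example, not DSS APIs, and the paths would need to be adapted to wherever the output dataset is actually stored.

```python
import gzip
import shutil
from pathlib import Path


def flatten_partition_output(
    root: Path,
    partition: str,
    part_name: str = "out-s0.csv.gz",
    suffix: str = ".txt",
) -> Path:
    """Decompress <root>/dataset_<partition>/<part_name> into
    <root>/dataset_<partition><suffix> and return the new path.

    An existing target file is overwritten, not appended to.
    """
    source = root / f"dataset_{partition}" / part_name
    target = root / f"dataset_{partition}{suffix}"
    with gzip.open(source, "rb") as src, open(target, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return target
```

For example, calling `flatten_partition_output(Path("/path/to/output"), "2019-04-18")` (with a hypothetical root path) would read dataset_2019-04-18/out-s0.csv.gz and write dataset_2019-04-18.txt next to the partition directory.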
  • ridwanoabdulaze
    ridwanoabdulaze Registered Posts: 3 ✭✭✭✭
    Thanks for your prompt response, I find it helpful. However, the output dataset created by the Sync recipe (as per your first comment) keeps appending to the existing data instead of overwriting it with the contents of the dataset_2019-04-18.csv file. As a result, after running the scenario, the file "dataset_2019-04-18/out-s0.csv" contains the data for both "dataset_2019-04-18" and "dataset_2019-03-18". Note: it works fine with dataset_2019-04-18.csv because it was the first run.
  • AdrienL
    AdrienL Dataiker, Alpha Tester Posts: 196 Dataiker
    Check whether the parent recipe for that dataset is in append mode. You can see that in the Input/Output tab of the recipe, where there is a checkbox labelled "Append instead of overwrite". If it is not in append mode, I suggest opening a support ticket with a job diagnostic.
  • ridwanoabdulaze
    ridwanoabdulaze Registered Posts: 3 ✭✭✭✭
    Hi @Adrian lavoillotte, thanks for your response. It is not in append mode. How can I open a ticket? Also, note that I am reading the data from HDFS, not FTP.
  • AdrienL
    AdrienL Dataiker, Alpha Tester Posts: 196 Dataiker
    Click the "?" icon in the top-right corner, then choose "Get help".