Automatically replace a file in Dataiku from an S3 bucket

PK36313
PK36313 Partner, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 10 Partner

Hi

I have a requirement under which I will receive one CSV file every Monday that I need to replace in Dataiku, and then use Sync and SQL recipes to create the required dataset.

So now I want to automate this process of replacing the file from the S3 location into Dataiku every Monday, so that I can easily run the scenario to run the flow and create the final required table.

Please let me know how I can automate this file replacement process.

Thanks in advance!

Answers

  • Sarina
    Sarina Dataiker, Dataiku DSS Core Designer, Dataiku DSS Adv Designer, Registered Posts: 317 Dataiker

    Hi @PK36313,

    How do you currently receive the input CSV file? Is this by email or another method? From your description, it sounds like you may be receiving the file manually right now and then uploading it to DSS manually. Then, the full process once the file is in DSS is automated. Is that understanding correct?

    Since it sounds like receiving the input file happens outside of DSS, it will depend how you currently receive the file and if there are any automation options on that side of things. Do you know where the input file comes from? Is it from some other application, and if so, what application? Or, is there a process where the file is manually edited by a user and then passed along to you by email?

    Depending on where the CSV file originates from, there may be options that you can investigate to automate directly placing the file on S3 or into an S3 managed folder in DSS.

    Thanks,
    Sarina

  • PK36313
    PK36313 Partner, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 10 Partner

    Hi @SarinaS

    I am receiving the file through email; then, with the help of some organization software, we bring the file into an S3 folder. We then manually load the CSV file into the DSS flow and apply a Sync recipe to create the table in Snowflake, followed by further recipes to get the required table.

    Every Monday a new file will be available in the S3 folder.

    Now I want some way for the file to be automatically replaced in the DSS flow every Monday.

    Please guide me on the best way.

    Thanks in advance!

  • Sarina
    Sarina Dataiker, Dataiku DSS Core Designer, Dataiku DSS Adv Designer, Registered Posts: 317 Dataiker

    Hi @PK36313,

    Thank you for the additional details!

    So if I understand it correctly, you have already automated bringing the file into an S3 folder. In that case, you should indeed be able to automate this.

    Instead of manually loading the CSV file into DSS, I would suggest creating a managed folder that points to your S3 folder. Then, you can use the "create dataset" option from the managed folder to create a dataset from the S3 folder:

    Screen Shot 2022-09-01 at 11.05.37 AM.png

    This dataset can then be the input for your sync recipe. You can then use a scenario with a dataset modification trigger on the input S3 folder so that the scenario triggers automatically when the S3 folder receives the new file. The scenario can have a step that builds the Snowflake table and any other subsequent tables.
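
    If you prefer to script the build step, here is a minimal sketch of what a scenario "Execute Python code" step could look like; the dataset names are placeholders for your synced Snowflake table and final table, and the dataset modification trigger itself would still be configured in the scenario settings:

    # minimal sketch of a scenario Python step; dataset names are placeholders
    from dataiku.scenario import Scenario

    scenario = Scenario()

    # build the synced Snowflake dataset, then the final table produced by the SQL recipes
    scenario.build_dataset("my_snowflake_table")
    scenario.build_dataset("my_final_table")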

    Does that make sense to you as a potential workflow? If it seems like it might not quite work for your use case, please let me know how, and I'm happy to take a look.

    Thank you,
    Sarina

  • PK36313
    PK36313 Partner, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 10 Partner

    Hi @SarinaS

    Thanks for reply

    I am receiving 2 files every Monday, in the format 'filename_mmddyy.xlsx', and the previous files are also present in the folder.

    I don't think the "create dataset" method will work; it might work if, instead of uploading the new file to the folder, we replaced the existing file with the new one.

    Please let me know some other possible way.

    Thanks in advance!

  • Sarina
    Sarina Dataiker, Dataiku DSS Core Designer, Dataiku DSS Adv Designer, Registered Posts: 317 Dataiker

    Hi @PK36313,

    It still seems to me that using the files directly from S3 will be the easiest approach. Here is a short example video fully outlining this approach, in case that helps.

    Can you check it out and let me know if this still won't meet your use case?

    Thanks,
    Sarina

  • PK36313
    PK36313 Partner, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 10 Partner

    Hi @SarinaS

    Thank you very much for the detailed reply.

    The method you described will work for one file, but I receive 2 new Excel files in the same S3 folder every Monday, with the naming convention 'filename_mmddyy.xlsx'.

    And I need both of these files in the flow to be automatically replaced, after which I will apply Sync and SQL recipes to get the required output.

    Could you please help me in this regard?

    Thanks in advance!

  • NikolayK
    NikolayK Partner, Registered Posts: 14 Partner

    For a similar problem, we use Python code based on the AWS SDK for Python, boto3 (see https://boto3.amazonaws.com/v1/documentation/api/latest/index.html). With it you can do pretty much everything: manage files in your S3 buckets, load Excel and other files by mask, etc.

    When the files are updated in S3, you simply need to re-run the flow.
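
    As a minimal sketch (the bucket name, key prefix, and date mask are placeholders you would adapt to your setup, and credentials are assumed to be available through the usual AWS mechanisms), listing and downloading this week's files by mask could look like this:

    import datetime
    import boto3

    s3 = boto3.client("s3")

    bucket = "my-bucket"            # placeholder bucket name
    prefix = "incoming/filename_"   # placeholder key prefix for the weekly files
    date_suffix = datetime.date.today().strftime("%m%d%y") + ".xlsx"

    # list the objects under the prefix and download the ones matching this week's date mask
    response = s3.list_objects_v2(Bucket=bucket, Prefix=prefix)
    for obj in response.get("Contents", []):
        key = obj["Key"]
        if key.endswith(date_suffix):
            s3.download_file(bucket, key, key.split("/")[-1])

    You could run something like this in a Python recipe or a scenario step, and then trigger the rest of the flow once the downloads complete.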

  • Sarina
    Sarina Dataiker, Dataiku DSS Core Designer, Dataiku DSS Adv Designer, Registered Posts: 317 Dataiker
    edited July 17

    Hi @PK36313,

    Thank you for the additional details! My original thought was that you were looking to add the new files filename_mmddyy.xlsx to an existing dataset, so each week you would add the new files you received to your flow. It sounds like instead of that, you want to simply fully replace your existing data with the new files that you received this week, is that correct? So your flow will not have any historical data and will simply contain the new data from the most recent two files?

    There are a lot of different options in that case. It might be easiest to do something simple like:
    (1) Have a Python recipe that has the S3 folder as input and has an output folder of any type
    (2) In your Python recipe, simply get the correct matching condition to pull the two "matching" files from this week and write them to your output folder
    (3) This folder can then always simply contain just 2 files and can be the input for the rest of your flow
    (4) You can automate this fully with a scenario that will perform the build and will always perform it after the files are automatically dropped to S3

    Here is a brief example of a Python recipe that copies the two files matching "yesterday's date" into a new output folder:

    import dataiku
    # the datetime library allows for some date checking on the filenames
    import datetime

    # input folder that contains all files, regardless of date
    s3 = dataiku.Folder("qdLbzKxz")

    # this is our new _output_ folder that will contain only the 2 files for "this week";
    # we'll write to it in this recipe
    updated_output_folder = dataiku.Folder("NBwjXsAn")

    # get the expected date for the files. You can build this from a scenario so that the recipe
    # always runs on the expected day (i.e. the day after the file date); you can change this to
    # be based off of today or any other date instead
    yesterday = datetime.datetime.today() - datetime.timedelta(days=1)
    compare_date = yesterday.strftime('%m%d%y')

    # if a filename contains compare_date, copy it over to our "current data" output folder;
    # this should result in just the two matching files in the output folder
    for file_path in s3.list_paths_in_partition():
        if compare_date in file_path:
            print("Writing out the file", file_path)
            with s3.get_download_stream(file_path) as f:
                updated_output_folder.upload_stream(file_path, f)


    Now my flow looks like this instead:

    Screen Shot 2022-09-06 at 12.23.05 PM.png


    Here are my example input files (i.e. from "last week" and "this week", obviously you'll have more historical data):

    Screen Shot 2022-09-06 at 12.23.12 PM.png

    And my output folder, which contains only the two matching files from 090522:

    Screen Shot 2022-09-06 at 12.23.19 PM.png

    I hope that helps. If you could use any further help, please pass along an example of last week's filenames and this week's filenames as well.

    Thanks,
    Sarina 
