I have a requirement where I will receive one CSV file every Monday that I need to replace in Dataiku, and then use Sync and SQL recipes to create the required dataset.
I now want to automate this process of replacing the file from the S3 location into Dataiku every Monday, so that I can simply run the scenario to run the flow and create the final required table.
Please let me know how I can automate this file replacement process.
Thanks in advance!
How do you currently receive the input CSV file? Is this by email or another method? From your description, it sounds like you may be receiving the file manually right now and then uploading it to DSS manually. Then, the full process once the file is in DSS is automated. Is that understanding correct?
Since it sounds like receiving the input file happens outside of DSS, it will depend how you currently receive the file and if there are any automation options on that side of things. Do you know where the input file comes from? Is it from some other application, and if so, what application? Or, is there a process where the file is manually edited by a user and then passed along to you by email?
Depending on where the CSV file originates from, there may be options that you can investigate to automate directly placing the file on S3 or into an S3 managed folder in DSS.
I am receiving the file through email; then, with the help of some organization software, we are bringing the file into an S3 folder. We are then manually loading the CSV file into the DSS flow, applying a Sync recipe to create the table in Snowflake, and then further recipes to get the required table.
Every Monday a new file will be available in the S3 folder.
Now I want some way for the file to be automatically replaced in the DSS flow every Monday.
Please guide me on the best way.
Thanks in advance!
Thank you for the additional details!
So if I understand it correctly, you already automated bringing the file into an S3 folder. In that case, you should indeed be able to automate this.
Instead of manually loading the CSV file into DSS, I would suggest creating a managed folder that points to your S3 folder. Then, you can use the "create dataset" option from the managed folder to create a dataset from the S3 folder:
This dataset can then be the input for your sync recipe. You can then use a scenario with a dataset modification trigger on the input S3 folder so that the scenario triggers automatically when the S3 folder receives the new file. The scenario can have a step that builds the Snowflake table and any other subsequent tables.
Does that make sense to you as a potential workflow? If it seems like it might not quite work for your use case, please let me know how, and I'm happy to take a look.
Thanks for the reply.
I am receiving 2 files every Monday in the format 'filename_mmddyy.xlsx', and the previous files are also present in the folder.
I don't think the "create dataset" method will work. It might work if, instead of uploading the new file to the folder, we replaced the existing file with the new one.
Please let me know some other possible way.
Thanks in advance!
Thank you very much for the detailed reply.
The method you described will work for one file, but I receive 2 new Excel files in the same S3 folder every Monday, following the naming convention 'filename_mmddyy.xlsx'.
I need both of these files in the flow to be replaced automatically; then I will apply Sync and SQL recipes to get the required output.
Could you please help me in this regard?
Thanks in advance!
For a similar problem, we use Python code based on the AWS SDK for Python, boto3 (see https://boto3.amazonaws.com/v1/documentation/api/latest/index.html). With it you can do pretty much everything: manage files in your S3 buckets, load Excel and other files by mask, etc.
When the files are updated in S3, you simply need to re-run the flow.
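To make the boto3 suggestion above concrete, here is a minimal sketch. The bucket name, prefixes, and helper names are hypothetical, and the 'mmddyy' filter matches the naming convention described earlier in the thread:

```python
def keys_for_date(keys, date_str):
    """Keep only the object keys whose filename carries the given mmddyy string."""
    return [k for k in keys if date_str in k]

def copy_weekly_files(bucket, prefix, date_str, dest_prefix):
    """Copy this week's files under a stable prefix the flow can read (sketch only)."""
    import boto3  # imported here so keys_for_date stays usable without AWS credentials
    s3 = boto3.client("s3")
    resp = s3.list_objects_v2(Bucket=bucket, Prefix=prefix)
    keys = [obj["Key"] for obj in resp.get("Contents", [])]
    for key in keys_for_date(keys, date_str):
        s3.copy_object(
            Bucket=bucket,
            CopySource={"Bucket": bucket, "Key": key},
            Key=dest_prefix + key.rsplit("/", 1)[-1],
        )

# e.g. copy_weekly_files("my-bucket", "incoming/", "090522", "current/")
```

Note that `list_objects_v2` returns at most 1000 keys per call; for a folder that grows one week at a time that is usually fine, but a paginator would be needed for large buckets.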
Thank you for the additional details! My original thought was that you were looking to add the new files filename_mmddyy.xlsx to an existing dataset, so each week you would add the new files you received to your flow. It sounds like instead of that, you want to simply fully replace your existing data with the new files that you received this week, is that correct? So your flow will not have any historical data and will simply contain the new data from the most recent two files?
There are a lot of different options in that case. It might be easiest to do something simple like:
(1) Have a Python recipe that has the S3 folder as input and has an output folder of any type
(2) In your Python recipe, simply get the correct matching condition to pull the two "matching" files from this week and write them to your output folder
(3) This folder can then always simply contain just 2 files and can be the input for the rest of your flow
(4) You can automate this fully with a scenario that will perform the build and will always perform it after the files are automatically dropped to S3
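For step (4), the scenario can use a dataset modification trigger on the S3 folder, or a custom Python trigger along these lines. This is only a sketch: the helper names are mine, and the folder id in the comment is a placeholder.

```python
import datetime

def expected_suffix(today=None):
    """mmddyy string for the most recent Monday (the day the files land)."""
    today = today or datetime.date.today()
    monday = today - datetime.timedelta(days=today.weekday())
    return monday.strftime('%m%d%y')

def has_new_files(paths, suffix):
    """True once files carrying this week's date suffix are present."""
    return any(suffix in p for p in paths)

# Inside a DSS custom trigger you could combine these with something like:
#   paths = dataiku.Folder("your_s3_folder_id").list_paths_in_partition()
#   if has_new_files(paths, expected_suffix()):
#       ...fire the trigger so the scenario rebuilds the flow
```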
Here is a brief example of a Python recipe that copies two files that match "yesterday's date" into a new output folder:
```python
import dataiku
import pandas as pd, numpy as np
from dataiku import pandasutils as pdu
# add the datetime library to allow for some date checking on your files
import datetime

# input folder that contains all files, regardless of date
s3 = dataiku.Folder("qdLbzKxz")
s3_info = s3.get_info()

# this is our new _output_ folder that will contain only the 2 files for
# "this week". We'll write to it in this recipe
updated_output_folder = dataiku.Folder("NBwjXsAn")

# get the expected date for the files. You can build this from a scenario so
# that the recipe always builds at the expected day (i.e. the day after the
# file date). You can change this to be based off of today or any other date
# instead
today = datetime.datetime.today()
yesterday = datetime.datetime.today() - datetime.timedelta(days=1)
compare_date = yesterday.strftime('%m%d%y')

# if our filename contains our compare_date, then move it over to our
# "current data" output folder. This should result in just the two matching
# files in our output folder
for file_path in s3.list_paths_in_partition():
    if compare_date in file_path:
        print("Writing out the file ", file_path)
        with s3.get_download_stream(file_path) as f:
            updated_output_folder.upload_stream(file_path, f)
```
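As a quick sanity check of the compare_date logic, here is what it produces for a hypothetical run date (chosen to match the 090522 example files):

```python
import datetime

# suppose the recipe runs on Tuesday, the day after the Monday files landed
run_day = datetime.datetime(2022, 9, 6)
compare_date = (run_day - datetime.timedelta(days=1)).strftime('%m%d%y')
print(compare_date)  # → 090522
```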
Now my flow looks like this instead:
Here are my example input files (i.e. from "last week" and "this week", obviously you'll have more historical data):
And my output folder, that contains only the two matching files from 090522:
I hope that helps. If you could use any other help, please pass along an example of last week's filenames and this week's file names as well.