Loading data from files (S3 location) when not all expected files are available

Vinothkumar
Level 2

My flow reads 5 input files from an S3 bucket, based on a trigger file.

Sometimes not all 5 files are delivered, but the trigger file is still placed. My flow works well when all files are present, but it fails when one or more files are missing and the scenario is started.

Is there any way to skip the part of the flow for which we don't have a file? Or is there any other solution to overcome this issue?

My current design is:

 

Folder -> Create dataset (for all 5 files) -> Sync -> Stack recipe to combine the data -> Final calculation
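
One possible way to skip the missing inputs, sketched here under assumptions: a Python step compares the files the flow expects with the files actually present in the folder, and only the ones that exist are processed. The file names and the helper `split_expected_files` are hypothetical. In DSS, the list of present paths could come from `dataiku.Folder(...).list_paths_in_partition()`, which I believe returns paths with a leading slash, so the sketch strips it before comparing.

```python
def split_expected_files(expected, present):
    """Return (available, missing) lists, preserving the order of `expected`."""
    # Normalise away a leading slash so 'a.csv' and '/a.csv' compare equal.
    present_names = {path.lstrip("/") for path in present}
    available = [name for name in expected if name.lstrip("/") in present_names]
    missing = [name for name in expected if name.lstrip("/") not in present_names]
    return available, missing


# Hypothetical names: five expected inputs, only two delivered with the trigger.
expected_files = ["a.csv", "b.csv", "c.csv", "d.csv", "e.csv"]
present_paths = ["/a.csv", "/c.csv", "/trigger.txt"]

available, missing = split_expected_files(expected_files, present_paths)
print("read:", available)
print("skip:", missing)
```

The downstream code can then loop over `available` only, or write a header-only placeholder for each entry in `missing` so the Stack recipe still finds all of its inputs.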

 

Thanks,

Vinothkumar M

3 Replies
Vinothkumar
Level 2
Author

Just to add a note.

From the trigger file, I can find out which files have been placed in the input path.

So I am writing Python code based on the trigger file content, but I am unable to read Excel files from the path.

with handle1.get_download_stream('/Sites.xlsx') as f:
    data = f.readline()
   

PK... [Content_Types].xml ... (unreadable binary output, truncated)
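
This kind of output is what a raw read of an Excel file is expected to look like: an .xlsx file is a zip archive of XML parts, so its first bytes are the zip signature `PK`, followed by entry names such as `[Content_Types].xml`. A minimal sketch using only the standard library, with a tiny in-memory zip standing in for the real file (the helper name `looks_like_xlsx` is hypothetical):

```python
import io
import zipfile


def looks_like_xlsx(raw_bytes):
    """True if the bytes are a zip archive that contains [Content_Types].xml."""
    buffer = io.BytesIO(raw_bytes)
    if not zipfile.is_zipfile(buffer):
        return False
    with zipfile.ZipFile(buffer) as archive:
        return "[Content_Types].xml" in archive.namelist()


# Build a stand-in archive; a real workbook has many more parts than this.
fake_workbook = io.BytesIO()
with zipfile.ZipFile(fake_workbook, "w") as archive:
    archive.writestr("[Content_Types].xml", "<Types/>")
raw = fake_workbook.getvalue()

print(raw[:2])                                   # zip signature bytes
print(looks_like_xlsx(raw))                      # zip with the expected part
print(looks_like_xlsx(b"col1,col2\n1,2\n" * 4))  # plain CSV text is not a zip
```

So the file is being downloaded correctly; it just needs an Excel parser rather than `readline()`.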

 

Is there a way to read Excel files specifically from S3? I want to use the result as a DataFrame.

@Andrey, @Nicolas_Servel any thoughts?

Thanks,

Vinothkumar M 

Kiran
Level 2

Hello,

Not sure if you're still looking for this; I just came across this thread while looking for a similar solution. For me, pandas read_excel worked.

In Dataiku, I created a folder for my S3 location that contains some Excel files, and created a Python recipe to read the Excel files from that S3 folder.

import dataiku
import pandas as pd

s3_folder = dataiku.Folder('vx343....')

df = pd.read_excel(s3_folder.get_download_stream('my_excel.xlsx'), sheet_name='mydata')
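
If passing the download stream directly does not work in a given environment (some Excel readers need a seekable file object, and a network stream may not be seekable), one hedged variant is to copy the stream into an in-memory buffer first. This is only a sketch: the helper names `buffer_stream` and `read_excel_from_folder` are hypothetical, `folder` is assumed to be a `dataiku.Folder`, and pandas is assumed to have an Excel engine such as openpyxl installed.

```python
import io


def buffer_stream(stream, chunk_size=1024 * 1024):
    """Copy a readable binary stream into a seekable in-memory buffer."""
    buffer = io.BytesIO()
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        buffer.write(chunk)
    buffer.seek(0)  # rewind so the reader starts at the first byte
    return buffer


def read_excel_from_folder(folder, path, **read_excel_kwargs):
    """Read an Excel file from a folder-like object into a DataFrame."""
    import pandas as pd  # imported lazily so buffer_stream works without pandas

    with folder.get_download_stream(path) as stream:
        return pd.read_excel(buffer_stream(stream), **read_excel_kwargs)


# Usage inside a DSS Python recipe (not run here):
# df = read_excel_from_folder(dataiku.Folder('vx343....'), 'my_excel.xlsx',
#                             sheet_name='mydata')
```

Buffering holds the whole file in memory, which should be fine for typical spreadsheets but is worth keeping in mind for very large ones.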

 

Hope this is helpful for you or others who come across this thread. 

 

Thanks,

Kiran

Vinothkumar
Level 2
Author
1. First option:
I tried to read the files that are available in the path. I am able to read a file in CSV format, but not one in Excel format, and I am not sure why. It looks like DSS mainly supports txt and csv.
Code:
with handle1.get_download_stream('/dqs/DQS_Reference Study Sites.csv') as f:
    data = f.readlines()  # Able to read the CSV. If I put an Excel file in the same place and try to read it, the content comes back as some kind of XML component.
2. Second option:
Instead of reading Excel via Python, we could create an empty Excel file with the same headers as the original file and place it in S3, so that the regular flow can still run.
Here I am able to write an empty DataFrame with just the columns as a CSV file, but I am not able to write an Excel file the same way.
Code:
with handle1.get_writer(Filename) as writer:
    writer.write(network_df.to_csv().encode("utf-8"))  # Works fine, but the same approach with to_excel does not work.

If either of these options works, that will solve my problem. Can someone help me here?
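
For the second option, a likely reason `to_excel` does not work the same way is that `to_csv()` with no arguments returns a string, whereas `to_excel` writes to a path or file-like object and returns nothing to encode. A minimal sketch under that assumption writes the workbook into an in-memory buffer and passes the resulting bytes to the folder writer. The helper names are hypothetical, `folder` is assumed to be the managed folder handle (`handle1` above), and pandas needs an Excel engine such as openpyxl available.

```python
import io


def dataframe_to_xlsx_bytes(df):
    """Serialise a DataFrame to .xlsx bytes via an in-memory buffer."""
    buffer = io.BytesIO()
    # to_excel accepts a file-like object; index=False keeps only the real columns.
    df.to_excel(buffer, index=False)
    return buffer.getvalue()


def upload_dataframe_as_excel(folder, filename, df):
    """Write df as an Excel file into the folder; return the number of bytes written."""
    payload = dataframe_to_xlsx_bytes(df)
    with folder.get_writer(filename) as writer:
        writer.write(payload)
    return len(payload)


# Usage inside a DSS Python recipe (not run here):
# upload_dataframe_as_excel(handle1, Filename, network_df)
```

When `network_df` has columns but no rows, this would be expected to place a header-only workbook that the regular flow can read in place of the missing file.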