S3 dataset with more than 1500 csv files

Solved!
Mario_Burbano
Level 2
Dear all, 

When executing a Spark query on a dataset created from S3, I get the following type of error:

 

Caused by: com.dataiku.dss.shadelib.com.amazonaws.services.s3.model.AmazonS3Exception: Bad Request (Service: Amazon S3; Status Code: 400; Error Code: 400 Bad Request

 

The URI in question contains more than 4000 CSV files of about 50 MB each. I have launched the job multiple times, and the error does not occur on the same file each time. It does, however, always occur when reading files in "positions" 1200 to 1400: for files named part-00000 to part-04XXX, the errors occur when reading part-012XX to part-014XX.

Thanks for your help

1 Solution
Mario_Burbano
Level 2
Author

Hi @CoreyS

The problem was caused by the duration that DSS requested for the STS token. By default, DSS asked for a duration of 1 hour, which was not long enough for this query. We observed this on DSS version 7.0.2.

We then found that DSS version 8 has a parameter in the S3 connection configuration, STS token duration, which let us specify a duration long enough to run the query in question.

Regards, 
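
For readers sizing the STS token duration for a similar job, here is a rough back-of-the-envelope sketch in Python. It is not part of DSS: the function name and the safety margin are hypothetical, and it assumes the token is issued when the job starts and that files are read at a roughly steady rate (Spark reads in parallel, so the file "position" is only an approximate proxy for elapsed time). The figures come from the thread: about 4000 files, failures around files 1200 to 1400, and a 1 hour default token.

```python
import math


def estimate_token_duration_seconds(total_files, files_read_before_failure,
                                    token_lifetime_seconds, safety_margin=1.25):
    """Estimate how long a token must live for the whole job to finish.

    Hypothetical helper: assumes a steady read rate and a token issued
    at job start, so the failure point marks the token's expiry.
    """
    if files_read_before_failure <= 0 or total_files <= 0:
        raise ValueError("file counts must be positive")
    # Average time spent per file before the token expired.
    seconds_per_file = token_lifetime_seconds / files_read_before_failure
    # Extrapolate to the full dataset, then add headroom.
    estimated_job_seconds = seconds_per_file * total_files
    return math.ceil(estimated_job_seconds * safety_margin)


# Failures were seen between roughly file 1200 and file 1400.
low = estimate_token_duration_seconds(4000, 1400, 3600)
high = estimate_token_duration_seconds(4000, 1200, 3600)
print(f"Suggested token duration: {low / 3600:.1f} to {high / 3600:.1f} hours")
```

With these inputs the estimate lands at roughly 3.6 to 4.2 hours, so a 1 hour token would expire well before the job ends. Treat this as an order-of-magnitude guide only; the duration that can actually be requested is also capped on the AWS side.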


2 Replies
CoreyS
Dataiker Alumni

Hi, @Mario_Burbano! Can you provide further details in the thread, such as your DSS version, to help other users find a solution? Also, can you let us know if you've tried any fixes already? This should lead to a quicker response from the community.
