
S3 dataset with more than 1500 csv files

Level 2

Dear all, 

When executing a Spark query on a dataset created from S3, I get the following type of error:

 

Caused by: com.dataiku.dss.shadelib.com.amazonaws.services.s3.model.AmazonS3Exception: Bad Request (Service: Amazon S3; Status Code: 400; Error Code: 400 Bad Request

 

The URI in question contains more than 4000 CSV files of about 50 MB each. I have launched the job multiple times and observed that the error does not occur on the same file each time, but it consistently happens for files in "positions" 1200 to 1400; e.g., for files named part-00000 through part-04XXX, the errors occur when reading part-012XX to part-014XX.

Thanks for your help

2 Replies
Community Manager

Hi, @Mario_Burbano! Can you provide any further details on the thread to assist users in helping you find a solution (for example, your DSS version)? Also, can you let us know if you've tried any fixes already? This should lead to a quicker response from the community.

Looking for more resources to help you use DSS effectively and upskill your knowledge? Check out these great resources: Dataiku Academy | Documentation | Knowledge Base

A reply answered your question? Mark as ‘Accepted Solution’ to help others like you!
Level 2
Author

Hi @CoreyS

The problem was caused by the duration DSS requested for the STS token: by default, DSS asked for a 1-hour token, which expired before the job finished reading all the files. This was observed on DSS version 7.0.2.
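As a rough sanity check (a sketch using only the numbers reported in this thread, and assuming the files are read roughly sequentially at a steady rate), the position where the failures begin is consistent with a 1-hour token expiring mid-job:

```python
# Back-of-the-envelope check of the STS-expiry hypothesis.
# All numbers below are taken from this thread; the steady, sequential
# read rate is an assumption for illustration only.
TOKEN_LIFETIME_S = 3600      # default STS token duration requested by DSS (1 hour)
FIRST_FAILING_FILE = 1200    # observed: errors start around part-012XX
FILE_SIZE_MB = 50            # each part file is about 50 MB

# If the token expires after 1 hour and failures start around file 1200,
# the job was averaging about 3 seconds per file...
seconds_per_file = TOKEN_LIFETIME_S / FIRST_FAILING_FILE

# ...which corresponds to a plausible read throughput of roughly 17 MB/s.
implied_throughput_mb_s = FILE_SIZE_MB / seconds_per_file

print(f"{seconds_per_file:.1f} s/file -> ~{implied_throughput_mb_s:.0f} MB/s")
```

The spread in the failure position (part-012XX to part-014XX) would then just reflect run-to-run variation in read speed, which also explains why the error never hits the same file twice.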

We then found that DSS version 8 adds a parameter to the S3 connection configuration, namely STS token duration, which allowed us to request a token lifetime long enough for the query in question to complete.

Regards, 
