S3 dataset with more than 1500 csv files

Mario_Burbano

Dear all,

When executing a Spark query on a dataset created from S3, I get the following type of error:

Caused by: com.dataiku.dss.shadelib.com.amazonaws.services.s3.model.AmazonS3Exception: Bad Request (Service: Amazon S3; Status Code: 400; Error Code: 400 Bad Request)

The URI in question contains more than 4000 CSV files of about 50 MB each. I have launched the job multiple times and observed that the error does not occur on the same file each run; instead, it happens while reading files in positions 1200 to 1400. That is, for files named part-00000 through part-04XXX, the errors occur while reading part-012XX through part-014XX.
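To illustrate the read pattern outside DSS, a minimal sketch along these lines (using boto3; the bucket and prefix names are placeholders) reads the part files in order and logs the position and elapsed time of the first failure, which helps distinguish a time-based issue from a problem with a specific file:

import time
import boto3
from botocore.exceptions import ClientError

# Placeholder bucket/prefix standing in for the real dataset location.
BUCKET = "my-bucket"
PREFIX = "path/to/dataset/"

s3 = boto3.client("s3")
start = time.monotonic()

# List every part file under the prefix (paginated, since there are 4000+).
paginator = s3.get_paginator("list_objects_v2")
keys = [
    obj["Key"]
    for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX)
    for obj in page.get("Contents", [])
]

for i, key in enumerate(sorted(keys)):
    try:
        # Read each file fully, mimicking the sequential scan of the job.
        s3.get_object(Bucket=BUCKET, Key=key)["Body"].read()
    except ClientError as e:
        # A failure that always lands near the same elapsed time (not the
        # same file) points at credential expiry rather than a bad object.
        elapsed = time.monotonic() - start
        print(f"failed at file #{i} ({key}) after {elapsed:.0f}s: "
              f"{e.response['Error']['Code']}")
        raise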

Thanks for your help

Best Answer

  • Mario_Burbano
    Answer ✓

    Hi @CoreyS,

    The problem was due to the lifetime of the STS token requested by DSS. By default, DSS asked for a duration of 1 hour, so the credentials expired partway through the job; that is why the failure always landed around the same position in the file list rather than on a specific file. This was observed on DSS version 7.0.2.

    We then found that DSS version 8 adds a parameter to the S3 connection configuration, namely STS token duration, which allowed us to request a token valid long enough for the query in question to complete.
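    For reference, the same knob exists at the STS API level: AssumeRole takes a DurationSeconds parameter, whose default of 3600 seconds is the 1-hour limit we were hitting. A minimal boto3 sketch, with a placeholder role ARN, that requests longer-lived credentials:

    import boto3

    sts = boto3.client("sts")

    # Placeholder role ARN. DurationSeconds defaults to 3600 (1 hour);
    # raising it (up to the role's configured maximum session duration,
    # at most 12 hours) gives the job time to read all the files.
    creds = sts.assume_role(
        RoleArn="arn:aws:iam::123456789012:role/my-dss-role",
        RoleSessionName="dss-s3-read",
        DurationSeconds=4 * 3600,  # 4 hours instead of the 1-hour default
    )["Credentials"]

    # An S3 client built on the longer-lived temporary credentials.
    s3 = boto3.client(
        "s3",
        aws_access_key_id=creds["AccessKeyId"],
        aws_secret_access_key=creds["SecretAccessKey"],
        aws_session_token=creds["SessionToken"],
    )

    In DSS 8 itself no code is needed: this is simply the STS token duration field on the S3 connection.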

    Regards,

Answers

  • CoreyS

    Hi, @Mario_Burbano! Can you provide any further details on the thread to assist users in helping you find a solution (e.g., your DSS version)? Also, can you let us know if you have tried any fixes already? This should lead to a quicker response from the community.
