Common Crawl S3
                    mattmagic                
                    I am currently trying to connect the Common Crawl S3 to Dataiku.
I have tried different configurations. However, I am not sure what to enter as "Access Key" and "Secret Key". I assume these are not my private AWS credentials.
Does anyone have experience with that?
Thanks,
Matthew
            Best Answer
- 
            Hi, thanks for your patience. Somehow, I can't manage to connect to the commoncrawl bucket.
 My most recent error is the following:
 So I am really unsure whether you can access the bucket from Dataiku or not.
Answers
- 
            Hi,
 Credential-less access to S3 is not supported. However, since the "commoncrawl" bucket is public, using your private AWS credentials will work.
- 
            Hi, this is my current setup:
 However, when adding an S3 dataset, I get the following error: "Could not list buckets: The request signature we calculated does not match the signature you provided. Check your key and signing method. (Service: Amazon S3; Status Code: 403; Error Code: SignatureDoesNotMatch; Request ID: BFBCCF653E7B199D)"
- 
            A Google search suggests an issue with your credentials: https://stackoverflow.com/questions/2777078/amazon-mws-request-signature-calculated-does-not-match-the-signature-provided
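 A SignatureDoesNotMatch error is often caused by a secret key that was pasted with stray whitespace or cut off. Here is a minimal sketch of a sanity check you could run on the values before entering them in the connection settings. The function name `credential_warnings` is hypothetical, and the length checks (20 characters for the access key ID, 40 for the secret key) are heuristics based on the usual shape of AWS keys, not a validation against AWS. The sample values are the placeholder keys from the AWS documentation, not real credentials.

```python
from typing import List


def credential_warnings(access_key: str, secret_key: str) -> List[str]:
    """Return heuristic warnings about the shape of an AWS key pair.

    This does not contact AWS; it only catches common copy/paste mistakes.
    """
    warnings = []
    # Whitespace picked up while copying changes the computed signature.
    if access_key != access_key.strip() or secret_key != secret_key.strip():
        warnings.append("key has leading or trailing whitespace")
    # Assumption: access key IDs are typically 20 characters long.
    if len(access_key.strip()) != 20:
        warnings.append("access key ID is not 20 characters long")
    # Assumption: secret access keys are typically 40 characters long.
    if len(secret_key.strip()) != 40:
        warnings.append("secret access key is not 40 characters long")
    return warnings


if __name__ == "__main__":
    # Placeholder credentials from the AWS documentation.
    access = "AKIAIOSFODNN7EXAMPLE"
    secret = "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY "  # note the trailing space
    for warning in credential_warnings(access, secret):
        print("Warning:", warning)
```

 If this prints nothing for your keys and the error persists, the key pair itself may be wrong or deactivated, so generating a new one in the IAM console is probably the next thing to try.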
- 
            Even though the bucket is public, if your AWS key does not have full permissions (i.e. if it belongs to a restricted IAM user), you need to grant it explicit access to the commoncrawl bucket. Attach the following policy to your IAM user:
 {
   "Version": "2012-10-17",
   "Statement": [
     {
       "Sid": "Stmt1503647467000",
       "Effect": "Allow",
       "Action": [
         "s3:GetObject",
         "s3:ListBucket"
       ],
       "Resource": [
         "arn:aws:s3:::commoncrawl/*",
         "arn:aws:s3:::commoncrawl"
       ]
     }
   ]
 }
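 If you need the same read-only policy for another public bucket, here is a minimal sketch that generates the document from a bucket name. The helper `build_read_policy` is a hypothetical name, and the sketch leaves out the optional "Sid" field. It only builds the JSON; you still have to attach the result to your IAM user yourself.

```python
import json


def build_read_policy(bucket: str) -> dict:
    """Return an IAM policy document granting read and list access to one bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:ListBucket"],
                "Resource": [
                    f"arn:aws:s3:::{bucket}/*",  # objects in the bucket (for GetObject)
                    f"arn:aws:s3:::{bucket}",    # the bucket itself (for ListBucket)
                ],
            }
        ],
    }


if __name__ == "__main__":
    # Print the policy for the commoncrawl bucket, ready to paste into the IAM console.
    print(json.dumps(build_read_policy("commoncrawl"), indent=2))
```

 Both resource entries are needed: s3:ListBucket applies to the bucket ARN, while s3:GetObject applies to the objects under it.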
- 
            That works, thanks a lot!

