Common Crawl S3

Level 2
I am trying to connect the Common Crawl S3 bucket to Dataiku.

I have tried different configurations. However, I am not sure what to enter as "Access Key" and "Secret Key". I assume these are not my private AWS credentials.
Does anyone have experience with this?

Thanks,

Matthew
Dataiker
Hi,

Credential-less access to S3 is not supported. However, since the "commoncrawl" bucket is public, using your private AWS credentials will work.
Level 2
Author

Hi,

This is my current setup:

However, when adding an S3 dataset I get the following error:

"Could not list buckets: The request signature we calculated does not match the signature you provided. Check your key and signing method. (Service: Amazon S3; Status Code: 403; Error Code: SignatureDoesNotMatch; Request ID: BFBCCF653E7B199D)"

Dataiker
A Google search suggests an issue with your credentials: https://stackoverflow.com/questions/2777078/amazon-mws-request-signature-calculated-does-not-match-the-signature-provided
Level 2
Author

Hi,

Thanks for your patience. Somehow, I can't manage to connect to the commoncrawl bucket.

My most recent error is the following:

So I am really unsure whether the bucket can be accessed from Dataiku at all.

Dataiker
Even though the bucket is public, if your AWS key does not have full permissions (i.e. if it belongs to a restricted IAM user), you need to grant it explicit access to the commoncrawl bucket. Attach the following policy to your IAM user:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "Stmt1503647467000",
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::commoncrawl/*",
        "arn:aws:s3:::commoncrawl"
      ]
    }
  ]
}
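
If you prefer to script this rather than paste the JSON into the AWS console, here is a minimal Python sketch of the same policy. It assumes boto3 is installed and that the credentials running it are allowed to edit IAM users. The function names, the user name "dataiku-user" and the policy name "commoncrawl-read" are placeholders for illustration, not anything Dataiku requires. The optional "Sid" field is left out.

```python
import json


def build_read_policy(bucket: str) -> dict:
    """Return an IAM policy granting read and list access to one bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:ListBucket"],
                "Resource": [
                    f"arn:aws:s3:::{bucket}/*",  # objects, for s3:GetObject
                    f"arn:aws:s3:::{bucket}",    # the bucket itself, for s3:ListBucket
                ],
            }
        ],
    }


def attach_policy(user_name: str, policy_name: str, policy: dict) -> None:
    """Attach the policy inline to an IAM user (needs boto3 and IAM rights)."""
    import boto3  # imported lazily so the builder above works without boto3

    iam = boto3.client("iam")
    iam.put_user_policy(
        UserName=user_name,
        PolicyName=policy_name,
        PolicyDocument=json.dumps(policy),
    )


policy = build_read_policy("commoncrawl")
print(json.dumps(policy, indent=2))
# Hypothetical names; uncomment and adjust for your own IAM user:
# attach_policy("dataiku-user", "commoncrawl-read", policy)
```

Note that the two Resource entries are both needed: s3:ListBucket applies to the bucket ARN, while s3:GetObject applies to the objects under it.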
Level 2
Author
That works, thanks a lot!