Read CSVs from S3 folder, process, write processed CSVs to S3 folder

Solved!
meevans1
Level 2

How should I:

1. Read CSVs from an S3 folder

2. Process these CSVs with custom Python code

3. Write these processed CSVs to an S3 folder (a different folder from the input, I guess)

Thanks in advance


2 Replies
MiguelangelC
Dataiker

Hi,

1)
In order to connect to an S3 bucket, you first need to have such a connection defined in your DSS instance.
You can create new connections in your node by going to Administration > Connections > New Connection > Amazon S3. You can follow the documentation on the necessary prerequisites: https://doc.dataiku.com/dss/latest/connecting/s3.html

Once you have set up the connection, from the Flow a dataset or folder can be created pointing to the S3 connection and the particular file/path in the bucket.

2)
This can be done with either a code recipe or a notebook, depending on your requirements.

3)
Provided the already existing S3 connection has write permissions, you can reuse it to write the data wherever you want in the bucket, e.g. by pointing an output folder at a different path.
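For step 2, the processing itself is just plain Python inside the recipe. A minimal sketch, where the transform is only a placeholder (it uppercases every cell; substitute your own logic):

```python
import csv
import io

def process_csv(text):
    """Toy transform: uppercase every cell of a CSV given as a string.
    Replace the body with your actual processing logic."""
    rows = [[cell.upper() for cell in row]
            for row in csv.reader(io.StringIO(text))]
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(rows)
    return out.getvalue()
```

Keeping the transform as a pure function like this makes it easy to unit-test outside DSS before wiring it into the recipe.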

Since these questions deal with the basic functionalities of DSS, I think you'd benefit greatly from going through the basic DSS learning path: https://academy.dataiku.com/path/core-designer

Ignacio_Toledo

Hi @meevans1. I think it's important to add one extra piece of information to the process described by @MiguelangelC: the API calls you'll need in order to read and write data in an S3 bucket connected as a Folder in Dataiku. You'll need "get_download_stream" and "upload_stream" for the reading and writing operations, or you can install the boto3 library instead.

Hope this helps once you've followed the instructions from @MiguelangelC.

Cheers.
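A minimal sketch of those calls in a Python recipe. The folder IDs ("s3_input", "s3_output") and the "processed/" prefix are hypothetical, and the processing step is a placeholder; the dataiku package is only importable inside DSS, so the folder I/O is wrapped in a function:

```python
import io
import posixpath

def output_path(input_path, prefix="processed"):
    """Map an input file path to a path in the output folder,
    e.g. '/raw/data.csv' -> 'processed/data.csv'."""
    return posixpath.join(prefix, posixpath.basename(input_path))

def copy_and_process():
    import dataiku  # available inside DSS only

    input_folder = dataiku.Folder("s3_input")    # hypothetical folder IDs
    output_folder = dataiku.Folder("s3_output")
    for path in input_folder.list_paths_in_partition():
        if not path.endswith(".csv"):
            continue
        # Read the raw bytes from the S3-backed folder
        with input_folder.get_download_stream(path) as stream:
            data = stream.read()
        # ... process `data` (bytes) here ...
        # Write the result under a different path in the output folder
        output_folder.upload_stream(output_path(path), io.BytesIO(data))
```

The same pattern works whether the two folders point at different prefixes of one bucket or at two separate connections.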