Writing Data to s3 from Dataiku

Solved!
Ankur30
Level 3

Good morning,

I am working on writing/appending data to an S3 bucket from Dataiku, but every time I run my Sync recipe a new CSV file is created. I want the data to go into a single CSV file each time the recipe runs.

Kindly help me with a solution. Please find the attached screenshot.

4 Replies
AlexT
Dataiker

Hi Ankur,

By default, new data is written to new files when syncing to an S3 dataset.

To change this behavior, edit the settings of the output dataset: under Advanced, enable "Force single output file". You can also set the output file base name.

Please refer to the screenshot below:

[Screenshot: output dataset Advanced settings, showing the "Force single output file" option and the file base name field]
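If you ever need to set this from code instead of the UI, here is a minimal sketch using the public API. Note that the parameter key names "forceSingleOutputFile" and "fileBaseName", as well as the dataset name, are assumptions on my side, so inspect the raw settings JSON to confirm them on your instance:

```
import dataiku

client = dataiku.api_client()
project = client.get_project(dataiku.default_project_key())

# "my_s3_output" is a placeholder for your output dataset name
settings = project.get_dataset("my_s3_output").get_settings()

params = settings.get_raw()["params"]
# print(params)  # inspect this to confirm the exact key names on your instance
params["forceSingleOutputFile"] = True  # assumed key for "Force single output file"
params["fileBaseName"] = "out"          # assumed key for the file base name
settings.save()
```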

Let me know if that works for you. 

Ankur30
Level 3
Author

Hi @AlexT,

Thanks for this, but I want to write all the input DSS datasets in CSV format to my S3 bucket using a Python recipe. While writing, I am getting an error; attached is a screenshot of the error message.

Regards,

Ankur.

AlexT
Dataiker

Hi,

The error suggests you are using code that writes to the local filesystem.

For non-filesystem managed folders (HDFS, S3, etc.), you need to use the read/download and write/upload APIs.

For example, use upload_stream() or upload_file(). See https://doc.dataiku.com/dss/latest/python-api/managed_folders.html for more details.

Here is a generic example:

```
import dataiku

managed_folder_id = "URKU7Oqb"

# Read the input dataset into a pandas DataFrame
my_dataset = dataiku.Dataset("customers_labeled_prepared")
df = my_dataset.get_dataframe()

# Serialize the DataFrame to CSV bytes
csv_bytes = df.to_csv(index=False).encode("utf-8")

# Upload the CSV to the managed folder
output_folder = dataiku.Folder(managed_folder_id)
output_folder.upload_stream("some_name.csv", csv_bytes)
```
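If you want to push several input datasets in one run, the same pattern extends to a loop. A minimal sketch, assuming the dataset names below are placeholders for your actual inputs:

```
import dataiku

managed_folder_id = "URKU7Oqb"
output_folder = dataiku.Folder(managed_folder_id)

# Placeholder names -- replace with your actual input dataset names
dataset_names = ["dataset_a", "dataset_b", "dataset_c"]

for name in dataset_names:
    df = dataiku.Dataset(name).get_dataframe()
    # Write one CSV per dataset, named after the dataset
    output_folder.upload_stream(name + ".csv", df.to_csv(index=False).encode("utf-8"))
```

Alternatively, if you already have a CSV file on local disk, upload_file() takes a local file path instead of in-memory bytes.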
Ankur30
Level 3
Author

Hi @AlexT ,

Thank you for all the help and support you have provided so far. Looking forward to your continued support; I really appreciate it.

Thank you,

Ankur.
