Hi - I have the following code in a PySpark recipe, but it splits the contents of the dataset across two physical files. As you can see in the PNG, the two files at the top were manually copied into the folder, while empinfot1 and empinfo5 were created by the code below. The managed folders worked fine but produced cryptic file names. How do I stop the output from splitting into two files? I tried both write.mode and write.csv.
# -*- coding: utf-8 -*-
import dataiku
from dataiku import spark as dkuspark
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.functions import col, column, concat, lit
sc = SparkContext.getOrCreate()
sqlContext = SQLContext(sc)
# Read recipe inputs
headcount_for_intdelivery = dataiku.Dataset("headcount_for_intdelivery")
headcount_for_intdelivery_df = dkuspark.get_dataframe(sqlContext, headcount_for_intdelivery)
s3_path = 's3://mypath/EMP_INFO3.txt'
#Write dataset
#headcount_for_intdelivery_df.write.mode("overwrite").text(s3_path)
headcount_for_intdelivery_df.write.csv(path=s3_path, header="true", mode="overwrite", sep="|")