Hi - I have the following code in a PySpark recipe, but it splits the contents of the dataset across two physical files. As you can see in the PNG, the two files at the top were manually copied into the folder, while empinfot1 and empinfo5 were created by the code below. The managed folders worked fine but produced cryptic file names. How do I stop the output from splitting into two files? I tried both write.mode and write.csv.
# -*- coding: utf-8 -*-
import dataiku
from dataiku import spark as dkuspark
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.functions import col, column, concat, lit
sc = SparkContext.getOrCreate()
sqlContext = SQLContext(sc)
# Read recipe inputs
headcount_for_intdelivery = dataiku.Dataset("headcount_for_intdelivery")
headcount_for_intdelivery_df = dkuspark.get_dataframe(sqlContext, headcount_for_intdelivery)
s3_path = 's3://mypath/EMP_INFO3.txt'
#Write dataset
#headcount_for_intdelivery_df.write.mode("overwrite").text(s3_path)
headcount_for_intdelivery_df.write.csv(path=s3_path, header="true", mode="overwrite", sep="|")