Efficient way to write massive dataset to output in Dataiku


Hi Team,

I'm using PySpark with Dataiku. After processing the data, I'm running into an issue when writing it to the output dataset. Could you please suggest an efficient way to write the data to the output? The dataset has roughly 40 million rows, and I get an error at line 15 of the recipe below while writing (as the data is massive).

#Recipe

1 import dataiku
2 from dataiku import spark as dkuspark
3 from pyspark import SparkContext
4 from pyspark.sql import SQLContext, SparkSession

5 spark = SparkSession.builder.enableHiveSupport().getOrCreate()
6 sqlContext = SQLContext(spark.sparkContext, spark)

# Read recipe inputs
7 Table_A = dataiku.Dataset("A")
8 Table_A_df = dkuspark.get_dataframe(sqlContext, Table_A)

9 Table_B = dataiku.Dataset("B")
10 Table_B_df = dkuspark.get_dataframe(sqlContext, Table_B)

# Register the input DataFrames as temp views so they can be queried with SQL
11 Table_A_df.createOrReplaceTempView("Table_A")
12 Table_B_df.createOrReplaceTempView("Table_B")

# Preprocessing
13 output_df = sqlContext.sql("""SELECT * FROM Table_A INNER JOIN Table_B ON Table_A.id = Table_B.id""")

# Write recipe outputs
14 output = dataiku.Dataset("output")
15 dkuspark.write_with_schema(output, output_df)

Note: output_df is a PySpark DataFrame.
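One variation I've been considering (just a sketch, not something I've confirmed fixes the error) is to repartition output_df before the write so it is spread over more, smaller tasks. The partition count of 200 below is a guess and would need tuning for ~40 million rows; this would replace lines 14-15 of the recipe above:

# Sketch only: repartition before writing so the write is split into smaller tasks.
# 200 is an arbitrary starting point, not a tuned value.
output_df = output_df.repartition(200)

output = dataiku.Dataset("output")
dkuspark.write_with_schema(output, output_df)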
