DataFrame built with PySpark keeps running without yielding any results.

Cancun_Mx
Level 1

Hi everyone,

I have a challenge with a Jupyter notebook using PySpark. The trouble comes when I try to write a DataFrame with the write_with_schema instruction. The complete code is:

import dataiku
from dataiku import spark as dkuspark
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext.getOrCreate()
sqlContext = SQLContext(sc)

# Read recipe inputs
VW_DL_AUTSINIESTROS = dataiku.Dataset("VW_DL_AUTSINIESTROS")
VW_DL_AUTSINIESTROS_df = dkuspark.get_dataframe(sqlContext, VW_DL_AUTSINIESTROS)
P3D = dataiku.Dataset("P3D")
P3D_df = dkuspark.get_dataframe(sqlContext, P3D)

-------- cell #2

# Cast to string
P3D_df = P3D_df.withColumn('NUMCOMPLETOCOTIZACION', P3D_df.NUMCOMPLETOCOTIZACION.cast('string'))
VW_DL_AUTSINIESTROS_df = VW_DL_AUTSINIESTROS_df.withColumn('NUMCOMPLETOCOTIZACION', VW_DL_AUTSINIESTROS_df.NUMCOMPLETOCOTIZACION.cast('string'))

-------- cell #3

tabla = P3D_df.join(VW_DL_AUTSINIESTROS_df, on=['NUMCOMPLETOCOTIZACION'], how='left')
display(tabla)
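
(Note: in a plain Jupyter kernel, display(tabla) only prints the DataFrame object; Spark is lazy, so no rows are computed at this point. A minimal sketch to actually pull a few rows, assuming the same tabla variable:)

tabla.show(5)  # Spark action: runs the join and prints 5 rows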

-------- cell #4

cols = ['FECHA_EMISION', 'NUMCOMPLETOCOTIZACION', 'OCURRIDO_NETO']

-------- cell #5

tabla = tabla.select(*cols)

-------- cell #6

# Compute recipe outputs from inputs
# TODO: Replace this part by your actual code that computes the output, as a SparkSQL dataframe
siniestros_elyvv_df = tabla # For this sample code, simply copy input to output

-------- cell #7

# Write recipe outputs
siniestros_elyvv = dataiku.Dataset("siniestros_elyvv")
display(siniestros_elyvv)

-------- cell #8

dkuspark.write_with_schema(siniestros_elyvv, siniestros_elyvv_df)

# this is the instruction that keeps running and never finishes.
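
For context, Spark evaluates transformations lazily: the join from cell #3 only actually runs when write_with_schema forces it, so the write step is where all the accumulated work happens. A minimal sketch (a hypothetical extra cell, using the same variables as above) to check whether the join itself is the bottleneck:

tabla.explain()                  # print the physical plan Spark will execute
print(tabla.limit(100).count())  # small action: if this also hangs, the join is the problem, not the write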

 

I would really appreciate any ideas. Thanks in advance.

HP


Operating system used: CentOS 7

AlexT
Dataiker

Hi @Cancun_Mx ,
Troubleshooting Spark code from a notebook will be very difficult.
I would suggest you try the same code in a PySpark recipe instead, and then review the job diagnostics or open a support ticket with the job diagnostics attached.
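
For example, the notebook cells above collapse into a single recipe script along these lines (a sketch with the same dataset names as in your code; running it as a recipe gives you a job whose diagnostics show where the Spark stages spend their time):

import dataiku
from dataiku import spark as dkuspark
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext.getOrCreate()
sqlContext = SQLContext(sc)

# Read recipe inputs
P3D_df = dkuspark.get_dataframe(sqlContext, dataiku.Dataset("P3D"))
siniestros_in_df = dkuspark.get_dataframe(sqlContext, dataiku.Dataset("VW_DL_AUTSINIESTROS"))

# Cast the join key to string on both sides
P3D_df = P3D_df.withColumn('NUMCOMPLETOCOTIZACION', P3D_df.NUMCOMPLETOCOTIZACION.cast('string'))
siniestros_in_df = siniestros_in_df.withColumn('NUMCOMPLETOCOTIZACION', siniestros_in_df.NUMCOMPLETOCOTIZACION.cast('string'))

# Left join and keep only the needed columns
tabla = P3D_df.join(siniestros_in_df, on=['NUMCOMPLETOCOTIZACION'], how='left')
tabla = tabla.select('FECHA_EMISION', 'NUMCOMPLETOCOTIZACION', 'OCURRIDO_NETO')

# Write recipe output; this action triggers execution of the whole plan
dkuspark.write_with_schema(dataiku.Dataset("siniestros_elyvv"), tabla)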

Thanks,
