Using Sedona in Dataiku

aalabdullatif
Level 1
I am trying to use Dataiku with Sedona as follows:

import com.dataiku.dss.spark._
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions._
import org.apache.sedona.sql.utils.SedonaSQLRegistrator
import org.apache.spark.sql.SparkSession

val sparkConf = DataikuSparkContext.buildSparkConf()

val jars_list = Array(
"""/home/dataiku/jars/geotools-wrapper-1.1.0-25.2.jar""",
"""/home/dataiku/jars/sedona-sql-2.4_2.11-1.2.0-incubating.jar""",
"""/home/dataiku/jars/sedona-core-2.4_2.11-1.2.0-incubating.jar""",
"""/home/dataiku/jars/jts-core-1.18.1.jar""")

val jar_string = jars_list.map("file://" + _).mkString(",")

sparkConf.remove("spark.repl.local.jars")
sparkConf.remove("spark.yarn.dist.jars")
sparkConf.remove("spark.yarn.secondary.jars")
sparkConf.set("spark.repl.local.jars",jar_string)
sparkConf.set("spark.yarn.dist.jars",jar_string)
sparkConf.set("spark.sql.extensions","org.apache.sedona.sql.SedonaSqlExtensions")
// Kryo must be the active serializer for the registrator below to take effect.
sparkConf.set("spark.serializer","org.apache.spark.serializer.KryoSerializer")
sparkConf.set("spark.kryo.registrator","org.apache.sedona.core.serde.SedonaKryoRegistrator")

val sparkContext = new SparkContext(sparkConf)
val sqlContext = new SQLContext(sparkContext)
val dkuContext = DataikuSparkContext.getContext(sparkContext)

 

// Read a Dataiku dataset as a Spark DataFrame
val sedonapoc = dkuContext.getDataFrame(sqlContext, "GeogToWGS84GeoKey5")

val spark = SparkSession.builder().getOrCreate()
SedonaSQLRegistrator.registerAll(spark)

var geotiffDF = spark.read.format("geotiff").load("path_to_file.tif")

geotiffDF = geotiffDF.select("image.*")
geotiffDF = geotiffDF.withColumn("data",explode(geotiffDF.col("data")))

geotiffDF.createOrReplaceTempView("geotiffDF")

val sql_test = spark.sql("""
select nBands from geotiffDF
""")

geotiffDF.show()

 

This code works just fine when processing small files; however, when I deal with larger files I get an out-of-memory error.
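For context, the kind of memory settings I could tune in the same SparkConf (before the SparkContext is created) look like this; the sizes are placeholders I have not validated against the cluster:

```scala
// Placeholder values -- adjust to the cluster's actual capacity.
// Note: spark.driver.memory only takes effect if set before the driver
// JVM starts, so it may be ignored in an already-running notebook kernel.
sparkConf.set("spark.driver.memory", "4g")
sparkConf.set("spark.executor.memory", "8g")
sparkConf.set("spark.executor.memoryOverhead", "2g")
// More, smaller partitions reduce per-task memory pressure.
sparkConf.set("spark.sql.shuffle.partitions", "400")
```

So far, raising these has not resolved the error on the larger files.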

 

Is it possible that using "SparkSession.builder().getOrCreate()" instead of "DataikuSparkContext" results in the job being executed in client mode rather than cluster mode? What are the consequences of running a Spark job using "DataikuSparkContext"?
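One way I thought of to check which mode the job actually runs in is to inspect the resolved master and deploy mode on the running context (a sketch, assuming the sparkContext above was created successfully):

```scala
// Inspect how the job was actually submitted.
println(s"master      = ${sparkContext.master}")
println(s"deploy mode = ${sparkContext.deployMode}")  // "client" or "cluster"
```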

 
