Using Sedona in Dataiku

aalabdullatif
Level 1
I am trying to use Dataiku with Sedona as follows:

import com.dataiku.dss.spark._
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions._
import org.apache.sedona.sql.utils.SedonaSQLRegistrator
import org.apache.spark.sql.SparkSession

val sparkConf = DataikuSparkContext.buildSparkConf()

val jars_list = Array(
"""/home/dataiku/jars/geotools-wrapper-1.1.0-25.2.jar""",
"""/home/dataiku/jars/sedona-sql-2.4_2.11-1.2.0-incubating.jar""",
"""/home/dataiku/jars/sedona-core-2.4_2.11-1.2.0-incubating.jar""",
"""/home/dataiku/jars/jts-core-1.18.1.jar""")

val jar_string = jars_list.map("file://" + _).mkString(",")

sparkConf.remove("spark.repl.local.jars")
sparkConf.remove("spark.yarn.dist.jars")
sparkConf.remove("spark.yarn.secondary.jars")
sparkConf.set("spark.repl.local.jars",jar_string)
sparkConf.set("spark.yarn.dist.jars",jar_string)
sparkConf.set("spark.sql.extensions","org.apache.sedona.sql.SedonaSqlExtensions")
// Kryo must be the active serializer for the registrator below to take effect.
sparkConf.set("spark.serializer","org.apache.spark.serializer.KryoSerializer")
sparkConf.set("spark.kryo.registrator","org.apache.sedona.core.serde.SedonaKryoRegistrator")

val sparkContext = new SparkContext(sparkConf)
val sqlContext = new SQLContext(sparkContext)
val dkuContext = DataikuSparkContext.getContext(sparkContext)

 

// Read a Dataiku dataset as a Spark DataFrame
val sedonapoc = dkuContext.getDataFrame(sqlContext, "GeogToWGS84GeoKey5")

val spark = SparkSession.builder().getOrCreate()
SedonaSQLRegistrator.registerAll(spark)

var geotiffDF = spark.read.format("geotiff").load("path_to_file.tif")

geotiffDF = geotiffDF.select("image.*")
geotiffDF = geotiffDF.withColumn("data",explode(geotiffDF.col("data")))

geotiffDF.createOrReplaceTempView("geotiffDF")

val sql_test = spark.sql("""
select nBands from geotiffDF
""")

geotiffDF.show()

 

This code works just fine when processing small files; however, when I deal with larger files I get an out-of-memory error.
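For context, the kind of memory settings I could tune in the same SparkConf (before the SparkContext is created) look like this; the sizes are placeholders I have not validated against the cluster:

```scala
// Placeholder values -- adjust to the cluster's actual capacity.
// Note: spark.driver.memory only takes effect if set before the driver
// JVM starts, so it may be ignored in an already-running notebook kernel.
sparkConf.set("spark.driver.memory", "4g")
sparkConf.set("spark.executor.memory", "8g")
sparkConf.set("spark.executor.memoryOverhead", "2g")
// More, smaller partitions reduce per-task memory pressure.
sparkConf.set("spark.sql.shuffle.partitions", "400")
```

So far, raising these has not resolved the error on the larger files.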

 

Is it possible that using "SparkSession.builder().getOrCreate()" instead of "DataikuSparkContext" results in the job being executed in client mode rather than cluster mode? What are the consequences of running a Spark job using "DataikuSparkContext"?
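One way I thought of to check which mode the job actually runs in is to inspect the resolved master and deploy mode on the running context (a sketch, assuming the sparkContext above was created successfully):

```scala
// Inspect how the job was actually submitted.
println(s"master      = ${sparkContext.master}")
println(s"deploy mode = ${sparkContext.deployMode}")  // "client" or "cluster"
```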

 
