PySpark Code to Spark DataFrame

verleger · August 2022

I am new to Dataiku. I have been going back and forth through the Dataiku documentation, especially the parts on the PySpark code recipe that I need for my project, but I cannot find anything useful. I am missing examples and syntax for things as simple as: how to convert a Dataiku dataset to a Spark DataFrame; how to connect to multiple Snowflake tables while working in the Python notebook of a Spark recipe; and what methods the Dataiku library offers, what they do, and how to use them. The class does not seem to be documented in any detail, and it feels as if it interferes unnecessarily with the standard Spark libraries, which makes coding harder rather than easier. If anyone can point me to a better link covering the above, that would be appreciated.

Verleger

(Topic title edited by moderator to be more descriptive.)

CoreyS · August 2022

Hi @verleger, and welcome to the Dataiku Community. The best way to receive a helpful response is to post in the right area. In this case, your comment was moved as a question to the Using Dataiku discussion board. For more best practices, be sure to follow the community guidelines.

As for your question, while you wait for a more complete response, here are some resources you may find helpful if you haven't checked them out already:

PySpark recipes (Documentation)
DSS and Spark (Documentation)
Python API interaction with PySpark (Documentation)
Using PySpark in Dataiku (Knowledge Base)

I hope this helps!

Alexandru · August 2022

Hi @verleger,

Just to add a bit to the documentation Corey shared, there is a specific example of how to use PySpark in DSS here:

https://doc.dataiku.com/dss/latest/code_recipes/pyspark.html#anatomy-of-a-basic-pyspark-recipe

To ensure the best performance, please make sure you enable the Spark native integration on your Snowflake connection:

https://doc.dataiku.com/dss/latest/connecting/sql/snowflake.html#spark-native-integration

To read a dataset, you would use dkuspark; its available methods are documented in the Dataiku Python API documentation.
For example, here we load two Snowflake datasets:

# Import Dataiku APIs, including the PySpark layer
import dataiku
from dataiku import spark as dkuspark
# Import Spark APIs, both the base SparkContext and higher level SQLContext
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext()
sqlContext = SQLContext(sc)

# Each input is referenced by its own DSS dataset name (placeholders here)
dataset1 = dataiku.Dataset("name_of_the_first_dataset")
df1 = dkuspark.get_dataframe(sqlContext, dataset1)

dataset2 = dataiku.Dataset("name_of_the_second_dataset")
df2 = dkuspark.get_dataframe(sqlContext, dataset2)
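
If you need more than a couple of datasets, the same two calls can be wrapped in a small helper. This is only a sketch: load_dataframes is a hypothetical name, not part of the Dataiku API, and it relies on nothing beyond dataiku.Dataset and dkuspark.get_dataframe as used in the example above.

```python
def load_dataframes(sql_context, dataset_names):
    """Return a dict mapping each DSS dataset name to its Spark DataFrame.

    Hypothetical helper (not part of the Dataiku API): it only repeats the
    dataiku.Dataset + dkuspark.get_dataframe pattern for every name given.
    """
    # Imported inside the function so the helper can be defined even where
    # the dataiku package is not installed; calling it still requires DSS.
    import dataiku
    from dataiku import spark as dkuspark

    return {
        name: dkuspark.get_dataframe(sql_context, dataiku.Dataset(name))
        for name in dataset_names
    }
```

Inside a PySpark recipe or notebook you would call it with the sqlContext created earlier, for instance dfs = load_dataframes(sqlContext, ["name_of_the_first_dataset", "name_of_the_second_dataset"]), and then use dfs["name_of_the_first_dataset"] like any other Spark DataFrame.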

Additionally, you may want to consider the Snowpark integration: https://doc.dataiku.com/dss/latest/connecting/sql/snowflake.html#snowpark-integration
