Pyspark Code to Spark Dataframe


I am new to Dataiku. I have been going back and forth through the Dataiku documentation, especially the Dataiku PySpark code recipe pages, trying to make sense of it for my project, but I cannot find anything useful. I am looking for examples and syntax for tasks as simple as: how to convert a Dataiku dataset to a Spark dataframe, and how to connect to multiple Snowflake tables while working in a Python notebook in a Spark recipe. Where are the methods of the Dataiku library documented in detail: what is available, what they do, and how to use them? This class does not seem well explained, and it feels as if it interferes unnecessarily with the standard Spark libraries, which makes coding harder rather than easier. If anyone can point me to a better link covering the above, that would be appreciated.



(Topic title edited by moderator to be more descriptive.)

2 Replies
Dataiker Alumni

Hi @verleger and welcome to the Dataiku Community. The best way to receive a helpful response is to ensure that you post in the right area. In this case, your comment was moved as a question to the Using Dataiku discussion board. For more best practices, be sure to follow these guidelines: How do I ask a good question? 

In terms of your question, while you wait for a more complete response here are some resources you may find helpful if you haven't checked them out already:

I hope this helps!

Looking for more resources to help you use Dataiku effectively and upskill your knowledge? Check out these great resources: Dataiku Academy | Documentation | Knowledge Base

A reply answered your question? Mark as ‘Accepted Solution’ to help others like you!

Hi @verleger ,

Just to add a bit to the documentation Corey shared, there is a specific example of how to use PySpark in DSS here:

For the best performance, please make sure you enable the Spark native integration on your Snowflake connection:

To read a dataset you would use dkuspark; all of its available methods are documented here:
For example, here we load two Snowflake datasets:


# Import Dataiku APIs, including the PySpark layer
import dataiku
from dataiku import spark as dkuspark
# Import Spark APIs, both the base SparkContext and higher level SQLContext
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext()
sqlContext = SQLContext(sc)

# Wrap each input dataset and load it as a Spark dataframe
# (the dataset names are placeholders for your own dataset names)
dataset1 = dataiku.Dataset("name_of_the_first_dataset")
df1 = dkuspark.get_dataframe(sqlContext, dataset1)

dataset2 = dataiku.Dataset("name_of_the_second_dataset")
df2 = dkuspark.get_dataframe(sqlContext, dataset2)
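
Once both dataframes are loaded, you can apply standard PySpark transformations to them and write the result back to one of the recipe's output datasets with dkuspark.write_with_schema. A minimal sketch continuing the snippet above (the join column "customer_id" and the output dataset name are hypothetical placeholders; this only runs inside a DSS Spark recipe):

```python
# Join the two Snowflake-backed dataframes on a shared key
# ("customer_id" is a hypothetical placeholder column)
joined_df = df1.join(df2, on="customer_id", how="inner")

# Write the result to an output dataset of the recipe;
# write_with_schema also sets the dataset schema from the dataframe
output_dataset = dataiku.Dataset("name_of_the_output_dataset")
dkuspark.write_with_schema(output_dataset, joined_df)
```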



Additionally, you may want to consider the Snowpark integration:
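
To give a flavor of what Snowpark looks like, here is a minimal standalone sketch using the snowflake-snowpark-python package. The connection parameters and table name are hypothetical placeholders, and inside DSS you would normally obtain the session through your project's Snowflake connection rather than hard-coding credentials:

```python
from snowflake.snowpark import Session
from snowflake.snowpark.functions import col

# Hypothetical connection parameters; in DSS these would come from
# the Snowflake connection configured on the instance.
connection_parameters = {
    "account": "my_account",
    "user": "my_user",
    "password": "my_password",
    "warehouse": "my_warehouse",
    "database": "my_database",
    "schema": "my_schema",
}

session = Session.builder.configs(connection_parameters).create()

# Lazily reference a table; the filter is pushed down and
# executed inside Snowflake rather than on the client
df = session.table("MY_TABLE").filter(col("AMOUNT") > 100)
df.show()
```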