PySpark Code to Spark Dataframe
I am new to Dataiku. I have been going up and down all over the Dataiku documentation, especially the PySpark code recipe pages, trying to make sense of it for my project, but I cannot find anything useful! For example: syntax as simple as how to convert a Dataiku dataset to a Spark dataframe; how to connect to multiple Snowflake tables while working in a Python notebook within a Spark recipe; details of the methods the Dataiku library offers, what they do, and how to use them. This class does not seem to be explained well, and it feels as if it interferes unnecessarily with the standard Spark libraries, which makes coding harder rather than easier! If anyone can point me to a better link with more information about all of the above, that would be appreciated.
Verleger
(Topic title edited by moderator to be more descriptive.)
Answers
-
CoreyS, Dataiker Alumni
Hi @verleger, and welcome to the Dataiku Community. The best way to receive a helpful response is to ensure that you post in the right area; in this case, your comment was moved as a question to the Using Dataiku discussion board. For more best practices, be sure to follow the posting guidelines.

In terms of your question, while you wait for a more complete response, here are some resources you may find helpful if you haven't checked them out already:
- PySpark recipes (Documentation)
- DSS and Spark (Documentation)
- Python API interaction with PySpark (Documentation)
- Using PySpark in Dataiku (Knowledge Base)
I hope this helps!
-
Alexandru, Dataiker
Hi @verleger,

Just to add to the documentation Corey shared, there is a specific example of how to use PySpark in DSS here:
https://doc.dataiku.com/dss/latest/code_recipes/pyspark.html#anatomy-of-a-basic-pyspark-recipe
To ensure the best performance, please make sure you enable Spark native integration on your Snowflake connection:
https://doc.dataiku.com/dss/latest/connecting/sql/snowflake.html#spark-native-integration
To read a dataset, you use dkuspark; all of its available methods are documented in the Python API reference.

For example, here we would load two Snowflake datasets:

# Import Dataiku APIs, including the PySpark layer
import dataiku
from dataiku import spark as dkuspark

# Import Spark APIs, both the base SparkContext and the higher-level SQLContext
from pyspark import SparkContext
from pyspark.sql import SQLContext

# getOrCreate avoids an error if a context already exists (e.g. in a notebook)
sc = SparkContext.getOrCreate()
sqlContext = SQLContext(sc)

# Load the first Snowflake dataset as a Spark dataframe
dataset1 = dataiku.Dataset("name_of_the_first_dataset")
df1 = dkuspark.get_dataframe(sqlContext, dataset1)

# Load the second Snowflake dataset as a Spark dataframe
dataset2 = dataiku.Dataset("name_of_the_second_dataset")
df2 = dkuspark.get_dataframe(sqlContext, dataset2)
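The anatomy example linked above also shows the reverse direction: writing a Spark dataframe back to a Dataiku output dataset with dkuspark.write_with_schema. A minimal follow-on to the snippet above (the output dataset name is a placeholder for your own dataset):

# Write the loaded dataframe to an output dataset of the recipe
output_dataset = dataiku.Dataset("name_of_the_output_dataset")
dkuspark.write_with_schema(output_dataset, df1)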
Additionally, you may want to consider the Snowpark integration: https://doc.dataiku.com/dss/latest/connecting/sql/snowflake.html#snowpark-integration
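If you do go the Snowpark route, the pattern is similar. Here is a minimal sketch, assuming the Snowpark integration is enabled on your Snowflake connection and the dataiku.snowpark module is available in your code environment (dataset names are placeholders):

import dataiku
from dataiku.snowpark import DkuSnowpark

dku_snowpark = DkuSnowpark()

# Get a Snowpark dataframe backed by the Snowflake table behind a Dataiku dataset
dataset = dataiku.Dataset("name_of_the_dataset")
snow_df = dku_snowpark.get_dataframe(dataset)

# Write a Snowpark dataframe back to an output dataset
output_dataset = dataiku.Dataset("name_of_the_output_dataset")
dku_snowpark.write_with_schema(output_dataset, snow_df)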