I am new to Dataiku. I have been going back and forth through the Dataiku documentation trying to make sense of it, especially the parts covering the PySpark code recipe for my project, but I can't find anything useful. I'm looking for examples and syntax for things as simple as: How do I convert a Dataiku dataset to a Spark dataframe? How do I connect to multiple Snowflake tables while working in a Python notebook inside a Spark recipe? Where are the methods of the Dataiku library documented in detail, with what is available, what each method does, and how to use it? This class does not seem well explained, and it feels like it interferes unnecessarily with the standard Spark libraries, which makes coding harder rather than easier. If anyone can point me to a better link covering any of the above, that would be appreciated.
(Topic title edited by moderator to be more descriptive.)
Hi @verleger and welcome to the Dataiku Community. The best way to receive a helpful response is to ensure that you post in the right area. In this case, your comment was moved as a question to the Using Dataiku discussion board. For more best practices, be sure to follow these guidelines: How do I ask a good question?
In terms of your question, while you wait for a more complete response, here are some resources you may find helpful if you haven't checked them out already:
I hope this helps!
Hi @verleger ,
Just to add a bit to the documentation Corey shared, there is a specific example of how to use PySpark in DSS here:
To ensure the best performance, please make sure you enable Spark native integration on your Snowflake connection:
To read a dataset you would use dkuspark; all of its available methods are documented here:
For example, here we load two Snowflake datasets:
# Import Dataiku APIs, including the PySpark layer
import dataiku
from dataiku import spark as dkuspark

# Import Spark APIs, both the base SparkContext and higher level SQLContext
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext()
sqlContext = SQLContext(sc)

dataset1 = dataiku.Dataset("name_of_the_first_dataset")
df1 = dkuspark.get_dataframe(sqlContext, dataset1)

dataset2 = dataiku.Dataset("name_of_the_second_dataset")
df2 = dkuspark.get_dataframe(sqlContext, dataset2)
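Once the datasets are loaded this way, df1 and df2 are ordinary Spark dataframes, so all standard PySpark operations apply; the Dataiku layer is only used at the read and write boundaries. As a sketch of a complete recipe, here is a join of the two tables followed by a write back to a Dataiku output dataset. The column name "CUSTOMER_ID" and the output dataset name "joined_output" are illustrative assumptions, not names from your project:

```python
# Sketch of a full PySpark recipe in DSS (runs inside a DSS Spark recipe).
import dataiku
from dataiku import spark as dkuspark
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext()
sqlContext = SQLContext(sc)

# Load the two Snowflake-backed datasets as Spark dataframes.
df1 = dkuspark.get_dataframe(sqlContext, dataiku.Dataset("name_of_the_first_dataset"))
df2 = dkuspark.get_dataframe(sqlContext, dataiku.Dataset("name_of_the_second_dataset"))

# Standard PySpark from here on; "CUSTOMER_ID" is an assumed join key.
joined = df1.join(df2, on="CUSTOMER_ID", how="inner")

# Write back through dkuspark; the dataset schema is derived from the
# Spark dataframe. "joined_output" is an assumed output dataset that
# must already exist in the Flow.
dkuspark.write_with_schema(dataiku.Dataset("joined_output"), joined)
```

This only runs inside a DSS Spark recipe or notebook, since the dataiku module and the Spark context are provided by the DSS runtime.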
Additionally, you may want to consider the Snowpark integration: https://doc.dataiku.com/dss/latest/connecting/sql/snowflake.html#snowpark-integration
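With Snowpark, the computation is pushed down into Snowflake instead of running in a Spark cluster, which can be simpler when all of your tables live in Snowflake. A minimal sketch based on the integration page linked above (treat the exact calls as assumptions and verify them against the documentation for your DSS version):

```python
# Sketch of reading a Snowflake dataset as a Snowpark dataframe in DSS.
import dataiku
from dataiku.snowpark import DkuSnowpark

dku_snowpark = DkuSnowpark()

# "my_snowflake_dataset" is an illustrative dataset name.
dataset = dataiku.Dataset("my_snowflake_dataset")
snowpark_df = dku_snowpark.get_dataframe(dataset)

# snowpark_df is a regular Snowpark dataframe; operations on it are
# translated to SQL and executed inside Snowflake.
```

Like the PySpark example, this requires the DSS runtime and a Snowflake connection with the Snowpark integration enabled.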