How to connect pyspark instance to multiple projects?

nmadhu20

Hi Team,

I need to connect to different projects in a loop and read the required datasets using PySpark.

But once I have created the sqlContext, it is bound to the current project and does not point to the required project, so 'dataiku.spark.get_dataframe()' throws an error. It works fine if I hardcode the name of the current project and any dataset in it.

Adding a sample code snippet and the error below for clarity.

Code:

import dataiku
from dataiku import spark as dkuspark
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext()
sqlContext = SQLContext(sc)

client = dataiku.api_client()

# prj_dict (defined elsewhere) maps a project key to the dataset to read from it
for project in prj_dict:

    # connect with the required project
    dataiku.set_default_project_key(project)

    # read the required dataset
    dataset_input = dataiku.Dataset(prj_dict[project])
    dat_input = dkuspark.get_dataframe(sqlContext, dataset_input)

 

Error:

Py4JJavaError: An error occurred while calling o339.getPyDataFrame.
: java.lang.Error: No variables context for project PROJECT
	at com.dataiku.dip.variables.ManualVariablesService.getContext(ManualVariablesService.java:30)
	at com.dataiku.dip.variables.VariablesUtils.expand(VariablesUtils.java:17)
	at com.dataiku.dip.datasets.fs.BlobLikeDatasetHandler.<init>(BlobLikeDatasetHandler.java:65)
	at com.dataiku.dip.datasets.fs.S3DatasetHandler.<init>(S3DatasetHandler.java:72)
	at com.dataiku.dip.datasets.fs.BuiltinFSDatasets$3.build(BuiltinFSDatasets.java:315)
	at com.dataiku.dip.input.DatasetHandlerFactory.build(DatasetHandlerFactory.java:54)
	at com.dataiku.dip.spark.StorageBackendsSupport$.isS3DatasetCompatible(StorageBackendsSupport.scala:81)
	at com.dataiku.dip.spark.FastPathHandler.get(FastPathHandler.scala:53)
	at com.dataiku.dip.spark.StdDataikuSparkContext.getAsSampledDataFrameInternal(StdDataikuSparkContext.scala:278)
	at com.dataiku.dip.spark.StdDataikuSparkContext.getAsSampledDataFrame(StdDataikuSparkContext.scala:155)
	at com.dataiku.dip.spark.StdDataikuSparkContext.getPyDataFrame(StdDataikuSparkContext.scala:521)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)

 

fchataigner2
Dataiker

Hi,

You'll need to expose the datasets of the other projects to the project where you're running this PySpark notebook, so that DSS knows which projects' info it needs to package for Spark.
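
For illustration, once a dataset from another project has been exposed to the project running the notebook, it can be read through its fully-qualified "PROJECTKEY.dataset_name" reference with a single SQLContext. A minimal sketch (the project and dataset names below are hypothetical):

import dataiku
from dataiku import spark as dkuspark
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext.getOrCreate()
sqlContext = SQLContext(sc)

# "OTHER_PROJECT" must have exposed the dataset "customers" to the current project
dataset_input = dataiku.Dataset("OTHER_PROJECT.customers")
df = dkuspark.get_dataframe(sqlContext, dataset_input)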

nmadhu20
Author

Hey,

Thanks for your reply.

We can't expose the datasets in the Flow, as this project is supposed to be a central repository governing all projects and datasets, so the total number of exposed datasets would become very large.
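
In case it helps, the exposure described in the previous reply could also be scripted through the public API rather than maintained by hand, which may keep a large number of datasets manageable. A rough sketch, assuming the DSSProjectSettings.add_exposed_object helper is available in your dataikuapi version and reusing the prj_dict mapping from the original snippet:

import dataiku

client = dataiku.api_client()
central_key = dataiku.default_project_key()  # the project running this notebook

for project in prj_dict:
    source_project = client.get_project(project)
    settings = source_project.get_settings()
    # add_exposed_object is assumed here; check the dataikuapi reference for your
    # DSS version, otherwise the "exposedObjects" entry of settings.get_raw()
    # can be edited directly before saving
    settings.add_exposed_object("DATASET", prj_dict[project], central_key)
    settings.save()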
