How to connect pyspark instance to multiple projects?

nmadhu20

Hi Team,

I need to connect to different projects in a loop and read the required datasets using PySpark.

But once I have created the sqlContext, it is bound to the current project and does not point to the required project, so 'dataiku.spark.get_dataframe()' throws an error. It works fine if I hardcode the name of the current project and any dataset in it.

Adding a sample code snippet and the error below for clarity.

Code:

import dataiku
from dataiku import spark as dkuspark
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext()
sqlContext = SQLContext(sc)

client = dataiku.api_client()

# prj_dict (defined elsewhere) maps a project key to the dataset to read from it
for project in prj_dict:

    # connect with the required project
    dataiku.set_default_project_key(project)

    # read the required dataset
    dataset_input = dataiku.Dataset(prj_dict[project])
    dat_input = dkuspark.get_dataframe(sqlContext, dataset_input)

 

Error:

Py4JJavaError: An error occurred while calling o339.getPyDataFrame.
: java.lang.Error: No variables context for project PROJECT
	at com.dataiku.dip.variables.ManualVariablesService.getContext(ManualVariablesService.java:30)
	at com.dataiku.dip.variables.VariablesUtils.expand(VariablesUtils.java:17)
	at com.dataiku.dip.datasets.fs.BlobLikeDatasetHandler.<init>(BlobLikeDatasetHandler.java:65)
	at com.dataiku.dip.datasets.fs.S3DatasetHandler.<init>(S3DatasetHandler.java:72)
	at com.dataiku.dip.datasets.fs.BuiltinFSDatasets$3.build(BuiltinFSDatasets.java:315)
	at com.dataiku.dip.input.DatasetHandlerFactory.build(DatasetHandlerFactory.java:54)
	at com.dataiku.dip.spark.StorageBackendsSupport$.isS3DatasetCompatible(StorageBackendsSupport.scala:81)
	at com.dataiku.dip.spark.FastPathHandler.get(FastPathHandler.scala:53)
	at com.dataiku.dip.spark.StdDataikuSparkContext.getAsSampledDataFrameInternal(StdDataikuSparkContext.scala:278)
	at com.dataiku.dip.spark.StdDataikuSparkContext.getAsSampledDataFrame(StdDataikuSparkContext.scala:155)
	at com.dataiku.dip.spark.StdDataikuSparkContext.getPyDataFrame(StdDataikuSparkContext.scala:521)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)

 

fchataigner2
Dataiker

Hi,

You'll need to expose the datasets of the other projects to the project where you're running this PySpark notebook, so that DSS knows which projects' info it needs to package for Spark.
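
For illustration, once a dataset from another project has been exposed to the project running the notebook, it can be read through its fully-qualified "PROJECTKEY.dataset_name" reference with a single SQLContext. A minimal sketch (the project and dataset names below are hypothetical):

import dataiku
from dataiku import spark as dkuspark
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext.getOrCreate()
sqlContext = SQLContext(sc)

# "OTHER_PROJECT" must have exposed the dataset "customers" to the current project
dataset_input = dataiku.Dataset("OTHER_PROJECT.customers")
df = dkuspark.get_dataframe(sqlContext, dataset_input)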

nmadhu20
Author

Hey,

Thanks for your reply.

We can't expose the datasets in the Flow, as this project is supposed to be a central repository governing all projects and datasets, so the total number of exposed datasets would become very large.
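
In case it helps, the exposure described in the previous reply could also be scripted through the public API rather than maintained by hand, which may keep a large number of datasets manageable. A rough sketch, assuming the DSSProjectSettings.add_exposed_object helper is available in your dataikuapi version and reusing the prj_dict mapping from the original snippet:

import dataiku

client = dataiku.api_client()
central_key = dataiku.default_project_key()  # the project running this notebook

for project in prj_dict:
    source_project = client.get_project(project)
    settings = source_project.get_settings()
    # add_exposed_object is assumed here; check the dataikuapi reference for your
    # DSS version, otherwise the "exposedObjects" entry of settings.get_raw()
    # can be edited directly before saving
    settings.add_exposed_object("DATASET", prj_dict[project], central_key)
    settings.save()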
