How to connect a pyspark instance to multiple projects?

nmadhu20 Neuron, Registered, Neuron 2022, Neuron 2023 Posts: 35 Neuron
edited July 2024 in General Discussion

Hi Team,

I need to connect with different projects in a loop and read required datasets using pyspark.

But once I have created the sqlContext, it maps to the current project and does not point to the required project, so 'dataiku.spark.get_dataframe()' throws an error. It works fine if I hardcode the name of the current project and any dataset in it.

Adding a sample code snippet and the error below for clarity.

Code:

import dataiku
from dataiku import spark as dkuspark
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext()
sqlContext = SQLContext(sc)

# prj_dict maps each project key to the dataset to read from it
for project in prj_dict:

    client = dataiku.api_client()

    # connect with the required project
    dataiku.set_default_project_key(project)

    # read the required dataset
    dataset_input = dataiku.Dataset(prj_dict[project])
    dat_input = dataiku.spark.get_dataframe(sqlContext, dataset_input)

Error :

Py4JJavaError: An error occurred while calling o339.getPyDataFrame.
: java.lang.Error: No variables context for project PROJECT
    at com.dataiku.dip.variables.ManualVariablesService.getContext(ManualVariablesService.java:30)
    at com.dataiku.dip.variables.VariablesUtils.expand(VariablesUtils.java:17)
    at com.dataiku.dip.datasets.fs.BlobLikeDatasetHandler.<init>(BlobLikeDatasetHandler.java:65)
    at com.dataiku.dip.datasets.fs.S3DatasetHandler.<init>(S3DatasetHandler.java:72)
    at com.dataiku.dip.datasets.fs.BuiltinFSDatasets$3.build(BuiltinFSDatasets.java:315)
    at com.dataiku.dip.input.DatasetHandlerFactory.build(DatasetHandlerFactory.java:54)
    at com.dataiku.dip.spark.StorageBackendsSupport$.isS3DatasetCompatible(StorageBackendsSupport.scala:81)
    at com.dataiku.dip.spark.FastPathHandler.get(FastPathHandler.scala:53)
    at com.dataiku.dip.spark.StdDataikuSparkContext.getAsSampledDataFrameInternal(StdDataikuSparkContext.scala:278)
    at com.dataiku.dip.spark.StdDataikuSparkContext.getAsSampledDataFrame(StdDataikuSparkContext.scala:155)
    at com.dataiku.dip.spark.StdDataikuSparkContext.getPyDataFrame(StdDataikuSparkContext.scala:521)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:748)

Answers

  • fchataigner2 Dataiker Posts: 355 Dataiker

    Hi,

    You'll need to expose the datasets of the other projects to the project where you're running this pyspark notebook, so that DSS knows which projects' info it needs to package for Spark (a minimal sketch of doing this via the API is included after this thread).

  • nmadhu20 Neuron, Registered, Neuron 2022, Neuron 2023 Posts: 35 Neuron

    Hey,

    Thanks for your reply.

    We can't expose the datasets in the Flow, as this project is supposed to be a central repository governing all projects and datasets, so the total number of datasets would become very large.
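
For illustration, a minimal sketch of the approach suggested above: expose each source dataset to the project running the code, then read it with an explicit project key instead of switching the default project. The prj_dict contents are placeholders, and the add_exposed_object call on DSSProjectSettings is an assumption about the installed dataiku API version (the same exposure can also be set up through the project settings UI).

import dataiku
from dataiku import spark as dkuspark
from pyspark import SparkContext
from pyspark.sql import SQLContext

client = dataiku.api_client()
current_project_key = dataiku.default_project_key()  # project running this code

# placeholder mapping: source project key -> dataset to read from it
prj_dict = {"PROJECT_A": "dataset_a", "PROJECT_B": "dataset_b"}

# Expose each source dataset to the current project.
# Assumes DSSProjectSettings.add_exposed_object is available in the installed
# dataiku API version; otherwise the exposure can be configured in the project UI.
for project_key, dataset_name in prj_dict.items():
    settings = client.get_project(project_key).get_settings()
    settings.add_exposed_object("DATASET", dataset_name, current_project_key)
    settings.save()

# With the datasets exposed, read each one by passing its source project key
# explicitly, without calling set_default_project_key.
sc = SparkContext()
sqlContext = SQLContext(sc)

for project_key, dataset_name in prj_dict.items():
    dataset_input = dataiku.Dataset(dataset_name, project_key=project_key)
    df_input = dkuspark.get_dataframe(sqlContext, dataset_input)
    print(project_key, df_input.count())

Since DSS packages the relevant projects' info when the Spark context starts, the exposure would need to be in place before the SparkContext is created.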
