Hi Team,
I need to connect to different projects in a loop and read the required datasets using PySpark.
However, once I have created the sqlContext, it stays mapped to the current project and does not point to the required project, so 'dataiku.spark.get_dataframe()' throws an error. It works fine if I hardcode the name of the current project and any dataset in it.
I've added a sample code snippet and the error below for clarity.
Code:
import dataiku
from dataiku import spark as dkuspark
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext()
sqlContext = SQLContext(sc)

# prj_dict maps each project key to the dataset to read from that project
for project in prj_dict:
    client = dataiku.api_client()
    # connect with required project
    dataiku.set_default_project_key(project)
    # read the required dataset
    dataset_input = dataiku.Dataset(prj_dict[project])
    dat_input = dataiku.spark.get_dataframe(sqlContext, dataset_input)
Error:
Py4JJavaError: An error occurred while calling o339.getPyDataFrame.
: java.lang.Error: No variables context for project PROJECT
at com.dataiku.dip.variables.ManualVariablesService.getContext(ManualVariablesService.java:30)
at com.dataiku.dip.variables.VariablesUtils.expand(VariablesUtils.java:17)
at com.dataiku.dip.datasets.fs.BlobLikeDatasetHandler.<init>(BlobLikeDatasetHandler.java:65)
at com.dataiku.dip.datasets.fs.S3DatasetHandler.<init>(S3DatasetHandler.java:72)
at com.dataiku.dip.datasets.fs.BuiltinFSDatasets$3.build(BuiltinFSDatasets.java:315)
at com.dataiku.dip.input.DatasetHandlerFactory.build(DatasetHandlerFactory.java:54)
at com.dataiku.dip.spark.StorageBackendsSupport$.isS3DatasetCompatible(StorageBackendsSupport.scala:81)
at com.dataiku.dip.spark.FastPathHandler.get(FastPathHandler.scala:53)
at com.dataiku.dip.spark.StdDataikuSparkContext.getAsSampledDataFrameInternal(StdDataikuSparkContext.scala:278)
at com.dataiku.dip.spark.StdDataikuSparkContext.getAsSampledDataFrame(StdDataikuSparkContext.scala:155)
at com.dataiku.dip.spark.StdDataikuSparkContext.getPyDataFrame(StdDataikuSparkContext.scala:521)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
Hi,
You'll need to expose the datasets of the other projects to the project where you're running this PySpark notebook, so that DSS knows which projects' information it needs to package for Spark.
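Once the datasets are exposed, the loop could look roughly like the sketch below. This is only an illustration, not a confirmed fix: it assumes every dataset in the mapping has already been exposed to the current project, and it passes the source project key to dataiku.Dataset instead of switching the default project. The helper names qualified_names and read_exposed_datasets and the sample project keys are made up for the example.

```python
def qualified_names(prj_dict):
    """Return 'PROJECTKEY.dataset_name' strings for a {project_key: dataset_name} mapping."""
    return ["{}.{}".format(project, dataset)
            for project, dataset in sorted(prj_dict.items())]


def read_exposed_datasets(sql_context, prj_dict):
    """Read one Spark DataFrame per (project, dataset) pair.

    Assumption: each dataset is already exposed to the project that
    runs this notebook; otherwise the read is expected to fail.
    """
    # Imported here so the helper above can be used outside DSS as well.
    import dataiku
    from dataiku import spark as dkuspark

    frames = {}
    for project_key, dataset_name in prj_dict.items():
        # Point at the dataset in its source project explicitly,
        # rather than calling set_default_project_key() in the loop.
        dataset = dataiku.Dataset(dataset_name, project_key=project_key)
        frames[(project_key, dataset_name)] = dkuspark.get_dataframe(sql_context, dataset)
    return frames


# Hypothetical mapping, for illustration only.
example = {"SALES": "orders", "HR": "employees"}
names = qualified_names(example)
```

Inside the notebook, read_exposed_datasets(sqlContext, prj_dict) would then return a dictionary of DataFrames keyed by (project key, dataset name).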
Hey,
Thanks for your reply.
We can't expose the datasets in the flow, because this project is supposed to be a central repository governing all projects and datasets, so the total number of datasets to expose would be very large.