Spark module missing
I am trying to run recipes outside of Dataiku using the internal `dataiku` client. However, I am not able to import the spark module; the following line throws an ImportError:
from dataiku import spark as dkuspark
I have already installed the dataiku client, and other functionality (e.g. loading projects) works as intended. Looking at the source code of the dataiku client, it seems that the spark module is missing.
Am I doing something wrong? Should the spark module be downloaded separately?
EDIT: I guess [this](https://doc.dataiku.com/dss/latest/python-api/pyspark.html) is the module that I am missing.
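For reference, my recipes use that module in the standard way shown in the documentation, roughly like this (the dataset name is illustrative):

```python
import dataiku
from dataiku import spark as dkuspark
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext.getOrCreate()
sqlContext = SQLContext(sc)

# Read a DSS dataset as a Spark data frame ("mydataset" is a placeholder).
mydataset = dataiku.Dataset("mydataset")
df = dkuspark.get_dataframe(sqlContext, mydataset)
```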
Answers
-
Hi,
Unfortunately, it is not possible to execute Spark recipes outside of DSS. Working with Spark requires a very tight integration between "where the recipe runs" and the cluster, which is not possible to externalize.
You'll need to save your recipe back to DSS and run it from there.
-
Thanks for the rapid response! But I am sorry to hear that. I am trying to load data into my test pipeline (which runs on Azure), and I would like the integration with Dataiku to be as tight as possible, so that the test environment mimics the production environment as closely as possible.
Is it possible to get the source code for the module somewhere? I might be able to figure out a solution myself, but it would be easier if I knew what the module was doing in the first place.
The last resort would be to load the raw data myself from the Azure storage blob.
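A minimal sketch of that fallback, assuming the data sits as Parquet in a blob container and the hadoop-azure connector is on the classpath (the account, container, path, and key are all placeholders):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Raw blob fallback").getOrCreate()

# Authenticate against the storage account (name and key are placeholders).
spark.conf.set(
    "fs.azure.account.key.myaccount.blob.core.windows.net",
    "<storage-account-key>")

# Read the raw data directly, bypassing Dataiku entirely.
df = spark.read.parquet(
    "wasbs://mycontainer@myaccount.blob.core.windows.net/path/to/data")
```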
EDIT: I have now implemented a function which seems to work. It can probably be improved, but I figured I would share it here as inspiration for others facing the same issue.
import os

import dataiku
from pyspark.sql import SparkSession


def load_spark_dataframe(name, dku_project=None, partitions=None, spark_session=None):
    # Create a dummy Spark session unless one is provided.
    if spark_session is None:
        spark_session = SparkSession \
            .builder \
            .appName("Python Spark Mock Data Binding") \
            .config("spark.some.config.option", "some-value") \
            .getOrCreate()
    # Fall back to an environment variable if the project key is not provided.
    dku_project = dku_project if dku_project is not None else os.environ['DKU_PROJECT']
    # Get the project through the public API client.
    client = dataiku.api_client()
    project = client.get_project(dku_project)
    # Get the dataset and turn it into a dataiku.Dataset ("core" dataset).
    dataset = project.get_dataset(name)
    core_dataset = dataset.get_as_core_dataset()
    # Restrict the read to the requested partitions, if any.
    if partitions is not None:
        core_dataset.add_read_partitions(partitions)
    # Read into pandas, then convert into a Spark data frame.
    df = core_dataset.get_dataframe()
    return spark_session.createDataFrame(df)
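It can then be called like this (the project key and dataset name are placeholders):

```python
df_spark = load_spark_dataframe("mydataset", dku_project="MYPROJECT")
df_spark.show()
```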