Running PySpark locally with PyCharm/VS Code and a PySpark recipe
I am able to run a Python recipe; I installed the dataiku package 5.1.0 as given in the docs. All is well there.
Now I wanted to run a PySpark recipe, and this is what happens:
from dataiku import spark as dkuspark
ImportError: cannot import name 'spark'
I checked the download and I don't see the spark module anywhere.
Best Answers
-
I had this problem too. Luckily you can cheat it by finding the dataiku python directory inside the DSS folder on whatever server it's installed on (assuming you have access). Once you've pip-installed their package, you'll have to manually copy the folders across into your local Python install.
This will let VSCode or PyCharm or whatever IDE you choose recognise the Spark bindings at the very least.
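For reference, here's a rough sketch of that copy step in Python; the DSS-side path is a placeholder (you can just as well copy the folder by hand over scp):

import shutil
import site

# Placeholder paths: adjust to where DSS is installed on the server and to your local environment.
dss_dataiku_pkg = "/path/to/dss/python/dataiku"       # the dataiku package shipped with DSS
local_site_packages = site.getsitepackages()[0]       # your local Python's site-packages

# Copy the whole package (it contains the spark bindings) over the pip-installed client.
# dirs_exist_ok requires Python 3.8+.
shutil.copytree(dss_dataiku_pkg, local_site_packages + "/dataiku", dirs_exist_ok=True)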
-
Yeah, if you want to be able to run stuff using their bindings, then you'll have to create a mock dataiku service which you'll inject at runtime.
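A minimal sketch of such a mock, assuming you just want code written against dataiku.Dataset to run locally against flat files (all names and paths below are made up):

import sys
import types
import pandas as pd

# Build a fake "dataiku" module and register it before anything imports the real one.
fake_dataiku = types.ModuleType("dataiku")

class Dataset:
    def __init__(self, name):
        self.full_name = "MY_PROJECT." + name          # placeholder project key

    def get_dataframe(self):
        # Read a local file instead of a DSS-managed dataset (path is an assumption).
        return pd.read_csv("/local/data/" + self.full_name + ".csv")

fake_dataiku.Dataset = Dataset
sys.modules["dataiku"] = fake_dataiku                  # recipe code can now "import dataiku"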
Failing that, you can just ignore their Spark API calls (dkuspark) and do native parquet reads using the Datasets API. Probably not a bad idea for all things Spark, considering how shoddily their code is written (I took a peek). You could try starting with the code below:

import dataiku
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Base path where DSS writes the dataset parquet files (adjust to your setup)
hdfs_base = "/path/to/spark/parquets/"

def get_path(dataset):
    return hdfs_base + dataset.full_name.replace(".", "/")

my_dataset = dataiku.Dataset("my_dataset")
dataset_df = spark.read.parquet(get_path(my_dataset))
I often have to do this anyway, because the dkuspark API fails with certain data types, or handles partitioning incorrectly.
Answers
-
Hi,
You would need to install the pyspark package in your local Python. However, we would strongly advise against this:
* You would not get a properly working and configured Spark
* The "dataiku.spark" API which provides transparent access to datasets relies on a lot of machinery that is not in Python and that can only work in DSS.
In other words, for PySpark recipes, you should only do the editing locally in your IDE and run the recipe directly in DSS.
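For context, this is roughly what a DSS-generated PySpark recipe looks like (dataset names are placeholders); the dkuspark calls only resolve when the job is launched by DSS, which is why running it from a plain local interpreter fails:

import dataiku
from dataiku import spark as dkuspark
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext.getOrCreate()
sqlContext = SQLContext(sc)

# Read a DSS dataset as a Spark DataFrame (only works inside a DSS-run job)
input_ds = dataiku.Dataset("my_input_dataset")
df = dkuspark.get_dataframe(sqlContext, input_ds)

# Write the result back to a DSS-managed dataset
output_ds = dataiku.Dataset("my_output_dataset")
dkuspark.write_with_schema(output_ds, df)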
-
But ultimately my code will run on a Spark cluster like Databricks or EMR or something else, not on DSS.
How will that pan out? Am I locked into the Dataiku runtime for running PySpark, then?
I want to leverage visual PySpark recipes to develop the flow:
flow -> ML pipelines -> any Spark cluster
-
I have installed the dataiku client 5.1.0 (downloaded from our server).
- I see pandasutils.py present, so "from dataiku import pandasutils as pdu" works in a Python recipe
- I don't see a spark file anywhere, even after searching, so "from dataiku import spark as dkuspark" will never work
Or should I be getting a more recent client version?
Our Dataiku server was recently upgraded from 5 to 6, so I was expecting a version 6 client, but got 5.1.0 when I used this dataiku-package to get the package.
I am using VS Code with the Dataiku plugin.
-
Wow... it worked. I just saw their spark binding inside that folder, copied everything from there to Lib/site-packages, and it works.
Thanks!
-
Well, it was able to import, but the Spark context still fails to create.
I added a few Spark jars etc. to the spark-submit command... but to no avail.
I'm now stuck at 'No ticket in environment, cannot continue', but at least I moved forward.
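In case it helps: 'No ticket in environment' usually means the dataiku package can't authenticate; inside a DSS-run job that ticket is injected automatically. When running outside DSS you can point the client at your instance instead, something like the sketch below (URL and API key are placeholders), though this only covers plain dataiku reads, not the dkuspark bindings:

import dataiku

# Placeholder values: use your DSS base URL and a personal API key generated in DSS.
dataiku.set_remote_dss("https://dss.example.com:11200", "YOUR_API_KEY")

# Outside DSS, reference the dataset with its fully qualified name PROJECTKEY.dataset_name
ds = dataiku.Dataset("MYPROJECT.my_dataset")
pdf = ds.get_dataframe()   # pandas DataFrame fetched through the DSS backend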
-
tgb417
FYI. The NYC Dataiku User Group will be holding a group meeting at which we will be discussing VS Code and Dataiku working together. Given your interest in this subject, we would love to have you join us for that conversation on Wednesday 4/7 at 11:00 am EDT. For further information and to RSVP click here.