Running PySpark locally with PyCharm/VSCode and a PySpark recipe

Solved!
afalak
Level 2

I am able to run a Python recipe; I installed the dataiku package 5.1.0 as described in the docs. All is well there.

Now I want to run a PySpark recipe, and this is what happens:

from dataiku import spark as dkuspark
ImportError: cannot import name 'spark'

 

I checked the download and I don't see the spark module anywhere.

8 Replies
Clément_Stenac
Dataiker

Hi,

You would need to install the pyspark package in your local Python. However, we would strongly advise against this:

* You would not get a properly working and configured Spark

* The "dataiku.spark" API, which provides transparent access to datasets, relies on a lot of machinery that is not in Python and that can only work in DSS.

In other words, for PySpark recipes, you should only do the editing locally in your IDE and run the recipe directly in DSS.

afalak
Level 2
Author

But in the end my code will run on a Spark cluster like Databricks or EMR or something else, not on DSS.

How will that pan out? Am I locked into the Dataiku runtime for running PySpark then?

I want to leverage visual PySpark recipes to develop the flow:

flow -> ml pipelines -> any spark cluster

 

afalak
Level 2
Author

I have installed the dataiku client 5.1.0 (downloaded from our server).

  1. I see pandasutils.py present, so "from dataiku import pandasutils as pdu" works in a Python recipe
  2. I don't see a spark file anywhere, even after searching, so "from dataiku import spark as dkuspark" will never work

Or should I be getting a more recent client version?

Our Dataiku server was recently upgraded from 5 to 6, so I was expecting a client version 6, but got 5.1.0 when I used this dataiku-package to get the package.

I am using VSCode with the Dataiku plugin.
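
In case it helps, a quick way to double-check what the installed client actually ships is something like the snippet below (just a sketch; it only lists the files inside the installed dataiku package):

import os
import dataiku

# show where the installed dataiku client lives and which modules it ships;
# if there is no spark module in the listing, "from dataiku import spark" cannot work
pkg_dir = os.path.dirname(dataiku.__file__)
print(pkg_dir)
print(sorted(os.listdir(pkg_dir)))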

 

jmac
Level 2

I had this problem too. Luckily you can cheat it in by just finding the dataiku python directory inside the DSS folder on whatever server it's installed in (assuming you have access). You'll have to manually copy the folders across to the python install once you've pip installed their package.

This will let VSCode or PyCharm or whatever IDE you choose recognise the Spark bindings at the very least.
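
Roughly, the copy step can be scripted like this — just a sketch, assuming you have already pulled the dataiku folder out of the DSS install's python directory onto your machine (both paths below are placeholders):

import os
import shutil
import site

# placeholder: local copy of the dataiku package taken from the DSS server's python directory
dss_dataiku_dir = "/path/to/dss/python/dataiku"

# overwrite the pip-installed dataiku package with the full one from the server
target_dir = os.path.join(site.getsitepackages()[0], "dataiku")
shutil.copytree(dss_dataiku_dir, target_dir, dirs_exist_ok=True)  # dirs_exist_ok needs Python 3.8+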

afalak
Level 2
Author
wow... it worked. I just saw their Spark bindings inside that folder and copied everything from there to Lib/site-packages, and it works.

thanks

afalak
Level 2
Author

Well, it was able to import, but the Spark context still fails to create.

I added a few Spark jars etc. to the spark-submit command... but to no avail.

At last I am stuck at 'No ticket in environment, cannot continue', but at least I moved forward.
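
For the record, the documented way of using the client from outside DSS seems to be set_remote_dss; the URL, API key and project below are placeholders, and I have not confirmed that it gets past the ticket error:

import dataiku

# point the client at the DSS server instead of relying on the ticket DSS injects at runtime
dataiku.set_remote_dss("https://our-dss-server:11200", "MY_API_KEY")  # placeholders
dataiku.set_default_project_key("MY_PROJECT")                         # placeholder project key

# dataset reads then go through the public API rather than the in-cluster machinery
df = dataiku.Dataset("my_dataset").get_dataframe()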

jmac
Level 2

Yeah, if you want to be able to run stuff using their bindings, then you'll have to create a mock dataiku service which you'll inject at runtime.
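
Very roughly, the mock looks something like this — only a sketch, assuming your recipe code calls dkuspark.get_dataframe and that the datasets sit as parquet under a known base path (the path below is a placeholder):

import sys
import types

import dataiku

# placeholder: base path where the datasets live as parquet files
hdfs_base = "/path/to/spark/parquets/"

# build a stand-in for dataiku.spark that reads parquet directly
fake_dkuspark = types.ModuleType("dataiku.spark")

def get_dataframe(sql_context, dataset):
    # same call shape as dkuspark.get_dataframe(sqlContext, dataset) in generated recipes
    return sql_context.read.parquet(hdfs_base + dataset.full_name.replace(".", "/"))

fake_dkuspark.get_dataframe = get_dataframe

# inject it before the recipe does "from dataiku import spark as dkuspark"
sys.modules["dataiku.spark"] = fake_dkuspark
dataiku.spark = fake_dkuspark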

Failing that, you can just ignore their API calls for Spark (dkuspark) and do native parquet reads using the Datasets API. Probably not a bad idea for all things Spark, considering how shoddily their code is written (I took a peek). You could try starting with the code below:

import dataiku
from pyspark.sql import SparkSession

# base HDFS directory where the datasets are stored as parquet (placeholder path)
hdfs_base = "/path/to/spark/parquets/"

def get_path(dataset):
    # dataset full names look like "PROJECT.dataset_name"; map them to PROJECT/dataset_name under the base path
    return hdfs_base + dataset.full_name.replace(".", "/")

spark = SparkSession.builder.getOrCreate()

my_dataset = dataiku.Dataset("my_dataset")
dataset_df = spark.read.parquet(get_path(my_dataset))

I often have to do this anyway, because the dkuspark API fails with certain data types, or handles partitioning incorrectly.

 

tgb417

@jmac @afalak ,

FYI.  The NYC Dataiku User Group will be holding a group meeting at which we will be discussing VS Code and Dataiku working together.  Given your interest in this subject, we would love to have you join us for that conversation on Wednesday 4/7 at 11:00 am EDT.  For further information and to RSVP click here.

--Tom