Running PySpark locally with PyCharm/VS Code and a PySpark recipe

afalak
afalak Registered Posts: 6 ✭✭✭✭

I am able to run a Python recipe after installing the dataiku package 5.1.0 as described in the docs. All is well there.

Now I want to run a PySpark recipe, and this is what happens:

from dataiku import spark as dkuspark
ImportError: cannot import name 'spark'

I checked the downloaded package and I don't see a spark module anywhere.
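A quick way to confirm whether the installed package actually ships the submodule is sketched below. The helper name `has_submodule` is made up for illustration; it only relies on the standard library's `importlib.util.find_spec`, which imports the parent package but not the submodule itself.

```python
import importlib.util


def has_submodule(package, submodule):
    """Return True if `package.submodule` can be located on the current Python path."""
    try:
        # find_spec imports the parent package, then looks for the submodule.
        return importlib.util.find_spec(f"{package}.{submodule}") is not None
    except ModuleNotFoundError:
        # The parent package is missing (or is not a package at all).
        return False


# Prints False when the local dataiku install has no spark submodule.
print(has_submodule("dataiku", "spark"))
```

If this prints False while `has_submodule("dataiku", "pandasutils")` prints True, the local package simply does not include the Spark bindings.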

Best Answers

  • jmac
    jmac Registered Posts: 3 ✭✭✭✭
    Answer ✓

    I had this problem too. Luckily you can cheat it in by finding the dataiku python directory inside the DSS folder on whatever server it's installed on (assuming you have access). Once you've pip installed their package, you'll have to manually copy the folders across to your local Python install.

    This will let VSCode or PyCharm or whatever IDE you choose recognise the Spark bindings at the very least.
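The manual copy described above could be scripted roughly like this. It is only a sketch: the function name `copy_missing_entries` is hypothetical, and both directory arguments are placeholders you would have to fill in yourself (the dataiku python directory taken from the DSS server, and the pip-installed `dataiku` folder in your local `site-packages`). It copies only what is missing, so nothing pip installed gets overwritten.

```python
import shutil
from pathlib import Path


def copy_missing_entries(src_pkg, dst_pkg):
    """Copy files and folders that exist in src_pkg but not in dst_pkg."""
    src_pkg, dst_pkg = Path(src_pkg), Path(dst_pkg)
    copied = []
    for entry in sorted(src_pkg.iterdir()):
        target = dst_pkg / entry.name
        # Skip bytecode caches and anything the pip install already provides.
        if entry.name == "__pycache__" or target.exists():
            continue
        if entry.is_dir():
            shutil.copytree(entry, target)
        else:
            shutil.copy2(entry, target)
        copied.append(entry.name)
    return copied
```

Calling it with the two directories returns the names of the entries it copied, for example the folder holding the Spark bindings.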

  • jmac
    jmac Registered Posts: 3 ✭✭✭✭
    edited July 17 Answer ✓

    Yeah, if you want to be able to run stuff using their bindings, then you'll have to create a mock dataiku service which you'll inject at runtime.
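One lightweight reading of "inject at runtime" is to stub the module in `sys.modules` rather than mock the DSS backend itself. The sketch below does that. `install_fake_dkuspark` and the in-memory `frames` dict are made up for illustration, and the stub's `get_dataframe` only mirrors the shape of the real call; it does not talk to DSS at all.

```python
import sys
import types


def install_fake_dkuspark(frames):
    """Register a stand-in `dataiku.spark` module backed by an in-memory dict."""
    fake_spark = types.ModuleType("dataiku.spark")

    def get_dataframe(sql_context, dataset):
        # Stand-in only: look the dataset up by name instead of calling DSS.
        # Assumes the dataset object exposes `.name`; plain strings also work.
        name = getattr(dataset, "name", dataset)
        return frames[name]

    fake_spark.get_dataframe = get_dataframe

    # Reuse the real dataiku package if it is importable, else create an empty one.
    try:
        import dataiku as pkg
    except ImportError:
        pkg = types.ModuleType("dataiku")
        sys.modules["dataiku"] = pkg

    # Make both `from dataiku import spark` and `import dataiku.spark` resolve.
    pkg.spark = fake_spark
    sys.modules["dataiku.spark"] = fake_spark
    return fake_spark
```

After calling `install_fake_dkuspark({...})` at the top of a local test run, `from dataiku import spark as dkuspark` imports the stub, so the recipe code can be exercised against whatever DataFrames you put in the dict.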

    Failing that, you can just ignore their API calls for Spark (dkuspark) and do native parquet reads, using the Datasets API only to work out the path. Probably not a bad idea for all things Spark, considering how shoddily their code is written (I took a peek). You could try starting with the code below:

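    import dataiku
    from pyspark.sql import SparkSession
    
    # Reuse the active Spark session, or create one if none exists.
    spark = SparkSession.builder.getOrCreate()
    
    # Placeholder: root directory under which the datasets' parquet files live.
    hdfs_base = "/path/to/spark/parquets"
    
    def get_path(dataset):
        # Map a dotted full name such as PROJECT.dataset to PROJECT/dataset.
        return hdfs_base + "/" + dataset.full_name.replace(".", "/")
    
    my_dataset = dataiku.Dataset("my_dataset")
    dataset_df = spark.read.parquet(get_path(my_dataset))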

    I often have to do this anyway, because the dkuspark API fails with certain data types, or handles partitioning incorrectly.

Answers

  • Clément_Stenac
    Clément_Stenac Dataiker, Dataiku DSS Core Designer, Registered Posts: 753 Dataiker

    Hi,

    You would need to install the pyspark package in your local Python environment. However, we would strongly advise against this:

    * You would not get a properly working and configured Spark

    * The "dataiku.spark" API which provides transparent access to datasets relies on a lot of machinery that is not in Python and that can only work in DSS.

    In other words, for PySpark recipes, you should only edit the code locally in your IDE and run it directly in DSS.

  • afalak
    afalak Registered Posts: 6 ✭✭✭✭

    But ultimately my code will run on a Spark cluster like Databricks or EMR or something else, not on DSS.

    How will that pan out? Am I locked into the Dataiku runtime for running PySpark then?

    I want to leverage visual PySpark recipes to develop the flow:

    flow -> ml pipelines -> any spark cluster

  • afalak
    afalak Registered Posts: 6 ✭✭✭✭

    I have installed the dataiku client 5.1.0 (downloaded from our server).

    1. I see pandasutils.py is present, so "from dataiku import pandasutils as pdu" works in a Python recipe.
    2. I don't see a spark file anywhere, even after searching, so "from dataiku import spark as dkuspark" will never work.

    Or should I be getting a more recent client version?

    Our Dataiku server was recently upgraded from 5 to 6, so I was expecting client version 6, but I got 5.1.0 when I used this dataiku-package to get the package.

    I am using VS Code with the Dataiku plugin.

  • afalak
    afalak Registered Posts: 6 ✭✭✭✭
    Wow... it worked. I just saw their Spark bindings inside that folder, copied everything from there to Lib/site-packages, and it works.

    Thanks.
  • afalak
    afalak Registered Posts: 6 ✭✭✭✭

    Well, the import works now, but the Spark context still fails to create.

    I added a few Spark jars etc. to the spark-submit command, but to no avail.

    In the end I am stuck at 'No ticket in environment, cannot continue', but at least I moved forward.

  • tgb417
    tgb417 Dataiku DSS Core Designer, Dataiku DSS & SQL, Dataiku DSS ML Practitioner, Dataiku DSS Core Concepts, Neuron 2020, Neuron, Registered, Dataiku Frontrunner Awards 2021 Finalist, Neuron 2021, Neuron 2022, Frontrunner 2022 Finalist, Frontrunner 2022 Winner, Dataiku Frontrunner Awards 2021 Participant, Frontrunner 2022 Participant, Neuron 2023 Posts: 1,598 Neuron

    @jmac @afalak,

    FYI. The NYC Dataiku User Group will be holding a group meeting at which we will be discussing VS Code and Dataiku working together. Given your interest in this subject, we would love to have you join us for that conversation on Wednesday 4/7 at 11:00 am EDT. For further information and to RSVP click here.
