Databricks as PySpark engine

aburkh (Registered · Posts: 9 ✭✭✭)

Currently, Dataiku only supports Databricks through Databricks Connect (in a Python recipe); it does not support PySpark recipes or selecting Spark as the engine for visual recipes.

As a result, we cannot develop visual recipes on Spark, and we either have to use Databricks Connect in Python recipes or develop directly in Databricks and trigger the execution of notebooks through REST APIs.
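For reference, this is roughly what the first workaround looks like: a minimal sketch of a Dataiku Python recipe that pushes computation to Databricks via Databricks Connect. The profile, table, and dataset names are placeholders.

```python
# Minimal sketch of the Databricks Connect workaround in a Dataiku Python recipe.
# Profile, table, and dataset names are placeholders; adapt to your setup.
import dataiku
from databricks.connect import DatabricksSession

# Spark session backed by a Databricks cluster, resolved from a named profile
spark = DatabricksSession.builder.profile("my-databricks-profile").getOrCreate()

# Transform a Unity Catalog table with PySpark, executed on Databricks
orders = spark.read.table("main.sales.orders")
summary = orders.groupBy("customer_id").count()

# Write the (collected) result back through the Dataiku dataset API
dataiku.Dataset("orders_by_customer").write_with_dataframe(summary.toPandas())
```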

Would it be possible to enable Databricks as a Spark engine in Dataiku?

3 votes

Comments

  • Turribeach (Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 · Posts: 2,160)

    Visual recipes will never translate to Spark. You can, however, use visual recipes and push them down to a Databricks SQL Warehouse; Databricks will then execute them in SparkSQL against the same data you have in your Databricks Unity Catalog. I am not sure why you say you can't use Databricks Connect in Python recipes. And if you want to develop directly in Databricks, then what's the point of using Dataiku?

    A Product Idea is probably not the right place to discuss the Databricks integration issues you think you have, so I suggest you start separate posts in the Using Dataiku section.

  • aburkh (Registered · Posts: 9 ✭✭✭)

    I'm confused; doesn't the documentation specifically describe visual prepare recipes executed with a Spark engine? See Execution engines — Dataiku DSS 13 documentation.

    We do not want to develop in Databricks; we would prefer to use Dataiku visual recipes and, where needed, Spark recipes. But Databricks is currently not fully supported, so we have to resort to writing code manually.

    Dataiku visual recipes are unable to translate several processors to SQL, such as Fold ("Fold" processors in visual recipe - Implement In-Database engine — Dataiku Community), even though the databases fully support those capabilities. Per the documentation (Execution engines — Dataiku DSS 13 documentation), the table below lists the processors that are supposed to be supported by each engine for the visual prepare recipe.
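    To illustrate what "writing code manually" means in practice: a fold (unpivot) is a one-liner in Spark SQL, which Databricks supports natively. A minimal sketch, assuming an existing Databricks Connect session `spark`; the table and column names are placeholders:

    ```python
    # Sketch: the "Fold" processor hand-coded with Spark SQL's stack(),
    # assuming an existing Databricks Connect session `spark`.
    # Table and column names are placeholders.
    df = spark.read.table("main.demo.quarterly_sales")  # columns: id, q1, q2

    folded = df.selectExpr(
        "id",
        "stack(2, 'q1', q1, 'q2', q2) AS (quarter, value)",  # fold q1/q2 into rows
    )
    folded.show()
    ```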

    Can you confirm if my understanding is correct?

    Processor | DSS | In-Database (SQL) | Spark
    --- | --- | --- | ---
    Extract from array | YES | NO | YES
    Fold an array | YES | NO | YES
    Sort array | YES | NO | YES
    Concatenate JSON arrays | YES | NO | YES
    Discretize (bin) numerical values | YES | Partially | YES
    Change coordinates system | YES | Partially | YES
    Copy column | YES | YES | YES
    Rename columns | YES | YES | YES
    Concatenate columns | YES | YES | YES
    Delete/Keep columns by name | YES | YES | YES
    Column Pseudonymization | YES | NO | YES
    Count occurrences | YES | NO | YES
    Convert currencies | YES | Snowflake only | YES
    Split currencies in column | YES | Snowflake only | YES
    Create if, then, else statements | YES | YES | YES
    Extract date elements | YES | Partially | YES
    Compute difference between dates | YES | Partially | YES
    Format date with custom format | YES | Partially | YES
    Parse to standard date format | YES | Partially | YES
    Split e-mail addresses | YES | Snowflake only | YES
    Enrich from French department | YES | NO | YES
    Enrich from French postcode | YES | NO | YES
    Enrich with build context | YES | NO | YES
    Enrich with record context | YES | NO | YES
    Extract ngrams | YES | NO | YES
    Extract numbers | YES | Snowflake only | YES
    Fill column | YES | NO | YES
    Fill empty cells with fixed value | YES | YES | YES
    Impute with computed value | YES | NO | YES
    Filter rows/cells on date | YES | Partially | YES
    Filter rows/cells with formula | YES | Partially | YES
    Filter invalid rows/cells | YES | Partially | YES
    Filter rows/cells on numerical range | YES | YES | YES
    Filter rows/cells on value | YES | YES | YES
    Find and replace | YES | Partially | YES
    Flag rows/cells on date range | YES | Partially | YES
    Flag rows with formula | YES | Partially | YES
    Flag invalid rows | YES | Partially | YES
    Flag rows on numerical range | YES | YES | YES
    Flag rows on value | YES | YES | YES
    Fold multiple columns | YES | NO | YES
    Fold multiple columns by pattern | YES | NO | YES
    Fold object keys | YES | NO | YES
    Formula | YES | Partially | YES
    Fuzzy join with other dataset (memory-based) | YES | NO | YES
    Generate Big Data | YES | NO | YES
    Compute distance between geospatial objects | YES | Partially | YES
    Extract from geo column | YES | Partially | YES
    Geo-join | YES | NO | YES
    Resolve GeoIP | YES | Snowflake only | YES
    Create area around a geopoint | YES | NO | YES
    Create GeoPoint from lat/lon | YES | NO | YES
    Extract lat/lon from GeoPoint | YES | NO | YES
    Extract with grok | YES | NO | YES
    Flag holidays | YES | Snowflake only | YES
    Split invalid cells into another column | YES | NO | YES
    Join with other dataset (memory-based) | YES | NO | YES
    Extract with JSONPath | YES | NO | YES
    Group long-tail values | YES | NO | YES
    Compute the average of numerical values | YES | NO | YES
    Translate values using meaning | YES | NO | YES
    Normalize measure | YES | Snowflake only | YES
    Merge long-tail values | YES | NO | YES
    Move columns | YES | YES | YES
    Negate boolean value | YES | NO | YES
    Force numerical range | YES | NO | YES
    Generate numerical combinations | YES | NO | YES
    Convert number formats | YES | NO | YES
    Nest columns | YES | NO | YES
    Unnest object (flatten JSON) | YES | NO | YES
    Extract with regular expression | YES | Partially | YES
    Pivot | YES | NO | YES
    Python function | YES | NO | YES
    Split HTTP Query String | YES | Snowflake only | YES
    Remove rows where cell is empty | YES | YES | YES
    Round numbers | YES | NO | YES
    Simplify text | YES | Snowflake only | YES
    Split and fold | YES | NO | YES
    Split into chunks | YES | NO | YES
    Split and unfold | YES | YES | YES
    Split column | YES | YES | YES
    Switch case | YES | NO | YES
    Transform string | YES | Partially | YES
    Tokenize text | YES | NO | YES
    Transpose rows to columns | YES | NO | YES
    Triggered unfold | YES | NO | YES
    Unfold | YES | YES | YES
    Unfold an array | YES | NO | YES
    Convert a UNIX timestamp to a date | YES | Snowflake only | YES
    Fill empty cells with previous/next value | YES | NO | YES
    Split URL (into protocol, host, port, …) | YES | Snowflake only | YES
    Classify User-Agent | YES | Snowflake only | YES
    Generate a best-effort visitor id | YES | Snowflake only | YES
    Zip JSON arrays | YES | NO | YES

  • Turribeach (Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 · Posts: 2,160)

    I stand corrected: visual recipes can translate to Spark, but this requires a regular Spark cluster, not a Databricks compute cluster. As you have noted, some processors are not available through the SQL engine, but as you have also said, the underlying operations are supported by the database, so in those cases you can use a SQL recipe. Many visual recipes also let you use SQL expressions, which means you don't need to fall back to a full SQL recipe. It is perfectly possible to use Dataiku visual and code recipes against a Databricks SQL Warehouse and a Databricks compute cluster, with all your data staying in Databricks and never being loaded into memory on the DSS server. But you have to stick to visual recipes using the SQL engine, and to Databricks Connect for Python recipes.
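    To illustrate the SQL fallback from code: a Python recipe can run SQL directly against the Databricks connection with Dataiku's SQLExecutor2, so the query executes in the SQL Warehouse and only the (small) result comes back to DSS. A minimal sketch; the connection and table names are placeholders:

    ```python
    # Sketch: pushing SQL down to Databricks from a Dataiku Python recipe.
    # Connection and table names are placeholders.
    from dataiku import SQLExecutor2

    executor = SQLExecutor2(connection="my-databricks-connection")
    df = executor.query_to_df(
        "SELECT customer_id, COUNT(*) AS n_orders "
        "FROM main.sales.orders GROUP BY customer_id"
    )
    print(df.head())
    ```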

  • aburkh (Registered · Posts: 9 ✭✭✭)

    Yes, exactly. Currently, Dataiku's Databricks integration covers about 80% of what its regular Spark cluster integration covers 100% of.

    It would be great to have better Databricks integration, either through better SQL support or better Spark support, so that we don't have to resort to workarounds such as manually coding transformations (SQL recipes or Databricks Connect).
