Databricks as PySpark engine
Currently, Dataiku only supports Databricks through Databricks Connect (in a Python recipe); it does not support PySpark recipes or selecting a Spark engine for visual recipes.
As a result, we are not able to develop visual recipes with Spark, and we either need to use Databricks Connect in Python recipes, or develop directly in Databricks and trigger the execution of notebooks through REST APIs.
Would it be possible to enable Databricks as a Spark engine in Dataiku?
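For context, this is roughly what the Databricks Connect workaround looks like inside a Python recipe today. This is only a minimal sketch: the environment variables and Unity Catalog table names are placeholders, not our real setup.

```python
# Minimal sketch of the current workaround: a Dataiku Python recipe that opens
# a Databricks Connect session so the transformation runs on the Databricks
# cluster rather than through a Dataiku Spark engine. All names are placeholders.
import os

from databricks.connect import DatabricksSession

# Databricks Connect v2: the SparkSession is backed by a remote Databricks cluster.
spark = (
    DatabricksSession.builder.remote(
        host=os.environ["DATABRICKS_HOST"],
        token=os.environ["DATABRICKS_TOKEN"],
        cluster_id=os.environ["DATABRICKS_CLUSTER_ID"],
    ).getOrCreate()
)

# Read from Unity Catalog, transform in Spark, write the result back.
orders = spark.read.table("main.sales.orders")  # hypothetical table
per_customer = orders.groupBy("customer_id").count()
per_customer.write.mode("overwrite").saveAsTable("main.sales.orders_by_customer")
```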
Comments
-
Turribeach
Visual recipes will never translate to Spark. You can, however, use visual recipes and push them down to a Databricks SQL Warehouse; Databricks will then execute them as Spark SQL against the same data you have in your Databricks Unity Catalog. I'm not sure why you say you can't use Databricks Connect in Python recipes. And if you want to develop directly in Databricks, then what's the point of using Dataiku?
A Product Idea is probably not the right place to discuss the Databricks integration issues you think you have, so I suggest you start separate posts in the Using Dataiku section.
-
I'm confused: doesn't the documentation specifically describe visual Prepare recipes executed with a Spark engine? See Execution engines — Dataiku DSS 13 documentation.
We do not want to develop in Databricks; we would prefer to be able to use Dataiku visual recipes and, occasionally, Spark recipes. But Databricks is currently not fully supported, so we have to resort to writing code manually.
Dataiku visual recipes are unable to translate several processors to SQL, such as Fold ("Fold" processors in visual recipe - Implement In-Database engine — Dataiku Community), even though the underlying databases fully support those capabilities (see the Spark SQL sketch after the table below). Per the documentation (Execution engines — Dataiku DSS 13 documentation), the table below lists which processors are supposed to be supported by each engine for the visual Prepare recipe.
Can you confirm whether my understanding is correct?
| Processor | DSS | In Database (SQL) | Spark |
|---|---|---|---|
| Extract from array | YES | NO | YES |
| Fold an array | YES | NO | YES |
| Sort array | YES | NO | YES |
| Concatenate JSON arrays | YES | NO | YES |
| Discretize (bin) numerical values | YES | Partially | YES |
| Change coordinates system | YES | Partially | YES |
| Copy column | YES | YES | YES |
| Rename columns | YES | YES | YES |
| Concatenate columns | YES | YES | YES |
| Delete/Keep columns by name | YES | YES | YES |
| Column Pseudonymization | YES | NO | YES |
| Count occurrences | YES | NO | YES |
| Convert currencies | YES | Snowflake only | YES |
| Split currencies in column | YES | Snowflake only | YES |
| Create if, then, else statements | YES | YES | YES |
| Extract date elements | YES | Partially | YES |
| Compute difference between dates | YES | Partially | YES |
| Format date with custom format | YES | Partially | YES |
| Parse to standard date format | YES | Partially | YES |
| Split e-mail addresses | YES | Snowflake only | YES |
| Enrich from French department | YES | NO | YES |
| Enrich from French postcode | YES | NO | YES |
| Enrich with build context | YES | NO | YES |
| Enrich with record context | YES | NO | YES |
| Extract ngrams | YES | NO | YES |
| Extract numbers | YES | Snowflake only | YES |
| Fill column | YES | NO | YES |
| Fill empty cells with fixed value | YES | YES | YES |
| Impute with computed value | YES | NO | YES |
| Filter rows/cells on date | YES | Partially | YES |
| Filter rows/cells with formula | YES | Partially | YES |
| Filter invalid rows/cells | YES | Partially | YES |
| Filter rows/cells on numerical range | YES | YES | YES |
| Filter rows/cells on value | YES | YES | YES |
| Find and replace | YES | Partially | YES |
| Flag rows/cells on date range | YES | Partially | YES |
| Flag rows with formula | YES | Partially | YES |
| Flag invalid rows | YES | Partially | YES |
| Flag rows on numerical range | YES | YES | YES |
| Flag rows on value | YES | YES | YES |
| Fold multiple columns | YES | NO | YES |
| Fold multiple columns by pattern | YES | NO | YES |
| Fold object keys | YES | NO | YES |
| Formula | YES | Partially | YES |
| Fuzzy join with other dataset (memory-based) | YES | NO | YES |
| Generate Big Data | YES | NO | YES |
| Compute distance between geospatial objects | YES | Partially | YES |
| Extract from geo column | YES | Partially | YES |
| Geo-join | YES | NO | YES |
| Resolve GeoIP | YES | Snowflake only | YES |
| Create area around a geopoint | YES | NO | YES |
| Create GeoPoint from lat/lon | YES | NO | YES |
| Extract lat/lon from GeoPoint | YES | NO | YES |
| Extract with grok | YES | NO | YES |
| Flag holidays | YES | Snowflake only | YES |
| Split invalid cells into another column | YES | NO | YES |
| Join with other dataset (memory-based) | YES | NO | YES |
| Extract with JSONPath | YES | NO | YES |
| Group long-tail values | YES | NO | YES |
| Compute the average of numerical values | YES | NO | YES |
| Translate values using meaning | YES | NO | YES |
| Normalize measure | YES | Snowflake only | YES |
| Merge long-tail values | YES | NO | YES |
| Move columns | YES | YES | YES |
| Negate boolean value | YES | NO | YES |
| Force numerical range | YES | NO | YES |
| Generate numerical combinations | YES | NO | YES |
| Convert number formats | YES | NO | YES |
| Nest columns | YES | NO | YES |
| Unnest object (flatten JSON) | YES | NO | YES |
| Extract with regular expression | YES | Partially | YES |
| Pivot | YES | NO | YES |
| Python function | YES | NO | YES |
| Split HTTP Query String | YES | Snowflake only | YES |
| Remove rows where cell is empty | YES | YES | YES |
| Round numbers | YES | NO | YES |
| Simplify text | YES | Snowflake only | YES |
| Split and fold | YES | NO | YES |
| Split into chunks | YES | NO | YES |
| Split and unfold | YES | YES | YES |
| Split column | YES | YES | YES |
| Switch case | YES | NO | YES |
| Transform string | YES | Partially | YES |
| Tokenize text | YES | NO | YES |
| Transpose rows to columns | YES | NO | YES |
| Triggered unfold | YES | NO | YES |
| Unfold | YES | YES | YES |
| Unfold an array | YES | NO | YES |
| Convert a UNIX timestamp to a date | YES | Snowflake only | YES |
| Fill empty cells with previous/next value | YES | NO | YES |
| Split URL (into protocol, host, port, …) | YES | Snowflake only | YES |
| Classify User-Agent | YES | Snowflake only | YES |
| Generate a best-effort visitor id | YES | Snowflake only | YES |
| Zip JSON arrays | YES | NO | YES |
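To illustrate the Fold case from the table: the operation itself is easy to express in Spark SQL (and most databases support an equivalent unpivot), which is why the lack of engine translation is frustrating. A sketch, with invented column names:

```python
# Sketch: "Fold multiple columns" (wide -> long) expressed directly in Spark SQL
# via stack(), showing the engine itself supports the operation even where the
# visual recipe cannot translate it. Column names are invented for the example.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("p1", 10, 20), ("p2", 30, 40)],
    ["product", "q1_sales", "q2_sales"],
)

folded = df.selectExpr(
    "product",
    "stack(2, 'q1', q1_sales, 'q2', q2_sales) AS (quarter, sales)",
)
folded.show()
# product | quarter | sales
# p1      | q1      | 10
# p1      | q2      | 20
# ...
```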
-
Turribeach
I stand corrected: visual recipes can translate to Spark, but that targets a regular Spark cluster, not a Databricks compute cluster. As you have noted, some processors are not available via the in-database (SQL) engine, but as you have also said, the operations themselves can be done in SQL, so in those cases you can use a SQL recipe instead. A lot of visual recipes also let you use SQL expressions, which means you often don't need to fall back to a full SQL recipe. It is perfectly possible to use Dataiku visual and code recipes against a Databricks SQL Warehouse and a Databricks compute cluster with all your data staying in Databricks and never being loaded into memory on the DSS server. But you have to stick to visual recipes using the SQL engine, and Databricks Connect for Python recipes.
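For instance, where a processor has no SQL translation, you can push the equivalent SQL straight down to the warehouse from a Python recipe with the databricks-sql-connector package. A sketch, with placeholder connection details and table names:

```python
# Sketch: executing SQL directly on a Databricks SQL Warehouse from a Python
# recipe using the databricks-sql-connector package. Connection details and
# table names are placeholders.
import os

from databricks import sql

with sql.connect(
    server_hostname=os.environ["DATABRICKS_SERVER_HOSTNAME"],  # e.g. adb-....azuredatabricks.net
    http_path=os.environ["DATABRICKS_HTTP_PATH"],              # the warehouse's HTTP path
    access_token=os.environ["DATABRICKS_TOKEN"],
) as conn:
    with conn.cursor() as cursor:
        cursor.execute(
            """
            CREATE OR REPLACE TABLE main.sales.orders_by_customer AS
            SELECT customer_id, COUNT(*) AS n_orders
            FROM main.sales.orders
            GROUP BY customer_id
            """
        )
```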
-
Yes, exactly. Currently, Dataiku's Databricks integration works maybe 80% of the time, where the regular Spark cluster integration works 100% of the time.
It would be great to have better Databricks integration, either through better SQL support or better Spark support, so that we don't have to resort to workarounds such as manually coding transformations (SQL recipes or Databricks Connect).
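For reference, the notebook-triggering workaround mentioned in the original post looks roughly like this. A sketch against the Databricks Jobs REST API; the hostname, token, and job_id are placeholders:

```python
# Sketch: triggering a pre-existing Databricks job (wrapping a notebook) over
# the Jobs REST API (POST /api/2.1/jobs/run-now). All identifiers are placeholders.
import os

import requests

response = requests.post(
    f"https://{os.environ['DATABRICKS_SERVER_HOSTNAME']}/api/2.1/jobs/run-now",
    headers={"Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}"},
    json={"job_id": 123},  # hypothetical job id
    timeout=30,
)
response.raise_for_status()
print("Triggered run:", response.json()["run_id"])
```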