Databricks as PySpark engine

aburkh (Registered · Posts: 9 ✭✭✭)

Currently, Dataiku only supports Databricks through Databricks Connect (in a Python recipe); it does not support PySpark recipes or selecting Spark as the engine for visual recipes.

As a result, we cannot develop visual recipes on Spark, and we either have to use Databricks Connect in Python recipes or develop directly in Databricks and trigger the execution of notebooks through REST APIs.
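For reference, this is roughly what the first workaround looks like: a minimal sketch of a Dataiku Python recipe that pushes computation to Databricks via Databricks Connect. The profile, table, and dataset names are placeholders.

```python
# Minimal sketch of the Databricks Connect workaround in a Dataiku Python recipe.
# Profile, table, and dataset names are placeholders; adapt to your setup.
import dataiku
from databricks.connect import DatabricksSession

# Spark session backed by a Databricks cluster, resolved from a named profile
spark = DatabricksSession.builder.profile("my-databricks-profile").getOrCreate()

# Transform a Unity Catalog table with PySpark, executed on Databricks
orders = spark.read.table("main.sales.orders")
summary = orders.groupBy("customer_id").count()

# Write the (collected) result back through the Dataiku dataset API
dataiku.Dataset("orders_by_customer").write_with_dataframe(summary.toPandas())
```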

Would it be possible to enable Databricks as a Spark engine in Dataiku?

3 votes

Comments

  • Turribeach (Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 · Posts: 2,160)

    Visual recipes will never translate to Spark. You can, however, use visual recipes and push them down to a Databricks SQL Warehouse; Databricks will then execute them in SparkSQL against the same data you have in your Databricks Unity Catalog. I am not sure why you say you can't use Databricks Connect in Python recipes. And if you want to develop directly in Databricks, then what's the point of using Dataiku?

    A Product Idea is probably not the right place to discuss the Databricks integration issues you think you have, so I suggest you start separate posts in the Using Dataiku section.

  • aburkh (Registered · Posts: 9 ✭✭✭)

    I'm confused; doesn't the documentation specifically describe visual prepare recipes executed with a Spark engine? See Execution engines — Dataiku DSS 13 documentation.

    We do not want to develop in Databricks; we would prefer to use Dataiku visual recipes and, where needed, Spark recipes. But Databricks is currently not fully supported, so we have to resort to writing code manually.

    Dataiku visual recipes are unable to translate several processors to SQL, such as Fold ("Fold" processors in visual recipe - Implement In-Database engine — Dataiku Community), even though the databases fully support those capabilities. Per the documentation (Execution engines — Dataiku DSS 13 documentation), the table below lists the processors that are supposed to be supported by each engine for the visual prepare recipe.
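    To illustrate what "writing code manually" means in practice: a fold (unpivot) is a one-liner in Spark SQL, which Databricks supports natively. A minimal sketch, assuming an existing Databricks Connect session `spark`; the table and column names are placeholders:

    ```python
    # Sketch: the "Fold" processor hand-coded with Spark SQL's stack(),
    # assuming an existing Databricks Connect session `spark`.
    # Table and column names are placeholders.
    df = spark.read.table("main.demo.quarterly_sales")  # columns: id, q1, q2

    folded = df.selectExpr(
        "id",
        "stack(2, 'q1', q1, 'q2', q2) AS (quarter, value)",  # fold q1/q2 into rows
    )
    folded.show()
    ```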

    Can you confirm if my understanding is correct?

    Processor | DSS | In-Database (SQL) | Spark
    --- | --- | --- | ---
    Extract from array | YES | NO | YES
    Fold an array | YES | NO | YES
    Sort array | YES | NO | YES
    Concatenate JSON arrays | YES | NO | YES
    Discretize (bin) numerical values | YES | Partially | YES
    Change coordinates system | YES | Partially | YES
    Copy column | YES | YES | YES
    Rename columns | YES | YES | YES
    Concatenate columns | YES | YES | YES
    Delete/Keep columns by name | YES | YES | YES
    Column Pseudonymization | YES | NO | YES
    Count occurrences | YES | NO | YES
    Convert currencies | YES | Snowflake only | YES
    Split currencies in column | YES | Snowflake only | YES
    Create if, then, else statements | YES | YES | YES
    Extract date elements | YES | Partially | YES
    Compute difference between dates | YES | Partially | YES
    Format date with custom format | YES | Partially | YES
    Parse to standard date format | YES | Partially | YES
    Split e-mail addresses | YES | Snowflake only | YES
    Enrich from French department | YES | NO | YES
    Enrich from French postcode | YES | NO | YES
    Enrich with build context | YES | NO | YES
    Enrich with record context | YES | NO | YES
    Extract ngrams | YES | NO | YES
    Extract numbers | YES | Snowflake only | YES
    Fill column | YES | NO | YES
    Fill empty cells with fixed value | YES | YES | YES
    Impute with computed value | YES | NO | YES
    Filter rows/cells on date | YES | Partially | YES
    Filter rows/cells with formula | YES | Partially | YES
    Filter invalid rows/cells | YES | Partially | YES
    Filter rows/cells on numerical range | YES | YES | YES
    Filter rows/cells on value | YES | YES | YES
    Find and replace | YES | Partially | YES
    Flag rows/cells on date range | YES | Partially | YES
    Flag rows with formula | YES | Partially | YES
    Flag invalid rows | YES | Partially | YES
    Flag rows on numerical range | YES | YES | YES
    Flag rows on value | YES | YES | YES
    Fold multiple columns | YES | NO | YES
    Fold multiple columns by pattern | YES | NO | YES
    Fold object keys | YES | NO | YES
    Formula | YES | Partially | YES
    Fuzzy join with other dataset (memory-based) | YES | NO | YES
    Generate Big Data | YES | NO | YES
    Compute distance between geospatial objects | YES | Partially | YES
    Extract from geo column | YES | Partially | YES
    Geo-join | YES | NO | YES
    Resolve GeoIP | YES | Snowflake only | YES
    Create area around a geopoint | YES | NO | YES
    Create GeoPoint from lat/lon | YES | NO | YES
    Extract lat/lon from GeoPoint | YES | NO | YES
    Extract with grok | YES | NO | YES
    Flag holidays | YES | Snowflake only | YES
    Split invalid cells into another column | YES | NO | YES
    Join with other dataset (memory-based) | YES | NO | YES
    Extract with JSONPath | YES | NO | YES
    Group long-tail values | YES | NO | YES
    Compute the average of numerical values | YES | NO | YES
    Translate values using meaning | YES | NO | YES
    Normalize measure | YES | Snowflake only | YES
    Merge long-tail values | YES | NO | YES
    Move columns | YES | YES | YES
    Negate boolean value | YES | NO | YES
    Force numerical range | YES | NO | YES
    Generate numerical combinations | YES | NO | YES
    Convert number formats | YES | NO | YES
    Nest columns | YES | NO | YES
    Unnest object (flatten JSON) | YES | NO | YES
    Extract with regular expression | YES | Partially | YES
    Pivot | YES | NO | YES
    Python function | YES | NO | YES
    Split HTTP Query String | YES | Snowflake only | YES
    Remove rows where cell is empty | YES | YES | YES
    Round numbers | YES | NO | YES
    Simplify text | YES | Snowflake only | YES
    Split and fold | YES | NO | YES
    Split into chunks | YES | NO | YES
    Split and unfold | YES | YES | YES
    Split column | YES | YES | YES
    Switch case | YES | NO | YES
    Transform string | YES | Partially | YES
    Tokenize text | YES | NO | YES
    Transpose rows to columns | YES | NO | YES
    Triggered unfold | YES | NO | YES
    Unfold | YES | YES | YES
    Unfold an array | YES | NO | YES
    Convert a UNIX timestamp to a date | YES | Snowflake only | YES
    Fill empty cells with previous/next value | YES | NO | YES
    Split URL (into protocol, host, port, …) | YES | Snowflake only | YES
    Classify User-Agent | YES | Snowflake only | YES
    Generate a best-effort visitor id | YES | Snowflake only | YES
    Zip JSON arrays | YES | NO | YES

  • Turribeach (Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 · Posts: 2,160)

    I stand corrected: visual recipes can translate to Spark, but this requires a regular Spark cluster, not a Databricks compute cluster. As you have noted, some processors are not available through the SQL engine, but as you have also said, the underlying operations are supported by the database, so in those cases you can use a SQL recipe. Many visual recipes also let you use SQL expressions, which means you don't need to fall back to a full SQL recipe. It is perfectly possible to use Dataiku visual and code recipes against a Databricks SQL Warehouse and a Databricks compute cluster, with all your data staying in Databricks and never being loaded into memory on the DSS server. But you have to stick to visual recipes using the SQL engine, and to Databricks Connect for Python recipes.
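    To illustrate the SQL fallback from code: a Python recipe can run SQL directly against the Databricks connection with Dataiku's SQLExecutor2, so the query executes in the SQL Warehouse and only the (small) result comes back to DSS. A minimal sketch; the connection and table names are placeholders:

    ```python
    # Sketch: pushing SQL down to Databricks from a Dataiku Python recipe.
    # Connection and table names are placeholders.
    from dataiku import SQLExecutor2

    executor = SQLExecutor2(connection="my-databricks-connection")
    df = executor.query_to_df(
        "SELECT customer_id, COUNT(*) AS n_orders "
        "FROM main.sales.orders GROUP BY customer_id"
    )
    print(df.head())
    ```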

  • aburkh (Registered · Posts: 9 ✭✭✭)

    Yes, exactly. Currently, Dataiku's Databricks integration covers about 80% of what its regular Spark cluster integration covers 100% of.

    It would be great to have better Databricks integration, either through better SQL support or better Spark support, so that we don't have to resort to workarounds such as manually coding transformations (SQL recipes or Databricks Connect).
