Databricks as pyspark engine
 
Currently, Dataiku only supports Databricks through Databricks Connect (in a Python recipe); it does not support PySpark recipes or selecting the Spark engine for visual recipes.
As a result, we cannot develop visual recipes with Spark, and must either use Databricks Connect in Python recipes or develop directly in Databricks and trigger notebook execution through REST APIs.
Would it be possible to enable Databricks as a Spark engine in Dataiku?
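For readers unfamiliar with the REST workaround mentioned above, here is a minimal sketch of how such a trigger can look. It assumes the notebook has already been wrapped in a Databricks job and uses the Jobs API 2.1 `run-now` endpoint; the host, token, job id, and the helper name `build_run_now_request` are hypothetical placeholders, not anything taken from this thread.

```python
import json
import urllib.request


def build_run_now_request(host, token, job_id, notebook_params=None):
    """Build (but do not send) a Databricks Jobs API 2.1 run-now request.

    host, token and job_id are placeholders supplied by the caller.
    """
    payload = {"job_id": job_id}
    if notebook_params:
        # Optional key/value parameters passed to the notebook task.
        payload["notebook_params"] = notebook_params
    return urllib.request.Request(
        url=f"{host.rstrip('/')}/api/2.1/jobs/run-now",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` would submit the run. The orchestration then lives outside the Dataiku flow, which is the drawback this idea is trying to avoid.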
Comments
- 
            Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023, Circle Member Posts: 2,591 Neuron
            Visual recipes will never translate to Spark. You can, however, use visual recipes and push them down to a Databricks SQL Warehouse, and Databricks will execute them in Spark SQL using the same data you have in your Databricks Unity Catalog. Not sure why you say you can't use Databricks Connect in Python recipes. If you want to develop directly in Databricks, then what's the point of using Dataiku? A Product Idea is probably not the right place to discuss the Databricks integration issues you think you have, so I suggest you start separate posts in the Using Dataiku section.
- 
            I'm confused: doesn't the documentation specifically describe visual prepare recipes executed with a Spark engine? (Execution engines — Dataiku DSS 13 documentation)

            We do not want to develop in Databricks; we would prefer to use Dataiku visual recipes and, sometimes, Spark recipes. But currently Databricks is not fully supported, hence we have to resort to writing code manually. Dataiku visual recipes are unable to translate many processors to SQL, such as Fold ("Fold" processors in visual recipe - Implement In-Database engine — Dataiku Community), even though the databases fully support those capabilities.

            As per the documentation (Execution engines — Dataiku DSS 13 documentation), this is the list of processors that are supposed to be supported by each engine for the visual prepare recipe. Can you confirm that my understanding is correct?

| Processor | DSS | In Database (SQL) | Spark |
|---|---|---|---|
| Extract from array | YES | NO | YES |
| Fold an array | YES | NO | YES |
| Sort array | YES | NO | YES |
| Concatenate JSON arrays | YES | NO | YES |
| Discretize (bin) numerical values | YES | Partially | YES |
| Change coordinates system | YES | Partially | YES |
| Copy column | YES | YES | YES |
| Rename columns | YES | YES | YES |
| Concatenate columns | YES | YES | YES |
| Delete/Keep columns by name | YES | YES | YES |
| Column Pseudonymization | YES | NO | YES |
| Count occurrences | YES | NO | YES |
| Convert currencies | YES | Snowflake only | YES |
| Split currencies in column | YES | Snowflake only | YES |
| Create if, then, else statements | YES | YES | YES |
| Extract date elements | YES | Partially | YES |
| Compute difference between dates | YES | Partially | YES |
| Format date with custom format | YES | Partially | YES |
| Parse to standard date format | YES | Partially | YES |
| Split e-mail addresses | YES | Snowflake only | YES |
| Enrich from French department | YES | NO | YES |
| Enrich from French postcode | YES | NO | YES |
| Enrich with build context | YES | NO | YES |
| Enrich with record context | YES | NO | YES |
| Extract ngrams | YES | NO | YES |
| Extract numbers | YES | Snowflake only | YES |
| Fill column | YES | NO | YES |
| Fill empty cells with fixed value | YES | YES | YES |
| Impute with computed value | YES | NO | YES |
| Filter rows/cells on date | YES | Partially | YES |
| Filter rows/cells with formula | YES | Partially | YES |
| Filter invalid rows/cells | YES | Partially | YES |
| Filter rows/cells on numerical range | YES | YES | YES |
| Filter rows/cells on value | YES | YES | YES |
| Find and replace | YES | Partially | YES |
| Flag rows/cells on date range | YES | Partially | YES |
| Flag rows with formula | YES | Partially | YES |
| Flag invalid rows | YES | Partially | YES |
| Flag rows on numerical range | YES | YES | YES |
| Flag rows on value | YES | YES | YES |
| Fold multiple columns | YES | NO | YES |
| Fold multiple columns by pattern | YES | NO | YES |
| Fold object keys | YES | NO | YES |
| Formula | YES | Partially | YES |
| Fuzzy join with other dataset (memory-based) | YES | NO | YES |
| Generate Big Data | YES | NO | YES |
| Compute distance between geospatial objects | YES | Partially | YES |
| Extract from geo column | YES | Partially | YES |
| Geo-join | YES | NO | YES |
| Resolve GeoIP | YES | Snowflake only | YES |
| Create area around a geopoint | YES | NO | YES |
| Create GeoPoint from lat/lon | YES | NO | YES |
| Extract lat/lon from GeoPoint | YES | NO | YES |
| Extract with grok | YES | NO | YES |
| Flag holidays | YES | Snowflake only | YES |
| Split invalid cells into another column | YES | NO | YES |
| Join with other dataset (memory-based) | YES | NO | YES |
| Extract with JSONPath | YES | NO | YES |
| Group long-tail values | YES | NO | YES |
| Compute the average of numerical values | YES | NO | YES |
| Translate values using meaning | YES | NO | YES |
| Normalize measure | YES | Snowflake only | YES |
| Merge long-tail values | YES | NO | YES |
| Move columns | YES | YES | YES |
| Negate boolean value | YES | NO | YES |
| Force numerical range | YES | NO | YES |
| Generate numerical combinations | YES | NO | YES |
| Convert number formats | YES | NO | YES |
| Nest columns | YES | NO | YES |
| Unnest object (flatten JSON) | YES | NO | YES |
| Extract with regular expression | YES | Partially | YES |
| Pivot | YES | NO | YES |
| Python function | YES | NO | YES |
| Split HTTP Query String | YES | Snowflake only | YES |
| Remove rows where cell is empty | YES | YES | YES |
| Round numbers | YES | NO | YES |
| Simplify text | YES | Snowflake only | YES |
| Split and fold | YES | NO | YES |
| Split into chunks | YES | NO | YES |
| Split and unfold | YES | YES | YES |
| Split column | YES | YES | YES |
| Switch case | YES | NO | YES |
| Transform string | YES | Partially | YES |
| Tokenize text | YES | NO | YES |
| Transpose rows to columns | YES | NO | YES |
| Triggered unfold | YES | NO | YES |
| Unfold | YES | YES | YES |
| Unfold an array | YES | NO | YES |
| Convert a UNIX timestamp to a date | YES | Snowflake only | YES |
| Fill empty cells with previous/next value | YES | NO | YES |
| Split URL (into protocol, host, port, …) | YES | Snowflake only | YES |
| Classify User-Agent | YES | Snowflake only | YES |
| Generate a best-effort visitor id | YES | Snowflake only | YES |
| Zip JSON arrays | YES | NO | YES |
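To illustrate the point about Fold being expressible in SQL, here is a minimal sketch of the manual workaround: a small Python helper that generates a portable `UNION ALL` query folding several columns into key/value rows. The function name `fold_columns_sql`, the default output column names, and the example table are hypothetical. This is not how Dataiku implements its Fold processor, only one way to write the equivalent by hand in a SQL recipe.

```python
def fold_columns_sql(table, id_columns, fold_columns,
                     key_name="folded_key", value_name="folded_value"):
    """Generate a UNION ALL query that folds fold_columns into key/value rows.

    Identifiers are interpolated as-is, so they must come from a trusted
    source. Columns being folded should share a compatible type (cast them
    beforehand if they do not).
    """
    selects = []
    for col in fold_columns:
        # One SELECT per folded column: keep the id columns, emit the
        # column name as the key and the column content as the value.
        projected = list(id_columns) + [
            f"'{col}' AS {key_name}",
            f"{col} AS {value_name}",
        ]
        selects.append(f"SELECT {', '.join(projected)} FROM {table}")
    return "\nUNION ALL\n".join(selects)
```

For example, `fold_columns_sql("sales", ["id"], ["q1", "q2"])` yields one `SELECT` per folded column joined by `UNION ALL`. On Databricks the same result could likely be obtained more compactly with the Spark SQL `stack` generator, but the `UNION ALL` form runs on most databases.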
- 
            Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023, Circle Member Posts: 2,591 Neuron
            I stand corrected: visual recipes can translate to Spark, but this is a regular Spark cluster, not a Databricks compute cluster. As you have noted, some processors are not available in SQL, but as you have also said, these can be done in SQL, so in those cases you can use a SQL recipe to do it. A lot of visual recipes also allow you to use SQL expressions, which means you don't need to fall back to a full SQL recipe. It is perfectly possible to use Dataiku visual and code recipes against a Databricks SQL Warehouse and a Databricks compute cluster, with all your data staying in Databricks and never being loaded into memory on the DSS server. But you have to stick to visual recipes using the SQL engine, and Databricks Connect for Python recipes.
- 
            Yes, exactly. Currently, Dataiku has a Databricks integration that works 80% of the time, whereas a regular Spark cluster works 100% of the time. It would be great to have better Databricks integration, through either better SQL support or better Spark support, so that we don't have to resort to workarounds such as manually coding transformations (SQL recipes or Databricks Connect).
