Added on April 2, 2025 11:11AM
Hi!
I'm currently testing out the possibilities for leveraging Spark in our ETL pipelines. My use case is that in 90% of cases I start with just raw text files in Azure Blob Storage (usually CSVs or TXTs). How can I plug in Spark to read and process those files? If I select the Spark engine or run a custom Spark recipe, I get the following error each time: WARNING WARN_SPARK_NON_DISTRIBUTED_READ: non-empty header in CSV not supported. Is there any workaround for this?
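For reference, this is the kind of custom PySpark recipe I'm attempting (just a minimal sketch following the standard DSS PySpark recipe pattern; the dataset names raw_src and parsed_out are illustrative placeholders for my Azure Blob datasets):

import dataiku
from dataiku import spark as dkuspark
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext.getOrCreate()
sqlContext = SQLContext(sc)

# Read the Azure Blob dataset (CSV/TXT files) as a Spark dataframe
raw_src = dataiku.Dataset("raw_src")  # illustrative name
df = dkuspark.get_dataframe(sqlContext, raw_src)

# ... transformations would go here ...

# Write the result back through a DSS-managed dataset
parsed_out = dataiku.Dataset("parsed_out")  # illustrative name
dkuspark.write_with_schema(parsed_out, df)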
All the best,
TP
Operating system used: Win 11
So is it possible to integrate DSS and Databricks, pushing down compute to Databricks but still reading and writing datasets using DSS' connections? ⇒ Absolutely!
There are some caveats of course, mostly around Prepare recipe processors and the special coding required to use Python recipes with pushdown into Databricks via DBConnect, but it works very well, I would say. So all you need is to create the Databricks JDBC connection in Dataiku and you can start tapping into those huge datasets. You don't even need to install the Databricks JDBC driver, as Dataiku already provides it.
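As a rough illustration of the pushdown (a sketch only; the connection name "databricks_jdbc" and the table are placeholders), once the JDBC connection exists you can send a query down to Databricks from Python and only pull the result back:

import dataiku
from dataiku import SQLExecutor2

# The query runs inside Databricks via the DSS connection;
# only the aggregated result comes back as a pandas dataframe
executor = SQLExecutor2(connection="databricks_jdbc")  # placeholder connection name
df = executor.query_to_df(
    "SELECT trip_id, AVG(speed) AS avg_speed FROM trips GROUP BY trip_id"
)

Plain SQL recipes in the Flow give you the same pushdown without any code.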
What is your actual requirement? Better performance?
Hi Turribeach! Thanks a lot for the quick answer :).
Yes. Basically I have some datasets of 50 Hz time-series data (so 50 rows per second), which equates to around 180 GB of data per month. I'm trying to establish whether processing such datasets in Dataiku is even possible.
That amount of data will usually not be suitable for DSS in-memory loading, which means you should be looking at pushdown approaches (as in pushing the data compute down into another platform). Spark is certainly an option but is really an outdated approach: using Spark requires you to manage a Spark cluster or set up Kubernetes and code in Spark, both of which are usually undesirable as they are complex and require upskilling and ongoing maintenance. My advice would be to look at modern database technologies that can handle large data volumes with ease, like Databricks, Snowflake or GCP's BigQuery. The first two are cloud agnostic, so they should be available to you in whichever cloud you are on. All of these use SQL as their engine, meaning you can leverage your user base's SQL skills while tapping into scalable compute platforms.
Yeah, so we currently have a legacy solution that uses Databricks to process the data. We started to look into Dataiku because it has some benefits in our case (mainly that users can easily use DSS to perform the transformations).
So is it possible to integrate DSS and Databricks, pushing down compute to Databricks but still reading and writing datasets using DSS' connections? Like in my example, can I compute the middle part (dark-red recipe) in Databricks?
Hi @Turribeach, coming back to this because I'm somewhat stuck. I created the connection to Databricks and it works when I test it, so everything should be fine. But after I run this in a notebook:
import dataiku
from dataiku.dbconnect import DkuDBConnect

# Open the Databricks Connect session managed by DSS
dbc = DkuDBConnect()

# Read the input dataset as a dataframe through DBConnect
src = dataiku.Dataset("src_month")
df_trips = dbc.get_dataframe(dataset=src)
Then I get this error: ValueError: The connection is not a Databricks connection
This is exactly as documented here (Databricks — Dataiku DSS 13 documentation), but I still think I'm missing something. Is this the right way to copy the data to Databricks and push the compute down to the cluster? And shouldn't I actually specify the connection name somewhere?
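For what it's worth, here's a quick way to double-check which connection the dataset is actually stored on (just a sketch using the public API from inside DSS; the exact keys under params can vary by dataset type):

import dataiku

client = dataiku.api_client()
project = client.get_default_project()

# Inspect the dataset's storage settings to see which connection backs it
raw = project.get_dataset("src_month").get_settings().get_raw()
print(raw["type"], raw["params"].get("connection"))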
Thanks!
Please start a new thread, as this is a different issue than your original post. If we continue with different issues in one thread it becomes a mess.