Writing a dataset in Delta Lake format
Is it possible to write a dataset to S3 in Delta Lake format? It is not clear from this link (https://doc.dataiku.com/dss/latest/connecting/formats/deltalake.html) whether only reading the Delta Lake format is supported, and not writing.
I tried a Sync recipe and a PySpark recipe (that simply copied input to output), and both jobs failed. I am trying to understand whether this is simply unsupported or whether I have a configuration issue.
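For reference, the PySpark recipe was just the standard Dataiku read/write skeleton, along these lines (dataset names are placeholders):

```python
import dataiku
import dataiku.spark as dkuspark
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext.getOrCreate()
sqlContext = SQLContext(sc)

# Read the input dataset as a Spark DataFrame
input_ds = dataiku.Dataset("input_dataset")
df = dkuspark.get_dataframe(sqlContext, input_ds)

# Write it unchanged to the output dataset (this is the step
# that fails when the output is a Delta-format dataset on S3)
output_ds = dataiku.Dataset("output_dataset")
dkuspark.write_with_schema(output_ds, df)
```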
Best Answer
-
Alexandru Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,226 Dataiker
Hi @yashpuranik,
Writing Delta format to S3 is not possible; you can only read Delta from S3 storage.
To write Delta, you need to use the Databricks JDBC connection:
https://doc.dataiku.com/dss/latest/connecting/sql/databricks.html
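Once a Databricks connection is configured, writing is the same as writing to any other DSS dataset. For example, in a Python recipe (dataset names are hypothetical):

```python
import dataiku

# The input can be anything DSS can read (e.g. your S3 dataset);
# the output dataset must be created on the Databricks connection,
# where Databricks stores it as a Delta table.
src = dataiku.Dataset("s3_input")
dst = dataiku.Dataset("databricks_output")

dst.write_with_schema(src.get_dataframe())
```

Note that get_dataframe() pulls the data through pandas in memory, so for large data you would use a visual Sync recipe or the Spark engine instead.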
Answers
-
yashpuranik Partner, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2022, Neuron 2023 Posts: 69 Neuron
Thanks @AlexT. To confirm: even with this approach, I cannot write a dataset one-off to S3 in Delta format? I can only use this JDBC connector if I have a Databricks environment set up?

To give more context, I am working with some large datasets (~TBs). Each dataset arrives as a collection of gzipped CSV files dropped in S3 buckets, with very uneven sizes (some files are ~100GB, others only a few KB). I am trying to determine the best way to organize these datasets and query them efficiently.
The options I am exploring are:
1. Converting the dataset into Parquet and querying it with the Spark engine (a rough sketch of the conversion follows below)
2. Converting the dataset into Delta Lake format and using the Databricks executor engine on S3
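To illustrate option 1, this is roughly the conversion I have in mind, written as standalone PySpark for clarity (bucket paths and the partition count are placeholders, not our real values):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Spark decompresses .csv.gz files transparently on read
df = spark.read.option("header", "true").csv("s3a://my-bucket/raw/")

# Repartitioning evens out the skew between the ~100GB and ~KB source
# files so the Parquet output lands in uniformly sized chunks
df.repartition(200).write.mode("overwrite").parquet("s3a://my-bucket/parquet/")
```

In DSS, the same effect would come from a Sync or PySpark recipe whose output is an S3 dataset stored in Parquet format.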
I don't have a database system such as Snowflake available, so I am trying to find the optimal combination of data format and compute engine. (Reading the data as a Dataiku partitioned dataset is not a viable option given how the data is structured.)
Do you have any other ideas?
-
Alexandru Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,226 Dataiker
Option 1) sounds suitable: you can read Delta with DSS on S3, and you can write Parquet to S3.