
Writing a dataset in delta lake format

Solved!
yashpuranik

Is it possible to write a dataset to S3 in Delta Lake format? It is not clear from this link (https://doc.dataiku.com/dss/latest/connecting/formats/deltalake.html) whether only reading the Delta Lake format is supported, and not writing.

I tried a Sync recipe and a PySpark recipe (one that simply copied input to output), and both jobs failed. I am trying to understand whether this is simply unsupported or whether I have a configuration issue.

1 Solution
AlexT
Dataiker

Hi @yashpuranik ,
Writing Delta format to S3 is not possible; you can only read Delta from S3 storage.
To write Delta, you need to use the Databricks JDBC connection:
https://doc.dataiku.com/dss/latest/connecting/sql/databricks.html


3 Replies
yashpuranik
Author

Thanks @AlexT. To confirm, even with this approach, I cannot write a dataset one-off into S3 in the delta format? I can only use this JDBC connector if I have a Databricks environment set up?

To give more context, I am working with some large datasets (TB scale). Each dataset arrives as a collection of gzipped CSV files dropped into S3 buckets, with wildly uneven file sizes (some files are ~100 GB, others only a few KB). I am trying to determine the best way to organize these datasets and query them efficiently.

The options I am exploring are:

1. Converting the dataset to Parquet and querying it with the Spark engine

2. Converting the dataset to Delta Lake format and using the Databricks executor engine on S3

I don't have a database system such as Snowflake available, so I need to find the optimal combination of data format and compute engine. (Reading the data as a Dataiku partitioned dataset is not an option, given how the data is structured.)

Do you have any other ideas?
AlexT
Dataiker

Option 1 sounds suitable: you can read Delta with DSS on S3, and you can write Parquet to S3.
