Writing a dataset in delta lake format

Solved!
yashpuranik

Is it possible to write a dataset to S3 in the Delta Lake format? It is not clear from this link (https://doc.dataiku.com/dss/latest/connecting/formats/deltalake.html) whether only reading the Delta Lake format is supported, and not writing.

I tried a Sync recipe and a PySpark recipe (that simply copied the input to the output), and both jobs failed. I'm trying to understand whether it's simply unsupported or whether I have a configuration issue.
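
Roughly, the PySpark recipe was just the standard copy template, something like this (dataset names are placeholders):

    # Standard DSS PySpark copy recipe: read the input dataset and write it
    # unchanged to the output dataset. "input_s3" / "output_delta" are placeholders.
    import dataiku
    from dataiku import spark as dkuspark
    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    sc = SparkContext.getOrCreate()
    sqlContext = SQLContext(sc)

    input_dataset = dataiku.Dataset("input_s3")
    df = dkuspark.get_dataframe(sqlContext, input_dataset)

    output_dataset = dataiku.Dataset("output_delta")   # S3 output set to the Delta Lake format
    dkuspark.write_with_schema(output_dataset, df)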

3 Replies
AlexT
Dataiker

Hi @yashpuranik ,
Writing Delta format to S3 is not possible. You can only read Delta from S3 storage.
To write Delta, you need to use a Databricks JDBC connection:
https://doc.dataiku.com/dss/latest/connecting/sql/databricks.html
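
Once a Databricks connection is configured, writing works like any other dataset on that connection. A minimal Python-recipe sketch (dataset names are placeholders; managed tables on Databricks are stored in Delta format by default):

    # Copy an S3 dataset into a dataset created on a Databricks connection.
    # Dataset names below are placeholders.
    import dataiku

    input_ds = dataiku.Dataset("my_s3_input")
    df = input_ds.get_dataframe()

    output_ds = dataiku.Dataset("my_databricks_output")  # lives on the Databricks connection
    output_ds.write_with_schema(df)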

yashpuranik
Author

Thanks @AlexT. To confirm: even with this approach, I cannot write a dataset one-off to S3 in the Delta format? I can only use this JDBC connector if I have a Databricks environment set up?

To give more context, I am working with some large datasets (on the order of TBs). Each dataset arrives as a collection of gzipped CSV files dropped into S3 buckets, with very uneven file sizes (some files are ~100 GB, others only a few KB). I am trying to determine the best way to organize the datasets and query them efficiently.

The options I am exploring are:

1. Converting the dataset to Parquet and querying it with the Spark engine

2. Converting the dataset to the Delta Lake format and using the Databricks executor engine on S3

I don't have a database system such as Snowflake available, so I am trying to find the best combination of data format and compute engine. (Reading the data as a Dataiku partitioned dataset is not a viable option given how the data is structured.)

Do you have any other ideas?

AlexT
Dataiker

Option 1) sounds suitable then: you can read Delta with DSS on S3, and you can write Parquet to S3.
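
In DSS you would typically just set the output dataset's format to Parquet on the S3 connection and use a Sync or Spark recipe for the conversion. For reference, the equivalent conversion in plain Spark looks roughly like this (paths, read options, and the partition count are illustrative; since gzipped CSV is not splittable, repartitioning after reading helps even out the skewed file sizes):

    # Rough sketch of the CSV-to-Parquet conversion in plain Spark;
    # paths, read options and the partition count are illustrative only.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("csv_to_parquet").getOrCreate()

    df = (spark.read
          .option("header", "true")
          .option("inferSchema", "true")
          .csv("s3a://my-bucket/raw/*.csv.gz"))          # hypothetical input path

    (df.repartition(200)                                 # even out the skewed .gz files
       .write
       .mode("overwrite")
       .parquet("s3a://my-bucket/parquet/my_dataset/"))  # hypothetical output path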
