Is it possible to write a dataset into S3 in the Delta Lake format? It is not clear from this link (https://doc.dataiku.com/dss/latest/connecting/formats/deltalake.html) whether only reading the Delta Lake format is supported, and not writing.
I tried both a sync recipe and a PySpark recipe (that simply copied input to output), and both jobs failed. I'm trying to understand whether this is simply unsupported or whether I have a configuration issue.
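For reference, the PySpark recipe was roughly the following (a minimal sketch using Dataiku's standard PySpark recipe pattern; the dataset names "input" and "output" are placeholders for my actual datasets):

    import dataiku
    from dataiku import spark as dkuspark
    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    sc = SparkContext.getOrCreate()
    sqlContext = SQLContext(sc)

    # Read the input dataset into a Spark DataFrame
    input_dataset = dataiku.Dataset("input")    # placeholder name
    df = dkuspark.get_dataframe(sqlContext, input_dataset)

    # Write the same DataFrame to the output dataset (an S3 dataset
    # configured with the Delta Lake format) -- this is the step that fails
    output_dataset = dataiku.Dataset("output")  # placeholder name
    dkuspark.write_with_schema(output_dataset, df)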
Hi @yashpuranik ,
Writing Delta format to S3 is not possible. You can only read Delta from S3 storage.
To write Delta, you need to use the Databricks JDBC connection:
https://doc.dataiku.com/dss/latest/connecting/sql/databricks.html
Thanks @AlexT. To confirm: even with this approach, I cannot write a dataset one-off into S3 in the Delta format? I can only use this JDBC connector if I have a Databricks environment set up?
To give more context, I am working with some large datasets (~TBs). Each dataset arrives as a collection of gz CSV files dropped into S3 buckets, and the file sizes vary wildly (some files are ~100 GB, others are ~KB). I am trying to determine the best way to organize the datasets and query them efficiently.
The options I am exploring are:
1. Converting the dataset into Parquet and querying it using the Spark engine
2. Converting the dataset into the Delta Lake format and using the Databricks executor engine on S3
I don't have a database system such as Snowflake available, so I need to find the optimal combination of data format and compute engine. (Reading the data as a Dataiku partitioned dataset is not a viable option given how the data is structured.)
Do you have any other ideas?
Option 1) sounds suitable: you can read Delta with DSS on S3, and you can write Parquet to S3.
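As a rough illustration, the conversion to Parquet could be done in a PySpark recipe along these lines (a minimal sketch; the s3a paths, CSV options, and partition count are assumptions you would need to adapt to your buckets and schema):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("csv-to-parquet").getOrCreate()

    # Spark reads gzip-compressed CSV transparently; path is a placeholder
    df = spark.read.csv("s3a://my-bucket/raw/*.csv.gz",
                        header=True, inferSchema=True)

    # Repartition to smooth out the skewed input file sizes,
    # then write evenly sized Parquet files back to S3
    df.repartition(200).write.mode("overwrite").parquet("s3a://my-bucket/parquet/")

One caveat: gzip files are not splittable, so each ~100 GB .gz file will be read by a single Spark task. Repartitioning after the read at least evens out the resulting Parquet files for efficient querying later.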