Writing a dataset in Delta Lake format

yashpuranik

Is it possible to write a dataset to S3 in the Delta Lake format? It is not clear from this link (https://doc.dataiku.com/dss/latest/connecting/formats/deltalake.html) whether only reading the Delta Lake format is supported, and not writing.

I tried both a sync recipe and a PySpark recipe (one that simply copied input to output), and both jobs failed. I am trying to understand whether this is simply unsupported or whether I have a configuration issue.



Answers

  • yashpuranik

    Thanks @AlexT. To confirm: even with this approach, I cannot write a dataset one-off into S3 in the Delta format? I can only use this JDBC connector if I have a Databricks environment set up?

    To give more context, I am working with some large datasets (~TBs). Each dataset arrives as a collection of gzipped CSV files dropped in S3 buckets, and the file sizes are wildly uneven (some files are ~100 GB, others only a few KB). I am trying to determine the best way to organize these datasets and query them efficiently.

    The options I am exploring are:

    1. Converting the dataset into Parquet and querying it using the Spark engine

    2. Converting the dataset into the Delta Lake format and using the Databricks executor engine on S3

    I don't have a database system such as Snowflake available, so I need to find the optimal combination of data format and compute engine. (Reading the data as a Dataiku partitioned dataset is not a viable option, given how the data is structured.)

    Do you have any other ideas?

  • Alexandru

    Option 1) sounds suitable: you can read Delta with DSS on S3, and you can write Parquet to S3.
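
    The conversion in Option 1) can be sketched outside DSS as well. Below is a minimal, hedged example using pandas with pyarrow (assumed to be installed); the file names are made up for illustration. At multi-TB scale you would do the same thing with the Spark engine writing to an `s3a://` path instead of a local file, but the shape of the operation is the same.

    ```python
    # Sketch: convert a gzipped CSV (like the S3 drops described above)
    # into Parquet. Assumes pandas and pyarrow are installed; file names
    # are hypothetical stand-ins for the real S3 objects.
    import gzip
    import pandas as pd

    # Create a tiny .csv.gz stand-in for one of the incoming files.
    with gzip.open("sample.csv.gz", "wt") as f:
        f.write("id,value\n1,a\n2,b\n")

    # pandas decompresses .gz transparently based on the extension.
    df = pd.read_csv("sample.csv.gz")

    # Write columnar Parquet, which Spark can read and split efficiently.
    df.to_parquet("sample.parquet", index=False)
    ```

    Because Parquet is columnar and splittable, repacking the uneven gz CSV files into reasonably sized Parquet files should also smooth out the 100 GB vs. few-KB skew when Spark plans its tasks.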
