Writing to partitioned dataset using the writer

I'm trying to overwrite a table using data from another table (with the same schema). I keep running into the issue that both datasets are partitioned and the writer does not handle that (the same happens with the copy_to function).
Here is what I'm trying to do:

# Set the datasets
dataset_target = dataiku.Dataset(dataset, project_key=project_key_target)
dataset_source = dataiku.Dataset(dataset, project_key=project_key_source)

# Overwrite target dataset with source data
with dataset_target.get_writer() as writer:
    for p in dataset_source.list_partitions():
        dataset_source.read_partitions = [p]
        df = dataset_source.get_dataframe()
        dataset_target.set_write_partition(str(p))
        writer.write_dataframe(df)
    writer.close()

I'm getting the following error, even though I would think the writer has a partition because of the set_write_partition call:

ERROR:dataiku.core.dataset_write:Exception caught while writing
Traceback (most recent call last):
  File "/data/dataiku/install/dataiku-dss-12.3.1/python/dataiku/core/dataset_write.py", line 353, in run
    self.streaming_api.wait_write_session(self.session_id)
  File "/data/dataiku/install/dataiku-dss-12.3.1/python/dataiku/core/dataset_write.py", line 296, in wait_write_session
    raise Exception(u'An error occurred during dataset write (%s): %s' % (id, decoded_resp["message"]))
Exception: An error occurred during dataset write (D9uuBrAH9P): RuntimeException: A partition ID must be provided, because the dataset myproject.target_table is partitioned
Does anyone know how I could resolve this? I also thought about sidestepping the issue by removing the partitioning from both datasets and re-adding it after the copy, but I can imagine more going wrong there, so I would like to avoid that if possible.
Any help is appreciated, thanks in advance!
Best Answer
Figured it out!
The partition has to be set before the writer is created, i.e. set_write_partition must be called before get_writer. This means we can solve it like this:

# Set the datasets
dataset_target = dataiku.Dataset(dataset, project_key=project_key_target)
dataset_source = dataiku.Dataset(dataset, project_key=project_key_source)

# Overwrite target dataset with source data, one partition at a time
for p in dataset_source.list_partitions():
    dataset_source.read_partitions = [p]
    df = dataset_source.get_dataframe()
    dataset_target.set_write_partition(str(p))
    writer = dataset_target.get_writer()  # the writer picks up the partition set above
    writer.write_dataframe(df)
    writer.close()
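One note on the snippet above: if write_dataframe raises, the writer is never closed and the write session stays open. Since get_writer supports the with statement (as used in the original post), a slightly safer sketch of the same loop, reusing the dataset handles defined above, would be:

for p in dataset_source.list_partitions():
    dataset_source.read_partitions = [p]
    df = dataset_source.get_dataframe()
    dataset_target.set_write_partition(str(p))
    # Open the writer only after the partition is set; the with-block
    # closes it even if the write fails
    with dataset_target.get_writer() as writer:
        writer.write_dataframe(df)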
Answers
Hi @SanderVW,
Thanks for the solution, it's really helpful. I just have one small doubt: when we run this script, will it overwrite all the partitions or just append the new partitions to the current dataset? Let me explain with an example for better understanding. Let's say the loop below gives us 10 partitions:

for p in dataset_source.list_partitions():

After the code runs successfully, these 10 partitions are written from the source to the target dataset. Now let's say we run the code again after a month, and the source dataset has 12 partitions, all of them new. Will the script append those 12 new partitions, so the target dataset ends up with 22 partitions (the 10 old ones plus the 12 new ones), or will it overwrite the target so that it only has the 12 new ones?

It would be really helpful if you could help me figure this out. Thanks!
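For reference, this is how I would check the outcome after a run, assuming list_partitions works on the target dataset the same way it does on the source (a hypothetical check, not part of the original script):

# Compare the partition lists of source and target after the copy
print(sorted(dataset_source.list_partitions()))
print(sorted(dataset_target.list_partitions()))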