Write recipe outputs (rows in initial order)

mouste04 · 06-08-2023

Hello everyone,

I am working on a recipe where I use "write_with_schema" to write the recipe outputs. I noticed that the rows in the output dataset are in a different order than expected. I tried adding .reset_index(): the index values appear in the output, but the rows are still not in the right order. In my case, the order of the rows is crucial.

Is there a way to have an output dataset where the rows are in the correct order using "write_with_schema" or any similar function?

Thank you in advance

SarinaS · 06-14-2023

Hi @mouste04,

Can you let us know what type of output dataset (e.g. Snowflake, S3, filesystem) you are writing to, where the output rows appear in a different order than in the input dataframe passed to write_with_schema()?

Thanks,
Sarina

mouste04 · 06-14-2023

Hello Sarina,

Both the input and the output are Snowflake datasets, and they are different datasets. A linkage matrix is calculated from the input, and that matrix is the output. For now, I have created an index for each row and then sort the output dataset by it before proceeding, but I was wondering if there is a better way to do that.
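For anyone reading later, the index-then-sort workaround could look roughly like the sketch below. It only covers the pandas side; `add_row_order` and the `ROW_ORDER` column name are hypothetical names chosen for illustration, and the small dataframe is a toy stand-in for the linkage matrix.

```python
import pandas as pd


def add_row_order(df: pd.DataFrame, column: str = "ROW_ORDER") -> pd.DataFrame:
    """Return a copy of df with an explicit 0-based position column.

    ROW_ORDER is a hypothetical column name; pick one that does not
    clash with the existing columns.
    """
    if column in df.columns:
        raise ValueError(f"column {column!r} already exists")
    # reset_index(drop=True) returns a new dataframe, so df is left untouched.
    out = df.reset_index(drop=True)
    # Store the in-memory position as real data so it survives the write.
    out.insert(0, column, list(range(len(out))))
    return out


# Toy stand-in for the linkage matrix computed in the recipe.
linkage_df = pd.DataFrame({"left": [3, 0, 5], "right": [4, 2, 1]}, index=[10, 11, 12])
ordered_df = add_row_order(linkage_df)
```

The resulting dataframe is what would be passed to write_with_schema(). Since a SQL table has no inherent row order, the position has to be stored as a column and used for sorting whenever the data is read back.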

Thanks.

SarinaS · 06-15-2023

Hi @mouste04,

I see! For SQL datasets, the order in which rows are read depends on the read ordering configured in the dataset's Advanced settings.

This can be applied to the input and output datasets. Does this help?

Thanks,
Sarina

mouste04 · 06-15-2023

Thank you, this is helpful!

Though, I was wondering if there is a more automated way to do that: something that can be done in the backend, or some function I can use instead of "write_with_schema" that does it automatically.

SarinaS · 06-16-2023

Hi @mouste04,

You could set this field automatically through the API. Note, however, that the dataset settings are the only place where the read ordering for a dataset can be set, so the API approach updates that same setting. You could do something like this:

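import dataiku

# Connect to the DSS instance and look up the dataset to configure
client = dataiku.api_client()
project = client.get_project('PROJECT_KEY')
dataset = project.get_dataset('DATASET')

# Fetch the dataset settings and their raw dict form
settings = dataset.get_settings()
raw_settings = settings.get_raw()

# Enable the default read ordering and sort ascending on the chosen column
raw_settings['readWriteOptions']['defaultReadOrdering']['enabled'] = True
raw_settings['readWriteOptions']['defaultReadOrdering']['rules'] = [{'columnName': 'COLUMN_VALUE', 'asc': True}]

# Persist the modified settings
settings.save()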
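If you need to apply this to several datasets, the dict manipulation could be wrapped in a small helper. This is only a sketch: `set_default_read_ordering` is a hypothetical name, it assumes the raw settings use the same key layout as the snippet above, and it creates missing intermediate keys rather than raising a KeyError.

```python
def set_default_read_ordering(raw_settings, column, ascending=True):
    """Enable default read ordering on a raw dataset-settings dict.

    Only the keys used in the snippet above are touched; missing
    intermediate dicts are created instead of raising KeyError.
    """
    options = raw_settings.setdefault("readWriteOptions", {})
    ordering = options.setdefault("defaultReadOrdering", {})
    ordering["enabled"] = True
    ordering["rules"] = [{"columnName": column, "asc": ascending}]
    return raw_settings


# Plain dict standing in for settings.get_raw(); no DSS connection is needed here.
example = set_default_read_ordering({}, "ROW_ORDER")
```

In a recipe you would call it on the dict returned by settings.get_raw() and then call settings.save(), exactly as in the snippet above.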

Thanks,
Sarina
