Write recipe outputs (rows in initial order)

mouste04
Level 1

Hello everyone,

I am working on a recipe that uses "write_with_schema" to write its outputs. I noticed that the rows in the output dataset are in a different order than expected. I tried adding .reset_index(): the index values are present in the output, but the rows are still not in the right order. In my case, the order of the rows is crucial.

Is there a way to get an output dataset whose rows are in the correct order, using "write_with_schema" or a similar function?

Thank you in advance

SarinaS
Dataiker

Hi @mouste04,

Can you let us know what type of output dataset (e.g. Snowflake, S3, filesystem) you are writing to? That is, for which dataset type do the output rows appear in a different order than in the input dataframe passed to write_with_schema()?

Thanks,
Sarina 

mouste04
Level 1
Author

Hello Sarina,

Both the input and output datasets are Snowflake datasets, and they are different datasets. A linkage matrix is calculated from the input, and that matrix is the output. For now, I have worked around the problem by creating an index for each row and then ordering the output dataset on it, but I was wondering if there is a better way to do that.

Thanks. 
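
The workaround described in the post above (an explicit index column, then ordering on it) might look like the following sketch. It assumes the recipe output is a pandas DataFrame; the column name `row_order`, the helper `add_row_order`, and the sample values are hypothetical and not taken from the thread.

```python
import pandas as pd

ORDER_COLUMN = "row_order"  # hypothetical column name, not from the thread


def add_row_order(df, column=ORDER_COLUMN):
    """Return a copy of df with an explicit 0-based position column in front.

    Unlike the DataFrame index, this is a regular column, so it is stored
    in the output table and can be used to sort the rows later.
    """
    out = df.reset_index(drop=True)  # new DataFrame; the input is not modified
    out.insert(0, column, list(range(len(out))))
    return out


# Made-up stand-in for the computed linkage matrix.
linkage = pd.DataFrame(
    {
        "cluster_a": [2, 0, 4],
        "cluster_b": [3, 1, 5],
        "distance": [0.1, 0.4, 0.9],
    }
)

ordered = add_row_order(linkage)
# In a recipe, `ordered` would be the dataframe passed to write_with_schema().
```

Because a SQL table has no inherent row order, the order column is what lets a later read or query restore the original sequence.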

SarinaS
Dataiker

Hi @mouste04,

I see! For SQL datasets, the order in which rows are read depends on the following Advanced dataset setting:

[Screenshot: Advanced dataset setting, 2023-06-14]

This can be applied to the input and output datasets. Does this help?

Thanks,
Sarina

mouste04
Level 1
Author

Thank you, this is helpful!

However, I was wondering if there is a more automated way to do that: something that can be done in the backend, or a function I can use instead of "write_with_schema" that does this automatically.

SarinaS
Dataiker

Hi @mouste04,

The dataset settings are the only place where the read ordering for a dataset can be set, but you can set that field automatically through the API. For example, you could do something like this:

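import dataiku

# Get a handle on the dataset (replace the placeholder names with your own)
client = dataiku.api_client()
project = client.get_project('PROJECT_KEY')
dataset = project.get_dataset('DATASET')

# Load the dataset settings and access the underlying raw dict
settings = dataset.get_settings()
raw_settings = settings.get_raw()

# Enable the default read ordering and sort ascending on one column
raw_settings['readWriteOptions']['defaultReadOrdering']['enabled'] = True
raw_settings['readWriteOptions']['defaultReadOrdering']['rules'] = [{'columnName': 'COLUMN_VALUE', 'asc': True}]

# Persist the modified settings
settings.save()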

 
Thanks,
Sarina
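
To reuse the settings change above across several datasets, the dictionary update could be wrapped in a small helper. This is only a sketch: the helper name `enable_read_ordering` is hypothetical, the keys are the ones used in the snippet above, and `setdefault` is used on the assumption that the nested keys may be absent from some settings dicts. It runs here on a plain dict, so it can be tried without a DSS instance.

```python
def enable_read_ordering(raw_settings, column, ascending=True):
    """Enable the default read ordering on `column` in a raw settings dict.

    The dict is modified in place and also returned. Key names are copied
    from the API snippet in this thread.
    """
    options = raw_settings.setdefault("readWriteOptions", {})
    ordering = options.setdefault("defaultReadOrdering", {})
    ordering["enabled"] = True
    ordering["rules"] = [{"columnName": column, "asc": ascending}]
    return raw_settings


# Stand-in for the dict returned by settings.get_raw().
raw = {"readWriteOptions": {}}
enable_read_ordering(raw, "row_order")  # "row_order" is a hypothetical column
```

In a project, the helper would presumably be called with the result of `settings.get_raw()`, followed by `settings.save()` as in the snippet above.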
