sync partitioned dataset to GCS

Solved!
Meirkhan
Level 3

Hello dataikers!

 

I have a flow like the following: Postgres dataset -> sync to GCS -> sync to BigQuery

Everything is non-partitioned so far.

 

Syncing the dataset to GCS works fine, but since the data is pretty massive, the sync from GCS to BigQuery fails because one of the exported files exceeds the single-file size limit:

Error while reading data, error message: Input CSV files are not splittable and at least one of the files is larger than the maximum allowed size. Size is: 16992300985. Max allowed size is: 4294967296

 

So, what I need to do is write to GCS with partitions.

What I tried was:

 

# dataiku_project is assumed to be a dataikuapi DSSProject handle
gcs_dataset = dataiku_project.get_dataset(dataset_name)
gcs_def = gcs_dataset.get_definition()

# add a single discrete ("value") partition dimension to the GCS dataset
gcs_def['partitioning'] = {
    'ignoreNonMatchingFile': False,
    'considerMissingRequestedPartitionsAsEmpty': False,
    'dimensions': [{'name': 'partition_id', 'type': 'value'}]
}
gcs_dataset.set_definition(gcs_def)

 

This sort of allows me to partition the dataset, but I couldn't find how to set the partition dimensions afterwards.

 

And when I set them manually, I get the following error:

Job failed: Root path does not exist: Root path of the dataset does not exist

 

So, my questions:

1. Is it possible to do what I want via the API? (I don't need GUI solutions.)

2. If yes, is it possible to do so without partitioning the input (Postgres) dataset?

3. If yes to both, how?

 

Thanks in advance!

3 Replies
fchataigner2
Dataiker

Hi

To make partitioning work, you need to pass a `filePathPattern` too, something like "/%{partition_id}/.*". Then, since the PG -> GCS sync goes from a non-partitioned dataset to a partitioned one, you need to set `redispatchPartitioning` to true in the Sync recipe's definition.

Finally, you can either partition the output BQ dataset too, in which case you'll need to run the job with an explicit list of values for partition_id, or leave it non-partitioned and set the partition dependency of the GCS -> BQ sync to "all available". Note that in the latter case you need to build the GCS dataset first and then the BQ one; you can't do both in the same job, or DSS will have trouble listing the partitions in the GCS dataset.
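For reference, here is a rough sketch of how those two steps could look through the public Python API (dataikuapi). The host, API key, project key, dataset name and recipe name are placeholders, and the assumption that `redispatchPartitioning` sits at the top level of the Sync recipe's JSON payload is mine, not something confirmed above, so check the recipe's payload in your instance first:

import dataikuapi

client = dataikuapi.DSSClient("https://dss.example.com", "my_api_key")  # placeholders
project = client.get_project("MY_PROJECT")

# 1. Partition the GCS dataset and give it a file path pattern
gcs_dataset = project.get_dataset("my_gcs_dataset")
gcs_def = gcs_dataset.get_definition()
gcs_def["partitioning"] = {
    "ignoreNonMatchingFile": False,
    "considerMissingRequestedPartitionsAsEmpty": False,
    "dimensions": [{"name": "partition_id", "type": "value"}],
    # one sub-folder per partition value, as suggested above
    "filePathPattern": "/%{partition_id}/.*",
}
gcs_dataset.set_definition(gcs_def)

# 2. Enable partition redispatch on the PG -> GCS Sync recipe
sync_recipe = project.get_recipe("sync_pg_to_gcs")  # placeholder recipe name
settings = sync_recipe.get_settings()
payload = settings.get_json_payload()
payload["redispatchPartitioning"] = True  # assumed payload key, see note above
settings.set_json_payload(payload)
settings.save()

After that, the GCS dataset would be built first and the BQ dataset in a separate job, as noted above.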

 

Meirkhan
Level 3
Author

.

Meirkhan
Level 3
Author

Thanks for the response, you pointed me in the right direction.

However, I am having trouble modifying the Sync recipe.

 

I opened another topic for this issue here.
