sync partitioned dataset to GCS

Meirkhan Registered Posts: 20 ✭✭✭✭
edited July 16 in Using Dataiku

Hello dataikers!

I have a flow like the following: Postgres dataset -> sync to GCS -> sync to BQ

Everything is non-partitioned so far.

The dataset syncs to GCS easily enough, but since it is quite large, the GCS to BigQuery sync fails: BigQuery throws an error because of its single-file size limit -

Error while reading data, error message: Input CSV files are not splittable and at least one of the files is larger than the maximum allowed size. Size is: 16992300985. Max allowed size is: 4294967296

So what I need to do is write to GCS with partitions.

What I tried was:

import dataikuapi

# assumes host / api_key / project_key are already defined for your DSS instance
client = dataikuapi.DSSClient(host, api_key)
dataiku_project = client.get_project(project_key)

gcs_dataset = dataiku_project.get_dataset(dataset_name)

# fetch the dataset definition and add a partitioning scheme to it
gcs_def = gcs_dataset.get_definition()
gcs_def['partitioning'] = {
    'ignoreNonMatchingFile': False,
    'considerMissingRequestedPartitionsAsEmpty': False,
    'dimensions': [{'name': 'partition_id', 'type': 'value'}]
}
gcs_dataset.set_definition(gcs_def)

This kind of lets me partition the dataset, but I couldn't find out how to set the partition dimensions afterwards.

And when I set them manually, I get the following error:

Job failed: Root path does not exist: Root path of the dataset does not exist

So, my questions:

1. Is it possible to do what I want via API? (I don't need GUI solutions)

2. If yes, is it possible to do so without partitioning the input dataset (Postgres)?

3. If yes and yes, how do I do it?

Thanks in advance!

Best Answer

  • fchataigner2 Dataiker Posts: 355 Dataiker
    Answer ✓

    Hi

    To make partitioning work, you need to pass a `filePathPattern` too, something like "/%{partition_id}/.*". Then, since the PG->GCS sync goes from a non-partitioned dataset to a partitioned one, you need to set `redispatchPartitioning` to true in the Sync recipe's definition.

    Finally, you can partition the output BQ dataset too, in which case you'll need to run the job by specifying the list of values for partition_id. Or you can leave it non-partitioned and set the partition dependency of the GCS->BQ sync to "all available". Note that in that case you need to build the GCS dataset first, then the BQ one; you can't do both in the same job, or DSS will have trouble listing the partitions of the GCS dataset.
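
    For reference, here is a minimal sketch of how those settings could be applied through the public Python API, building on the snippet in the question. The dataset names, the recipe name ("sync_pg_to_gcs"), and the assumption that redispatchPartitioning sits at the top level of the Sync recipe's JSON payload are for illustration only, not verified specifics:

        import dataikuapi

        client = dataikuapi.DSSClient(host, api_key)   # assumes host / api_key are defined
        project = client.get_project(project_key)

        # 1. Partitioning scheme on the GCS dataset, *including* filePathPattern
        gcs = project.get_dataset("gcs_dataset_name")  # hypothetical dataset name
        gcs_def = gcs.get_definition()
        gcs_def['partitioning'] = {
            'ignoreNonMatchingFile': False,
            'considerMissingRequestedPartitionsAsEmpty': False,
            'filePathPattern': '/%{partition_id}/.*',
            'dimensions': [{'name': 'partition_id', 'type': 'value'}]
        }
        gcs.set_definition(gcs_def)

        # 2. Enable redispatch on the PG -> GCS Sync recipe; assuming the flag
        #    lives at the top level of the recipe's JSON payload
        recipe = project.get_recipe("sync_pg_to_gcs")  # hypothetical recipe name
        settings = recipe.get_settings()
        payload = settings.get_json_payload() or {}
        payload['redispatchPartitioning'] = True
        settings.set_json_payload(payload)
        settings.save()

        # 3. Build the GCS dataset first, then the BQ one, as two separate jobs.
        #    With redispatch on, requesting any one partition value should write
        #    all of the partitions.
        project.new_job("NON_RECURSIVE_FORCED_BUILD") \
               .with_output("gcs_dataset_name", partition="dummy") \
               .start_and_wait()
        project.new_job("NON_RECURSIVE_FORCED_BUILD") \
               .with_output("bq_dataset_name") \
               .start_and_wait()

    The two builds are kept as separate jobs on purpose, per the note above about DSS needing to list the GCS partitions before the GCS->BQ sync runs.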
