sync partitioned dataset to GCS

Meirkhan
Meirkhan Registered Posts: 20 ✭✭✭✭
edited July 16 in Using Dataiku

Hello dataikers!

I have a flow which is like following: Postgres dataset -> sync to GCS -> sync to BQ

Everything is non-partitioned so far.

Dataset syncs to GCS pretty easily, but since size is pretty massive, while syncing from GCS to BigQuery

BigQuery throws error due to GCS single file limit -

Error while reading data, error message: Input CSV files are not splittable and at least one of the files is larger than the maximum allowed size. Size is: 16992300985. Max allowed size is: 4294967296

So, what I need to do is - write to GCS with partitions.

What I tried was:

gcs_dataset = dataiku_project.get_dataset(dataset_name)

gcs_def = gcs_dataset.get_definition()

gcs_def['partitioning'] = {
    'ignoreNonMatchingFile': False,
    'considerMissingRequestedPartitionsAsEmpty': False,
    'dimensions': [{'name': 'partition_id', 'type': 'value'}]
}
gcs_dataset.set_definition(gcs_def)

which kinda allows me to partition the dataset, but I couldn't find how to set partition dimensions afterwards.

And when I set them manually, I get following error:

Job failed: Root path does not exist: Root path of the dataset does not exist

So, my questions:

1. Is it possible to do what I want via API? (I don't need GUI solutions)

2. If yes, is it possible to do so without partitioning input dataset (Postgres)

3. If yes and yes, how to do so?

Thanks in advance!

Best Answer

  • fchataigner2
    fchataigner2 Dataiker Posts: 355 Dataiker
    Answer ✓

    Hi

    to make partitioning work, you need to pass a `filePathPattern` pattern too, something like "/%{partition_id}/.*" . Then since you're doing a non-partitioned to partitioned sync (the PG->GCS one), you need to set `redispatchPartitioning` to true in the Sync recipe's definition. Finally, you can partition the output BQ dataset too, in which case you'll need to run the job by specifying the list of values for partition_id, or you can leave it non-partitioned and setup the partition dependency on the GCS->BQ sync to all_available (note that in that case you need to build the GCS dataset first, then the BQ one; you can't do both in the same job or DSS will have troubles listing the partitions in the GCS dataset)

Answers

Setup Info
    Tags
      Help me…