Partitioning and chunking with python recipe

MRvLuijpen · ‎02-21-2020

Hello dear community,

I have been experimenting with both partitioning and chunking datasets, and have run into a problem.

My input dataset is partitioned, but each partition still contains a lot of records, so I want to process the partitions in chunks in a Python recipe.

In the documentation at https://doc.dataiku.com/dss/latest/python-api/datasets.html I found how to write a dataset in chunks, and I adjusted the example slightly so that the schema is written only on the first chunk. However, this does not seem to work correctly in combination with partitioning, because the script is then run 5 times in parallel.
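For reference, the pattern described above can be sketched roughly as follows. This is a minimal sketch, not the exact code from the recipe: the helper names `process_in_chunks` and `run_recipe`, the dataset names `my_input` and `my_output`, and the chunk size are all hypothetical. The chunk loop takes the reader, writer and schema function as arguments so it can be tried outside DSS; `run_recipe` shows how it might be wired to `dataiku.Dataset` inside a Python recipe. Whether the schema can be set while the writer is already open may depend on the DSS version, so treat that part as an assumption.

```python
def process_in_chunks(chunks, writer, write_schema, transform=lambda df: df):
    """Stream chunks through `transform`, declaring the output schema once.

    chunks       -- iterable of dataframes, e.g. Dataset.iter_dataframes(chunksize=...)
    writer       -- object exposing write_dataframe(df), e.g. Dataset.get_writer()
    write_schema -- callable taking a dataframe, e.g. Dataset.write_schema_from_dataframe
    Returns the number of chunks written.
    """
    schema_written = False
    n_chunks = 0
    for chunk in chunks:
        result = transform(chunk)
        if not schema_written:
            # Only the first chunk defines the output schema.
            write_schema(result)
            schema_written = True
        writer.write_dataframe(result)
        n_chunks += 1
    return n_chunks


def run_recipe(input_name="my_input", output_name="my_output", chunksize=100000):
    """Hypothetical wiring for a DSS Python recipe (dataset names are placeholders)."""
    import dataiku  # only available inside DSS, so imported lazily

    inp = dataiku.Dataset(input_name)
    out = dataiku.Dataset(output_name)
    with out.get_writer() as writer:
        return process_in_chunks(
            inp.iter_dataframes(chunksize=chunksize),
            writer,
            out.write_schema_from_dataframe,
        )
```

With a partitioned recipe, each partition runs this code as its own job, so every job declares the schema from its own first chunk. That is the step that several parallel runs all perform against the same output.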

Thanks in advance...

MRvLuijpen · ‎02-21-2020

By uncommenting the lines and rerunning the recipe several times, I was able to get rid of the warning. However, it does not really feel like the correct way of working.

Clément_Stenac · ‎02-21-2020

Hi,

I don't think we understand what the exact issue is. What warning did you encounter? Note that you can prevent parallel executions of the recipe in partitioning mode by setting the parallelism limit in the Advanced settings of the recipe.

MRvLuijpen · ‎02-21-2020

The error message was: "table already exists but with an incompatible schema:"

In the run with 72 partitions, 2 failed, while the other 70 ran without errors (see attached screenshot).

I have included part of the log file for both a successful and a failed run.
