Partitioning and chunking with python recipe

MRvLuijpen
MRvLuijpen Partner, L2 Admin, L2 Designer, Dataiku DSS Core Designer, Dataiku DSS & SQL, Dataiku DSS ML Practitioner, Dataiku DSS Core Concepts, Neuron 2020, Neuron, Dataiku DSS Adv Designer, Registered, Dataiku DSS Developer, Neuron 2021, Neuron 2022, Frontrunner 2022 Finalist, Frontrunner 2022 Winner, Frontrunner 2022 Participant, Neuron 2023 Posts: 107 Neuron

Hello dear community,

I have been experimenting with both partitioning and chunking of datasets, and have now run into a problem.

My input dataset is partitioned, but each partition still contains a lot of records, so I want to process the partitions in chunks with a Python recipe.

In the documentation at https://doc.dataiku.com/dss/latest/python-api/datasets.html I found how to write a dataset in chunks, and I adjusted the example slightly so that the schema is written on the first chunk. However, this does not seem to work correctly in combination with partitioning, because the script is then run 5 times in parallel.
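For readers following along, the chunked pattern described above can be sketched roughly as follows. This is a minimal sketch, not the code from the original post: the helper name `process_in_chunks` and the `transform` argument are hypothetical, and the dataset objects are assumed to expose `iter_dataframes`, `write_schema_from_dataframe` and `get_writer` in the way `dataiku.Dataset` is described on the linked documentation page. The schema is set once, from the first processed chunk, before the writer is opened.

```python
def process_in_chunks(input_dataset, output_dataset, transform, chunksize=10000):
    """Stream input_dataset through transform and write the result chunk by chunk.

    Both dataset arguments are assumed to behave like dataiku.Dataset
    (iter_dataframes, write_schema_from_dataframe, get_writer).
    Returns the number of chunks written.
    """
    # Lazily transform each chunk so only one chunk is held in memory at a time.
    chunks = (transform(df) for df in input_dataset.iter_dataframes(chunksize=chunksize))

    first = next(chunks, None)
    if first is None:
        return 0  # empty input: leave the output untouched

    # Set the output schema once, from the first processed chunk,
    # before opening the writer.
    output_dataset.write_schema_from_dataframe(first)

    written = 0
    with output_dataset.get_writer() as writer:
        writer.write_dataframe(first)
        written += 1
        for chunk in chunks:
            writer.write_dataframe(chunk)
            written += 1
    return written


# Inside a DSS Python recipe this might be called as (dataset names are placeholders):
# import dataiku
# process_in_chunks(dataiku.Dataset("input"), dataiku.Dataset("output"), lambda df: df)
```

Note that with a partitioned output, every partition run executes this script, so each run sets the schema again. If the runs infer different schemas, that could conflict on a shared output table; this is an assumption, not something confirmed in this thread.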

Thanks in advance...

Answers

  • MRvLuijpen
    By uncommenting the lines and rerunning the recipe several times, I was able to get rid of the warning. However, it does not really feel like the correct way of working.
  • Clément_Stenac
    Clément_Stenac Dataiker, Dataiku DSS Core Designer, Registered Posts: 753 Dataiker

    Hi,

    I don't think we understand what the exact issue is. What is the warning you encountered? Note that you can prevent parallel executions of the recipe in partitioning mode by setting the parallelism limit in the Advanced settings of the recipe.

  • MRvLuijpen

    The error message was: "table already exists but with an incompatible schema:"

    In the run with 72 partitions, 2 failed while 70 ran without errors (see the attached screenshot).

    I have included part of the log file for both a successful and a failed run.
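One possible cause of an "incompatible schema" error on only a few partitions, which this thread does not confirm, is that some chunks or partitions infer different column types than others (for example, a column that is entirely empty in one partition). A small diagnostic sketch for spotting such drift is shown below; the helper name `find_schema_drift` is hypothetical, and each schema is assumed to be a list of `(column_name, type_name)` pairs collected by the caller.

```python
def find_schema_drift(chunk_schemas):
    """Return (index, schema) pairs for chunks whose schema differs from the first chunk.

    chunk_schemas is an iterable of schemas, each a sequence of
    (column_name, type_name) pairs in column order.
    """
    reference = None
    drifted = []
    for index, schema in enumerate(chunk_schemas):
        schema = list(schema)
        if reference is None:
            reference = schema  # the first chunk defines the expected schema
        elif schema != reference:
            drifted.append((index, schema))
    return drifted


# With pandas chunks, a schema can be collected as:
#   [(column, str(dtype)) for column, dtype in df.dtypes.items()]
```

Running this over the chunks of a failing partition and a succeeding one would show whether the inferred types actually differ between them.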
