I am trying to redispatch a discrete partition through the sync recipe, using the process described in this documentation. When I run it on the DSS engine, the recipe fails with the "Job process died (killed - maybe out of memory ?)" error. The Spark, Hive, and Impala engines all show the following warning: "Not available with redispatch partitioning". The Hive option was not greyed out, so I tried it anyway to see what would happen, but the recipe failed with the same "Job process died (killed - maybe out of memory ?)" error. I then ran the recipe on a single specific partition to check that I am identifying the partition correctly on the back end. That worked, so the partition selection itself is correct. Finally, I left the redispatch option unchecked, manually entered all of the individual partitions in the sync recipe (ex: var_1/var_2/..../var_n), and ran it on the Hive engine. This last approach worked.
Although the last approach works, I would like to ask:
1. Are there suggestions on how to better automate this process? With the current workaround we have to manually add any new partitions to the recipe options, or alternatively manually edit a global variable and create a scenario. This becomes tedious as the number of possible partitions grows, and it increases the chance of human error. To find all the partitions today, we create a separate group-by recipe on the column we want to partition by, download the resulting file, and build a string containing all the possible values. Is there a recommended best practice that would be more efficient and automated? Ideally we would like to use the redispatch option, but it keeps failing on the DSS engine and is not available on the other engines.
2. Will the Hive/Impala/Spark engines be made available for redispatch in the future? Since the Hive engine worked when we manually entered all the partitions, is there a way for Hive to behave the same way when the redispatch option is checked? We would ideally like to use the redispatch option with the Hive engine, if this is possible.
Any input would be greatly appreciated, thanks.
What is the approximate number of partitions that you have to redispatch? Redispatch works by creating one output writer for each partition it sees. If there is an extremely large number of partitions, this can require more memory than what is allocated for DSS engine jobs.
If your output dataset uses Parquet, you may want to try using CSV instead. Parquet requires extremely large amounts of memory to write, so redispatching to a large number of Parquet partitions is more likely to exceed memory allocation.
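To illustrate why memory use grows with the number of partitions, here is a toy model of the "one output writer per partition" behaviour described above. It is only a sketch and not how DSS actually implements redispatch: the function name `redispatch`, the sample rows, and the in-memory CSV buffers are all made up for illustration.

```python
import csv
import io


def redispatch(rows, partition_column):
    """Toy redispatch: route each row to a writer keyed by its partition value.

    Hypothetical illustration only, not the DSS implementation.
    """
    buffers = {}  # partition value -> in-memory output buffer
    writers = {}  # partition value -> csv writer, kept open until the input ends
    for row in rows:
        key = row[partition_column]
        if key not in writers:
            # First time this partition value is seen: open a new writer.
            buffers[key] = io.StringIO()
            writers[key] = csv.DictWriter(buffers[key], fieldnames=list(row.keys()))
            writers[key].writeheader()
        writers[key].writerow(row)
    # Every writer stays alive until here, so the number of open writers
    # equals the number of distinct partition values in the input.
    return {key: buf.getvalue() for key, buf in buffers.items()}


rows = [
    {"country": "FR", "amount": 10},
    {"country": "US", "amount": 20},
    {"country": "FR", "amount": 30},
]
outputs = redispatch(rows, "country")
print(len(outputs))  # 2 distinct partition values, so 2 writers were open
```

Because the partition values are not known in advance, no writer can be closed before the whole input has been read. If each writer carries a large buffer, as a columnar format typically does, the total footprint scales with the number of distinct values.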
We do not have immediate plans to make redispatch partitioning available on other engines, as that would require very significant work, but we will take your feedback into consideration.
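On the automation question, one possible way to avoid assembling the partition string by hand is to generate it from the distinct values of the partition column. The sketch below is an assumption-laden example, not an official recommendation: `build_partition_spec` is a hypothetical helper, and the `/` separator simply follows the `var_1/var_2/..../var_n` form used in the question.

```python
def build_partition_spec(values, separator="/"):
    """Build a partition list string from raw column values.

    Hypothetical helper: drops missing or empty values, removes duplicates,
    and sorts the result so the output is stable between runs.
    """
    cleaned = set()
    for value in values:
        if value is None:
            continue
        text = str(value).strip()
        if not text:
            continue
        if separator in text:
            # A value containing the separator would be read as two partitions.
            raise ValueError(f"partition value {text!r} contains {separator!r}")
        cleaned.add(text)
    return separator.join(sorted(cleaned))


# Sample values standing in for the distinct values of the partition column.
spec = build_partition_spec(["var_2", "var_1", "var_2", None, ""])
print(spec)  # var_1/var_2
```

In practice the input values could be read from the source dataset in a Python step of a scenario, and the resulting string stored in a variable that the sync recipe uses as its partition selection. Those steps are left out here because they depend on how your instance and project are set up.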