I am trying to redispatch a discrete partition through the sync recipe, using the process described in this documentation. Running on the DSS engine, the recipe fails with the "Job process died (killed - maybe out of memory ?)" error. The Spark, Hive, and Impala engines all show the warning "Not available with redispatch partitioning". The Hive option was not greyed out, so I selected it anyway to see what would happen, and the recipe failed with the same "Job process died (killed - maybe out of memory ?)" error. To validate that I am identifying the partition correctly on the back end, I ran the recipe on a single specific partition, which worked. I then left the redispatch option unchecked, manually entered all of the individual partitions in the sync recipe (ex: var_1/var_2/..../var_n), and ran it on the Hive engine. This last approach worked.
Although the last approach works, I would like to ask:
1. Are there suggestions on how to better automate this process? With the current workaround we have to manually add any new partitions to the recipe options, or alternatively manually edit a global variable and create a scenario. This becomes tedious as the number of possible partitions grows, and it increases the chance of human error. To get the full list of partitions today, we create a separate group by recipe on the column we want to partition by, download the result, and build a string containing all the possible values. Is there a best practice we should follow instead that would be more efficient and automated? Ideally we would like to use the redispatch option, but it keeps failing on the DSS engine and does not behave as expected with the other engines.
2. Will the Hive/Impala/Spark engines be made available for redispatch in the future? Since the Hive engine worked when all the partitions were entered manually, is there a way for Hive to behave the same way when the redispatch option is checked? We would ideally like to use the redispatch option with the Hive engine if this is possible.
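For question 1, one possible way to avoid building the string by hand is a small Python step (for example in a scenario) that derives it from the column values. This is only a minimal sketch, assuming the values have already been read into a list: `build_partition_spec` and the sample values are hypothetical, and the `/` separator simply mirrors the format quoted above, so it may need adjusting to whatever your DSS version expects.

```python
def build_partition_spec(values, separator="/"):
    """Join the distinct, non-empty values into one partition list string.

    The separator is an assumption taken from the example in this post.
    """
    distinct = sorted(
        {str(v).strip() for v in values if v is not None and str(v).strip()}
    )
    return separator.join(distinct)


# Hypothetical values of the column we want to partition by.
column_values = ["var_2", "var_1", "var_2", None, " var_3 "]

spec = build_partition_spec(column_values)
print(spec)  # var_1/var_2/var_3
```

The resulting string could then be stored in a variable and referenced from the recipe, so new partition values are picked up without manual edits.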
Any input would be greatly appreciated, thanks.
What is the approximate number of partitions that you have to redispatch? Redispatch works by creating one output writer for each partition it sees. If there is an extremely large number of partitions, this can require more memory than what is allocated for DSS engine jobs.
If your output dataset uses Parquet, you may want to try using CSV instead. Parquet requires extremely large amounts of memory to write, so redispatching to a large number of Parquet partitions is more likely to exceed memory allocation.
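To illustrate why memory grows with the number of partitions rather than with the data size, here is a rough sketch of the "one writer per partition" idea. It is not the actual DSS implementation: `redispatch` and `worst_case_buffer_bytes` are hypothetical names, and the 128 MiB per-writer figure is an assumption (it is the default Parquet row group size in parquet-mr, and the real per-writer footprint may differ).

```python
import io


def redispatch(rows, key):
    """Route each row to a writer chosen by its partition value.

    Every distinct value opens a new writer that stays open until the end,
    so the number of open buffers equals the number of partitions seen.
    """
    writers = {}
    for row in rows:
        part = row[key]
        if part not in writers:
            writers[part] = io.StringIO()  # stand-in for a real file writer
        writers[part].write(",".join(str(v) for v in row.values()) + "\n")
    return writers


def worst_case_buffer_bytes(n_partitions, buffer_bytes_per_writer):
    """Upper bound if every open writer holds a full buffer at once."""
    return n_partitions * buffer_bytes_per_writer


rows = [
    {"country": "FR", "amount": 10},
    {"country": "US", "amount": 7},
    {"country": "FR", "amount": 3},
]
writers = redispatch(rows, "country")
print(len(writers))  # 2

# Assumed numbers: 150 partitions, 128 MiB buffered per writer.
estimate = worst_case_buffer_bytes(150, 128 * 1024 * 1024)
print(estimate / 1024**3)  # 18.75
```

Under these assumed numbers the worst case is far larger than a typical job memory allocation, even when the input itself is only a few gigabytes.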
We do not have immediate plans to make redispatch partitioning available on other engines, as that would require very significant work, but we will take your feedback into consideration.
I'm in a similar situation. I have a ~8 GB unpartitioned dataset spread across 300 Parquet files that I want to split into ~150 partitions based on the value in a column. I currently have two working solutions, and either is a viable alternative:
1) A sync recipe on Spark with named output partitions. Every executor reads the entire dataset and writes out one partition.
2) Set the recipe level low (e.g. 2) and run with named partitions in DSS.
What does not work: checking Redispatch causes an OOM, and writing to CSV instead also fails, even on our rather beefy VM hosting DSS.
@Clément_Stenac I'm looking for a solution where, for example, a PySpark recipe writes the correctly partitioned dataset and DSS recognizes the dataset as partitioned, but does not attempt to launch a job for every output partition when the recipe is run. Is that possible using recipe overwrites or similar? I'm on DSS 8.0.3.
It also puzzles me that redispatch is limited not by the size of the data being dispatched but by the number of output writers. How much memory would you expect 150 such writers to use? I would prefer Parquet output.
So the solution we chose is a little hacky, but it works.
With this two-step process we created the desired re-partitioned dataset using the Spark engine.