Repartition discrete partitions on DSS engine failed, but manual input worked, suggestions?

kathyqingyuxu

Hello team,

I am trying to redispatch a discrete partition through the Sync recipe, using the process described in this documentation. When I run it on the DSS engine, the recipe fails with the error "Job process died (killed - maybe out of memory ?)". The Spark, Hive, and Impala engines all show the warning "Not available with redispatch partitioning". I tried using Hive as the engine anyway, since the option was not greyed out and could still be selected, but the recipe failed with the same "Job process died (killed - maybe out of memory ?)" error. To validate that I am identifying the partitions correctly on the back end, I ran the recipe on a single specific partition, which worked, so the partition selection itself is correct. I then manually entered all of the individual partitions in the Sync recipe instead, left the redispatch option unchecked (ex: var_1/var_2/..../var_n), and ran it on the Hive engine. This last approach worked on my end.

 

Although the last approach works, I would like to ask:

1. Are there suggestions on how to better automate this process? With the current workaround we would have to manually add any new partitions to the partition list, or alternatively manually edit a global variable and create a scenario. This becomes tedious as the number of possible partitions grows, and it increases the chance of human error. Currently, to check the total set of partitions, we create a separate group-by recipe on the column we want to partition by, download that file, and then build a string with all the possible values (a rough sketch of scripting this step is included after this list). Is there a suggested best practice we should be following instead that would be more efficient and automated? Ideally we would like to use the redispatch option, but it keeps failing on the DSS engine and does not behave as expected with the other engines.

2. Will the Hive/Impala/Spark engines be made available for redispatch in the future? Since the Hive engine worked when manually inputting all the partitions, is there a way for Hive to behave the same way when the redispatch option is checked? We would ideally like to use the redispatch option with the Hive engine if this is possible.
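For reference, here is a rough sketch of how the manual listing step might be scripted with the Dataiku Python API, for example from a Python scenario step run before the Sync recipe. The dataset, column, and variable names are placeholders, and the exact way the variable is fed into the recipe's partition list would depend on the setup:

```python
# Sketch only: dataset, column, and variable names below are placeholders.
import dataiku

# Read just the partitioning column of the source dataset
src = dataiku.Dataset("my_source_dataset")
df = src.get_dataframe(columns=["my_partition_col"])

# Build the "var_1/var_2/.../var_n" string used as the partition list
partition_spec = "/".join(sorted(df["my_partition_col"].dropna().astype(str).unique()))

# Store it in a project variable so a scenario can reference it later
client = dataiku.api_client()
project = client.get_project(dataiku.default_project_key())
variables = project.get_variables()
variables["standard"]["partition_list"] = partition_spec
project.set_variables(variables)
```

A scenario build step could then reference `${partition_list}` as the partition spec instead of a hand-maintained string.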

 

Any input would be greatly appreciated, thanks.

Best,

Kathy

5 Replies
Clément_Stenac
Dataiker

Hi,

What is the approximate number of partitions that you have to redispatch? Redispatch works by creating one output writer for each partition it sees. If there is an extremely large number of partitions, this can require more memory than what is allocated for DSS engine jobs.

If your output dataset uses Parquet, you may want to try using CSV instead. Parquet requires extremely large amounts of memory to write, so redispatching to a large number of Parquet partitions is more likely to exceed memory allocation.
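As a very rough illustration (the per-writer figures below are assumptions for the sake of the example, not measured values: a Parquet writer typically buffers an entire row group, commonly 128 MB by default, while a CSV writer only needs a small stream buffer):

```python
# Back-of-the-envelope sketch only; per-writer sizes are assumed, not measured
n_partitions = 500            # number of simultaneously open output writers

parquet_writer_mb = 128       # assumed: roughly one buffered row group per writer
csv_writer_mb = 4             # assumed: small in-memory buffer per writer

print(f"Parquet: ~{n_partitions * parquet_writer_mb / 1024:.1f} GB of writer buffers")
print(f"CSV:     ~{n_partitions * csv_writer_mb / 1024:.1f} GB of writer buffers")
# Parquet: ~62.5 GB of writer buffers
# CSV:     ~2.0 GB of writer buffers
```

The real numbers depend on the format settings and data, but this is why redispatching to many Parquet partitions tends to hit the memory limit first.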

We do not have immediate plans to make redispatch partitioning on other engines, as that would require very significant work, but we will be taking your feedback in consideration.

kresten
Level 2

I'm in a similar situation. I have a ~8 GB unpartitioned dataset spread across 300 Parquet files that I want to partition into ~150 partitions based on the value of a column. I currently have two solutions, either of which is a viable alternative:

1) A Sync recipe on the Spark engine with named output partitions. Every executor reads the entire dataset and writes out one partition.

2) Setting the recipe level low (e.g. 2) and running with named partitions on the DSS engine.

3) Checking redispatch is not an option: it causes OOM, and writing to CSV also fails on our rather beefy VM hosting DSS.

@Clément_Stenac I'm looking for a solution where, e.g., a PySpark recipe writes the correctly partitioned dataset, and DSS recognizes the dataset as partitioned but does not attempt to launch a job for every output partition when the recipe is run. Is that possible using recipe overrides or something similar? I'm on DSS 8.0.3.

Thanks

kresten
Level 2

It also puzzles me that redispatch is limited not by the size of the data being dispatched but rather by the number of output writers. How much memory would you expect 150 such writers to use? I would prefer Parquet output.

kresten
Level 2

So the solution we chose was a little hacky, but it works.

  1. Create a PySpark recipe that has an unpartitioned dataset as output:
    1. In the recipe, use PySpark to write a partitioned dataset to `path`.
    2. Don't let DSS output any data to the designated output dataset.
  2. Create a second PySpark recipe, following the first, with output to a DSS partitioned dataset:
    1. It has no code except for `wait(1)` and does not write any data to the output dataset.
    2. We manually set the dataset path to the path of the data written in step 1.
    3. We specify the partition specification in the settings of the DSS partitioned dataset.

With this two-step process we have created the desired repartitioned dataset using the Spark engine.
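The first recipe looks roughly like the sketch below; the dataset name, partition column, and HDFS path are placeholders for our setup:

```python
# Step 1 recipe -- sketch only; names and paths are placeholders
import dataiku
from dataiku import spark as dkuspark
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext.getOrCreate()
sqlContext = SQLContext(sc)

# Read the unpartitioned input through the DSS/Spark integration
input_ds = dataiku.Dataset("my_unpartitioned_input")
df = dkuspark.get_dataframe(sqlContext, input_ds)

# Write a partitioned Parquet dataset straight to HDFS, bypassing the DSS writer;
# nothing is written to the recipe's declared (unpartitioned) output dataset
(df.write
   .mode("overwrite")
   .partitionBy("my_partition_col")
   .parquet("hdfs:///data/MY_PROJECT/repartitioned"))
```

The second recipe then stays empty apart from the `wait(1)` placeholder, and the output dataset's path and partitioning pattern are edited by hand to point at that directory (Spark's `my_partition_col=value/` folder layout has to line up with the pattern configured in DSS).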

 

 

tanguy

We faced the same error ("Job process died (killed - maybe out of memory ?)") when trying to partition using the redispatch option with parquet format.

It eventually worked by applying @Clément_Stenac's advice:

  1. We managed to partition the table by using a CSV output format.
  2. We then converted the CSV partitions into Parquet.