stacking error!

PARA (Registered, Posts: 7)

I get the error below while stacking a dataset. I'm not sure what the error means; can someone please help?

Job failed: Job aborted., caused by: SparkException: Job aborted due to stage failure: Task 179 in stage 0.0 failed 4 times, most recent failure: Lost task 179.3 in stage 0.0 (TID 187, hklvathdp015.hk.standardchartered.com, executor 4): org.apache.spark.SparkException: Task failed while writing rows
    at org.apache.spark.internal.io.SparkHadoopMapReduceWriter$.org$apache$spark$internal$io$SparkHadoopMapReduceWriter$$executeTask(SparkHadoopMapReduceWriter.scala:178)
    at org.apache.spark.internal.io.SparkHadoopMapReduceWriter$$anonfun$3.apply(SparkHadoopMapReduceWriter.scala:89)
    at org.apache.spark.internal.io.SparkHadoopMapReduceWriter$$anonfun$3.apply(SparkHadoopMapReduceWriter.scala:88)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:108)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: Unterminated multi-line record at the end of the file
    at com.dataiku.dip.input.formats.csv.EscapingOnlyCSVParser.next(EscapingOnlyCSVParser.java:30)
    at com.dataiku.dip.shaker.mrimpl.formats.CSVInputFormatAdapter$InternalRecordReader.nextKeyValue(CSVInputFormatAdapter.java:141)
    at com.dataiku.dip.shaker.mrimpl.formats.UniversalFileInputFormat$1.nextKeyValue(UniversalFileInputFormat.java:155)
    at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:207)
    at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
    at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
    at org.apache.spark.internal.io.SparkHadoopMapReduceWriter$$anonfun$4.apply(SparkHadoopMapReduceWriter.scala:146)
    at org.apache.spark.internal.io.SparkHadoopMapReduceWriter$$anonfun$4.apply(SparkHadoopMapReduceWriter.scala:144)
    at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1374)
    at org.apache.spark.internal.io.SparkHadoopMapReduceWriter$.org$apache$spark$internal$io$SparkHadoopMapReduceWriter$$executeTask(SparkHadoopMapReduceWriter.scala:159)
    ... 8 more
Driver stacktrace:, caused by: SparkException: Task failed while writing rows, caused by: IOException: Unterminated multi-line record at the end of the file

Answers

  • fchataigner2 (Dataiker, Posts: 355)

    Hi,

    this error is a telltale sign of a misconfigured CSV file, or of CSV data containing multiline fields. Misconfiguration means that the style (Excel vs. Unix), the separator, or the quoting character is incorrect. This leads the CSV parser astray: it becomes unable to find the end of a quoted field and reads all the way to the end of the file. You need to adjust the CSV format settings in the input dataset's Settings > Preview tab.
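
    To see the failure mode concretely, here is a small Python sketch (the data is made up for illustration) showing how a stray, never-closed quoting character makes a parser swallow the rest of the file as one field:

        import csv
        import io

        # Made-up data: the comment on row 1 starts with a stray double
        # quote, and the file itself never uses quoting at all.
        data = 'id,comment\n1,"he said hi\n2,ok\n3,ok\n'

        # Correct setting for this file -- no quoting character: 4 clean rows.
        print(list(csv.reader(io.StringIO(data), quoting=csv.QUOTE_NONE)))
        # [['id', 'comment'], ['1', '"he said hi'], ['2', 'ok'], ['3', 'ok']]

        # Misconfigured setting -- '"' treated as the quote character: the
        # parser opens a quoted field at the stray quote, never finds a
        # closing one, and reads everything up to end-of-file into a single
        # field. This is the same failure mode the stack trace reports as an
        # "Unterminated multi-line record at the end of the file".
        print(list(csv.reader(io.StringIO(data), quotechar='"')))
        # [['id', 'comment'], ['1', 'he said hi\n2,ok\n3,ok\n']]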

    Multiline fields in CSV are generally not safe for Spark, so if your data contains some, you need to either use the DSS stream engine for the stack recipe instead of Spark, or consider getting this data in a non-CSV format like Parquet or ORC, where there aren't any issues with multiline fields.
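
    For illustration, here is a minimal Python sketch (again with made-up data) of why a perfectly valid multiline CSV record is fragile under line-oriented splitting, and how round-tripping through Parquet sidesteps the problem. It assumes pandas with pyarrow installed, and the output file name is hypothetical:

        import io
        import pandas as pd

        # A valid CSV record can span two physical lines when a quoted
        # field contains a newline:
        data = 'id,comment\n1,"line one\nline two"\n2,ok\n'

        # Anything that splits the file on newlines (as Spark does when it
        # assigns byte ranges of a text file to tasks) sees half-records:
        print(data.splitlines())
        # ['id,comment', '1,"line one', 'line two"', '2,ok']

        # A quote-aware parser reads the record correctly...
        df = pd.read_csv(io.StringIO(data))
        print(df)

        # ...and writing it back as Parquet removes the ambiguity entirely,
        # since Parquet stores values in a binary layout with no line framing.
        df.to_parquet('stacked_input.parquet')  # hypothetical path; needs pyarrow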
