Error with CSV dataset in Spark
UserBird
Dataiker, Alpha Tester Posts: 535 Dataiker
Hi,
When I try to process my CSV dataset on HDFS using Spark, I get the error message "java.io.IOException: Unterminated quoted field at the end of the file".
What is the reason?
Answers
Hi,
Your dataset probably contains multi-line records (quoted fields with embedded line breaks), which Spark cannot process reliably.
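To check whether a file really contains multi-line records, one option is to run a quick local test on a sample of it. The following is only a sketch using Python's standard `csv` module; the helper name `find_multiline_records` and the sample data are made up for illustration, and the delimiter and quote character are assumed to be `,` and `"`.

```python
import csv
import io


def find_multiline_records(text, delimiter=",", quotechar='"'):
    """Return (first_line, last_line) pairs for records that span
    more than one physical line of the input text."""
    # newline="" keeps line breaks untranslated so the csv module sees them as-is.
    reader = csv.reader(
        io.StringIO(text, newline=""), delimiter=delimiter, quotechar=quotechar
    )
    spans = []
    prev_line = 0
    for _ in reader:
        # reader.line_num counts physical lines consumed so far.
        if reader.line_num - prev_line > 1:
            spans.append((prev_line + 1, reader.line_num))
        prev_line = reader.line_num
    return spans


# Hypothetical sample: the second record has a line break inside a quoted field.
sample = 'id,comment\n1,"first line\nsecond line"\n2,plain\n'
print(find_multiline_records(sample))
```

An empty result on a representative sample suggests the quoting style is the more likely culprit.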
Spark and Hadoop work by splitting input files into segments and processing them in parallel. For CSV files, they cut at an arbitrary point in the file, look for the next end-of-line, and start processing from there.
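The effect of such a cut can be simulated locally. This is a simplified sketch, not the actual Hadoop or Spark reader: `read_split` is a hypothetical helper that mimics a line-oriented split (skip to the first line break after the start offset, then keep every line that begins before the end offset), and the data and the cut offset of 20 are invented for the example.

```python
import csv


def read_split(data, start, end):
    """Mimic a line-oriented split reader: unless start is 0, skip to the
    first line break after `start`, then return every whole line that
    begins before `end`."""
    pos = 0 if start == 0 else data.index("\n", start) + 1
    lines = []
    while pos < end and pos < len(data):
        nxt = data.find("\n", pos)
        nxt = len(data) if nxt == -1 else nxt + 1
        lines.append(data[pos:nxt])
        pos = nxt
    return lines


def parse(lines):
    # strict=True makes the csv module raise on an unterminated quoted field.
    return list(csv.reader(lines, strict=True))


# Hypothetical file content: record 1 has a line break inside a quoted field.
data = 'id,comment\n1,"first line\nsecond line"\n2,plain\n'

# Cut the "file" at offset 20, which falls inside the quoted field.
for name, (start, end) in {"split 1": (0, 20), "split 2": (20, len(data))}.items():
    lines = read_split(data, start, end)
    try:
        print(name, parse(lines))
    except csv.Error as exc:
        print(name, "failed:", exc)
```

In this sketch the first split ends while a quoted field is still open, so strict parsing fails, and the second split starts with the tail of that field and treats it as a record of its own. Neither worker sees the complete record.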
It is therefore not really possible to process multi-line records in Spark (or Hadoop), since a cut might fall in the middle of a record. We strongly recommend that you first sync your CSV dataset to a Parquet or ORC dataset, using the local DSS engine rather than Hadoop or Spark. Once the data is in a non-textual format, you will no longer have this issue.
Alternatively, this could be caused by an invalid quoting style: see http://answers.dataiku.com/561/unterminated-quoted-field-at-the-end-of-the-file