Error with CSV dataset in Spark

UserBird
Hi,

When I try to process my CSV dataset on HDFS using Spark, I get the error message "java.io.IOException: Unterminated quoted field at the end of the file".

What is the reason?

Answers

  • Clément_Stenac
    Hi,

    Your dataset probably has multi-line records (records in which a quoted field contains a line break), which cannot be processed in Spark.

    Spark and Hadoop work by cutting input data files into segments and processing them in parallel. For CSV files, they cut at an arbitrary point in the file, look for the next end-of-line, and start processing from there.

    Thus, it is not really possible to process multi-line records in Spark (or Hadoop), since the cut might land in the middle of a record. We strongly recommend that you start by syncing your CSV dataset to a Parquet or ORC one (using the local DSS engine instead of Hadoop or Spark). Once the data is in a "non-textual" format, you won't have this issue anymore.

    Alternatively, this could also be caused by an invalid quoting style: see http://answers.dataiku.com/561/unterminated-quoted-field-at-the-end-of-the-file
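
To make the splitting problem described above concrete, here is a small toy illustration in plain Python. It is only a sketch of the idea, not the actual Spark or Hadoop reader: the sample data, the split offset, and the helper names `split_lines` and `parse` are all made up for this example, and Python's `csv` module stands in for the CSV parser.

```python
import csv

# Hypothetical sample: the record with id 1 has a quoted field with a line break.
DATA = 'id,comment\n1,"first line\nsecond line"\n2,plain\n'


def split_lines(data, start, end):
    """Return the lines a line-based reader would assign to [start, end].

    Simplified imitation of line-oriented splitting: a split that does not
    begin at offset 0 skips ahead to just after the next line break, then
    takes every line that starts at or before `end`.
    """
    pos = start
    if start > 0:
        newline = data.find("\n", start)
        if newline == -1:
            return []
        pos = newline + 1
    lines = []
    while pos < len(data) and pos <= end:
        newline = data.find("\n", pos)
        stop = len(data) if newline == -1 else newline + 1
        lines.append(data[pos:stop])
        pos = stop
    return lines


def parse(lines):
    """Parse lines as CSV; strict mode raises csv.Error on an unclosed quote."""
    return list(csv.reader(lines, strict=True))


if __name__ == "__main__":
    # Read as a single piece, the multi-line record is parsed correctly.
    print(parse(split_lines(DATA, 0, len(DATA))))

    # Cut at an arbitrary offset that falls inside the quoted field.
    cut = 20
    first, second = split_lines(DATA, 0, cut), split_lines(DATA, cut, len(DATA))

    try:
        parse(first)
    except csv.Error as exc:
        # The first split ends while a quoted field is still open.
        print("first split failed:", exc)

    # The second split starts in the middle of the record and misreads it.
    print(parse(second))
```

The first split stops inside an open quoted field, which is the same situation the "Unterminated quoted field" error reports, and the second split treats the tail of the record as if it were a record of its own. A format such as Parquet or ORC does not rely on line breaks to find record boundaries, which is why converting first avoids the problem.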