Too many lines in my dataset in Hive
UserBird
Dataiker, Alpha Tester Posts: 535 Dataiker
I have a dataset with free text stored in HDFS, in CSV format.
When I go to the Explore view, everything looks OK. However, when I query the table in the Hive notebook, I see several rows for each record. It looks like the \n characters in my original file are not properly escaped and are treated as row separators.
Best Answer
Hi,
Unfortunately, this is inherent to the way Hadoop (and therefore Hive) handles "text files" (a category that includes CSV). To distribute the chunks of a file for parallel processing, Hadoop splits the file at arbitrary byte offsets and then uses \n to find record boundaries. Every \n is therefore treated as the end of a record, so multi-line CSV fields cannot be handled.
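To illustrate the difference, here is a minimal Python sketch. It is not Hive or Hadoop code, and the sample data and function names are made up for illustration. It compares a purely line-based reader, which approximates how a text-file reader cuts records, with a CSV-aware parser that keeps a quoted \n inside its field.

```python
import csv
import io

# Hypothetical sample: a header plus two records. The first record's
# free-text field contains an embedded newline inside quotes.
SAMPLE = 'id,comment\n1,"first line\nsecond line"\n2,"single line"\n'


def count_line_records(text):
    """Count records the way a line-based reader would: every \\n ends a record."""
    return len(text.splitlines())


def count_csv_records(text):
    """Count records with a CSV-aware parser: a quoted \\n stays in its field."""
    return len(list(csv.reader(io.StringIO(text))))


if __name__ == "__main__":
    print("line-based records:", count_line_records(SAMPLE))
    print("CSV-aware records:", count_csv_records(SAMPLE))
```

On this sample the line-based count is one higher than the CSV-aware count, because the quoted field is cut in two. This would be consistent with the Explore view looking correct while the Hive query returns extra rows.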
When processing data on HDFS, we strongly advise using dedicated file formats such as ORC or Parquet, which provide both far better performance and better compatibility.