Not able to extract documents in a zipped folder to HDFS

Shu · ‎02-26-2021

Hello community,

I was trying to pull some documents from a REST API to HDFS. Each document is compressed into its own zipped folder in the archive.

The code I used is:

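import io
import zipfile

import requests

s = requests.Session()
r = s.get(archv_path)

# Download the archive and read its first member into memory
z = zipfile.ZipFile(io.BytesIO(r.content))
try:
    foo = z.read(z.infolist()[0])
except IndexError:
    # Empty archive: fall back to empty bytes (z.read() returns bytes)
    foo = b""
output_folder.upload_stream(partition_folder + filename, foo)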

However, I noticed this leads to an issue: all the MS Word documents are saved as .docx files, whereas some of them are actually .doc files. And when I try to read the text of a downloaded document, the decoded result is garbled. I used chardet to detect the encoding; it always returns 'Windows-1254' and identifies the language as Turkish. Even when I set the encoding to 'Windows-1254', the text still does not come out right.

stream = Input.get_download_stream(doc_path)
bytes_out_text = stream.read()
out_text = bytes_out_text.decode('utf-8', 'ignore')
print(out_text)

I guess it is caused by the unzip step, but I am not sure how to fix it. I managed to get the document with z.extractall(), but it seems that extractall() only works on the local filesystem, as an absolute path is required. I have been struggling with this for the whole day... Could you please help me with it? Any ideas or suggestions would be appreciated!
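A note on the garbled text: both .doc and .docx are binary container formats, so decoding their raw bytes as text is unlikely to yield readable output whatever encoding is chosen, and chardet's guess on such bytes is not meaningful. A minimal sketch for telling the formats apart from their leading bytes follows. `sniff_extension` is a hypothetical helper, not part of any library mentioned in this thread, and the OLE2 signature is shared with other legacy Office formats, so ".doc" is only a best guess.

```python
import io
import zipfile

OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")  # legacy Office container (.doc, .xls, .ppt)
ZIP_MAGIC = b"PK\x03\x04"                        # .docx files are ZIP archives
PDF_MAGIC = b"%PDF-"


def sniff_extension(data: bytes) -> str:
    """Guess a file extension from the first bytes (hypothetical helper)."""
    if data.startswith(OLE2_MAGIC):
        return ".doc"  # assumption: could also be .xls or .ppt
    if data.startswith(PDF_MAGIC):
        return ".pdf"
    if data.startswith(ZIP_MAGIC):
        # A .docx is a ZIP archive that contains word/document.xml
        try:
            with zipfile.ZipFile(io.BytesIO(data)) as zf:
                if "word/document.xml" in zf.namelist():
                    return ".docx"
        except zipfile.BadZipFile:
            pass
        return ".zip"
    return ""  # unknown format
```

The result could be compared with the extension supplied by the REST API before the upload, to spot .doc files that were labelled .docx.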

fchataigner2 · ‎03-01-2021

Hi,

Where is the `filename` value coming from? Shouldn't it come from the ZipInfo `z.infolist()[0]`? If extractall() does the job, why not download to the local filesystem (in a temp folder if needed), uncompress there, then upload_stream() the uncompressed files?
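The temp-folder route suggested above could look roughly like the sketch below. `extract_and_upload` and its `upload` callback are hypothetical names: `upload(target_path, fileobj)` stands in for whatever call stores a file in the destination, such as the `upload_stream()` used earlier in the thread.

```python
import io
import os
import tempfile
import zipfile


def extract_and_upload(zip_bytes, upload, prefix=""):
    """Extract an archive into a temp dir, then pass each file to `upload`.

    `upload(target_path, fileobj)` is a hypothetical callback supplied by
    the caller; this function does not know about any storage API.
    """
    uploaded = []
    with tempfile.TemporaryDirectory() as tmp:
        with zipfile.ZipFile(io.BytesIO(zip_bytes)) as z:
            z.extractall(tmp)
        # Walk the extracted tree and upload every regular file
        for root, _dirs, files in os.walk(tmp):
            for name in files:
                local = os.path.join(root, name)
                rel = os.path.relpath(local, tmp).replace(os.sep, "/")
                with open(local, "rb") as f:
                    upload(prefix + rel, f)
                uploaded.append(prefix + rel)
    return uploaded
```

The temporary directory is removed when the `with` block exits, so nothing is left behind on the local disk between archives.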

Shu · ‎03-01-2021

Hello,

I did not use the extractall() function because there is a large number of documents involved. Will it hurt efficiency if I download and uncompress locally first?

The filename comes from the REST API... I specified it here because the documentation of upload_stream() says a target path for the file needs to be provided.

Thanks for the prompt response!

fchataigner2 · ‎03-01-2021

Hi

For a zip with many files, it's usually faster to uncompress from the local filesystem, because of the non-sequential accesses done by zip. For a single-file zip it may not make a big difference.

As you noted, `upload_stream()` needs a path, hence a filename, but it is totally oblivious to what the name is: .doc or .docx doesn't matter for this method. I'd trust the filename that's actually in the zip over a filename provided externally, unless the filename in the zip is totally mangled.
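Taking the name from the archive rather than from an external source can be sketched as follows, still working fully in memory. `first_member` is a hypothetical helper; it assumes the first non-directory entry is the document of interest.

```python
import io
import posixpath
import zipfile


def first_member(zip_bytes):
    """Return (name, data) for the first file in the archive, or None."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as z:
        for info in z.infolist():
            if info.is_dir():
                continue  # skip folder entries
            # Zip member names use forward slashes; keep only the last part
            name = posixpath.basename(info.filename)
            return name, z.read(info)
    return None  # archive has no files
```

The returned name carries the extension that was actually stored in the zip, so a .doc stays a .doc when it is uploaded.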

Out of curiosity, what are you reading the doc(x) files with?

Shu · ‎03-01-2021

Hello,

I have many zipped folders and each folder contains only one document - sorry I didn't explain it clearly. In this case, is it still faster to uncompress in the local filesystem?

As you suggested, I tried using the filename provided in z.infolist()[0] when uploading the stream, instead of the one I fetched from the REST API. However, I noticed that this way PDF documents are treated as binary files, so no preview is available. And when I downloaded one of the PDF documents, it appeared as a .pdf_ file. Do you happen to have an idea why this happens?

fchataigner2 · ‎03-01-2021

Hi,

Uncompressing in the local filesystem will probably be a bit slower than uncompressing from an in-memory buffer, but the performance hit should be negligible (disk and network I/O probably dominate the runtime), so you shouldn't refrain from using that option.

Can you share one of these zips that produce a .pdf_ file?

Shu · ‎03-04-2021

Hello,

I have checked the scripts again and realized that the .pdf_ extension occurs when the filename (extracted from z.infolist()) ends with a space. With the space removed, I get a normal .pdf file. So the uncompressing step itself went smoothly.
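A small sketch of that fix: with a trailing space, the extension as seen by `os.path.splitext` is ".pdf " rather than ".pdf", which presumably is what ended up displayed as ".pdf_". `clean_member_name` is a hypothetical helper name.

```python
import os


def clean_member_name(name):
    """Strip leading and trailing whitespace from a zip member name."""
    return name.strip()


raw = "report.pdf "                       # name with a trailing space
raw_ext = os.path.splitext(raw)[1]        # extension still carries the space
clean_ext = os.path.splitext(clean_member_name(raw))[1]
```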

One follow-up question: I also have a few .doc files in the database. As far as I know, the textract package, which is commonly used to read .doc files, does not accept a stream as input. The python-docx package accepts a stream but does not read .doc files. It seems I have to read the .doc files from the local filesystem after the uncompressing step... Do you happen to know any other option that allows reading a .doc file from a stream?

Many thanks!
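One possible workaround for a reader that only accepts a path is to spill the stream's bytes to a temporary file, call the reader on that path, and delete the file afterwards. This is only a sketch: `read_via_tempfile` is a hypothetical helper, and `reader` is a placeholder for any path-based extractor (for example `textract.process`, assuming textract and its system dependencies are installed).

```python
import os
import tempfile


def read_via_tempfile(data, reader, suffix=".doc"):
    """Write bytes to a temp file and call a path-only `reader` on it.

    `reader` is any callable taking a filesystem path (an assumption;
    plug in the extractor of your choice).
    """
    fd, path = tempfile.mkstemp(suffix=suffix)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
        return reader(path)
    finally:
        os.remove(path)  # always clean up the temp file
```

The suffix matters for readers that pick a parser from the file extension, so it should match the real format of the document.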
