I was trying to pull some documents from a REST API into HDFS. Each document is compressed into its own zipped folder in the archive.
The code I used is:
```python
import io
import zipfile
import requests

s = requests.Session()
r = s.get(archv_path)

# get it downloaded, and unzipped
z = zipfile.ZipFile(io.BytesIO(r.content))
try:
    foo = z.read(z.infolist()[0])   # read the single document in the archive
except IndexError:
    foo = ""

output_folder.upload_stream(partition_folder + filename, foo)
```
However, I noticed an issue: all the MS Word documents are downloaded as .docx files, whereas some of them are actually .doc files. And when I try to read the text of a downloaded document, the decoded result is a bit messy. I used chardet to detect the encoding; it always returns 'Windows-1254' and recognises the language as Turkish. Even when I set encoding = 'Windows-1254', the text still does not come out right.
```python
stream = Input.get_download_stream(doc_path)
bytes_out_text = stream.read()
out_text = bytes_out_text.decode('utf-8', 'ignore')
print(out_text)
```
I guess it was caused by the unzip step, but I am not sure how to fix it. I did manage to get the document out with z.extractall(), but it seems extractall() only works on the local filesystem, since an absolute path is required. I have been struggling with this for the whole day... Could you please help me with it? Any ideas or suggestions will be appreciated!
Where is the `filename` value coming from? Shouldn't it come from the ZipInfo entries in `z.infolist()`? If extractall() does the job, why not download to the local filesystem (into a temp folder if needed), uncompress there, then upload_stream() the uncompressed files?
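For illustration, a minimal sketch of that approach could look like the following, assuming `archv_path`, `partition_folder` and `output_folder` are defined as in your snippet, and that `upload_stream()` accepts a file-like object:

```python
import os
import tempfile
import zipfile
import requests

r = requests.get(archv_path)

with tempfile.TemporaryDirectory() as tmp_dir:
    # write the downloaded archive to a temp file
    archive_path = os.path.join(tmp_dir, "archive.zip")
    with open(archive_path, "wb") as f:
        f.write(r.content)

    # uncompress locally, then upload each extracted file
    with zipfile.ZipFile(archive_path) as z:
        z.extractall(tmp_dir)
        for info in z.infolist():
            if info.is_dir():
                continue
            local_path = os.path.join(tmp_dir, info.filename)
            with open(local_path, "rb") as extracted:
                # assuming upload_stream() accepts a file-like object
                output_folder.upload_stream(partition_folder + info.filename, extracted)
```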
I did not use the extractall() function because there is a large number of documents involved. Will it hurt efficiency if I download and uncompress locally first?
For the filename, it comes from the REST API... I specified it here because I see in the documentation of upload_stream() that a target path for the file needs to be provided.
Thanks for the prompt response!
For a zip with many files, it's usually faster to uncompress off the local filesystem, because of the non-sequential accesses done by zip. For a single-file zip it may not make a big difference.
As you noted, `upload_stream()` needs a path, hence a filename, but it is totally oblivious to what the name is; .doc or .docx doesn't matter to this method. I'd trust the filename that's actually stored in the zip over a filename provided externally, unless the filename in the zip is totally mangled.
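For example, a minimal sketch of taking the name from the archive itself (assuming one document per zip, and `output_folder` / `partition_folder` as in your snippet):

```python
import io
import zipfile

z = zipfile.ZipFile(io.BytesIO(r.content))
info = z.infolist()[0]    # the single document inside the zip
data = z.read(info)

# use the filename stored in the archive rather than the one from the REST API
output_folder.upload_stream(partition_folder + info.filename, data)
```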
Out of curiosity, what are you reading the doc(x) files with?
I have many zipped folders and each folder contains one document only - sorry I didn't explain that clearly. In this case, is it still faster to uncompress in the local filesystem?
As you suggested, I tried to use the filename provided by `z.infolist()` when uploading the stream, instead of the one I fetched from the REST API. However, I noticed that this way the PDF documents are treated as binary files, so no preview is available. When I tried to download such a PDF document, it turned out to be a .pdf_ file. Do you happen to have an idea why this happens?
Uncompressing on the local filesystem will probably be a bit slower than uncompressing from an in-memory buffer, but the performance hit should be negligible (disk and network I/O probably dominate the runtime), so you shouldn't refrain from using that option.
Can you share one of these zips that produce a .pdf_ file?
I have checked the scripts again and realized that the .pdf_ extension occurs when the filename (extracted from `z.infolist()`) contains a trailing space. With the space removed, it yields a normal .pdf file. So the uncompressing step finished smoothly.
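For reference, a minimal sketch of that fix, stripping any stray whitespace from the name stored in the archive (same names as in the earlier snippets):

```python
info = z.infolist()[0]
clean_name = info.filename.strip()   # drop leading/trailing spaces from the stored name
output_folder.upload_stream(partition_folder + clean_name, z.read(info))
```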
One follow-up question: I also have a few .doc files in the database. As far as I know, the textract package, which is commonly used to read .doc files, does not accept a stream as input. The python-docx package accepts a stream, but it does not read .doc files. It seems like I have to read the .doc files from the local filesystem after the uncompressing step... do you happen to know any other option that allows reading .doc files from a stream?
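In case it helps, one workaround I'm considering is spooling the stream into a temporary .doc file and pointing textract at that path; a minimal sketch, assuming textract is installed and `Input.get_download_stream()` works as in the earlier snippet:

```python
import tempfile
import textract

stream = Input.get_download_stream(doc_path)

# textract only takes a file path, so write the stream to a temporary .doc file first
with tempfile.NamedTemporaryFile(suffix=".doc") as tmp:
    tmp.write(stream.read())
    tmp.flush()
    text = textract.process(tmp.name).decode("utf-8", "ignore")

print(text)
```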