ERROR: poppler not found on PATH variable
Hello everyone,
I'm facing an issue with a Python function embedded in a Dataiku API. The business case involves applying OCR to PDFs, and I'm encountering two problems:
An error about the PATH environment variable, stating that the poppler-utils library cannot be found on it.
Failed: Failed to run function: <class 'pytesseract.pytesseract.TesseractNotFoundError'>: [Errno None] None: 'None'.
The Python code I'm developing is:
from flask import Flask, jsonify, request
from pprint import pprint
import numpy as np
import pandas as pd
#import requests
import json
from decimal import Decimal
from flask_cors import CORS
from datetime import datetime
import Functions_EVision as EVision
from pdf2image.exceptions import PDFPageCountError
import os
from os import environ as env
from dotenv import load_dotenv
from pdf2image import convert_from_path, convert_from_bytes
from urllib.request import Request, urlopen
import io  # from in-memory files to bytes
import ssl

# Disable SSL certificate verification
context = ssl.create_default_context()
context.check_hostname = False
context.verify_mode = ssl.CERT_NONE

def predict(Path, OutputType, NumPages):
    t_0 = datetime.now()  # start time for the execution-time report below
    # Download the remote document into memory
    remote_file = urlopen(Request(Path), context=context).read()
    print('remote file generated')
    try:
        print('try pdf')
        # Render the PDF pages as images (second argument is the DPI)
        ObjetoImagen = convert_from_bytes(
            remote_file,
            350,
            # poppler_path=path_Poller_pdf2image,
        )
        extension = 'PDF'
    except PDFPageCountError:
        # Not a readable PDF: let EVision treat the input as an image
        print('try image')
        ObjetoImagen = ' '
        extension = ' '
    print('try EVIsion')
    Text = EVision.EVision_OCRModel(
        path=Path,
        ExtensionDoc=extension,
        NumPages=NumPages,
        ImageObject=ObjetoImagen,
        TipoOutput=OutputType,
    )
    mensaje = {
        "Document_Text": Text[0],
        "Accuracy": Text[1]
    }
    t_f = datetime.now()
    TimeRunning = t_f - t_0
    delta_seg = TimeRunning.total_seconds()
    delta_min = delta_seg / 60
    print('Execution time', delta_seg, 'sec')
    print('Execution time', delta_min, 'min')
    return jsonify(mensaje)
Thanks in advance for your assistance.
Answers
Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 1,990 Neuron
Hi, please edit your code snippet and use a code block (the </> icon in the toolbar) so we can see the proper indentation of your code and try to reproduce the issue. As you know, Python code will not run without proper indentation.
In your imports you have Flask. Why is this needed? This is a Python API function, not a Flask webapp. Please review your imports and remove anything that's not required.
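To illustrate the point about Flask, here is a minimal sketch of building the response without it. It assumes the endpoint is happy with a plain JSON-serializable dict as the return value, which is my understanding of how Python function endpoints behave but is worth confirming against the Dataiku documentation. The helper name `build_response` and the `Execution_Seconds` key are made up for this example.

```python
import json
from datetime import datetime


def build_response(text, accuracy, started_at):
    """Build a plain dict response (hypothetical helper, no Flask involved)."""
    elapsed = (datetime.now() - started_at).total_seconds()
    return {
        "Document_Text": text,
        "Accuracy": accuracy,
        "Execution_Seconds": elapsed,  # illustrative extra field
    }


if __name__ == "__main__":
    start = datetime.now()
    response = build_response("sample text", 0.97, start)
    # json.dumps succeeds only if every value is JSON-serializable
    print(json.dumps(response))
```

With something like this, `predict` could end with `return build_response(Text[0], Text[1], t_0)` and the `flask` and `flask_cors` imports could be dropped.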
With regard to the PATH error, can you please clarify whether you are using a Python code environment in your API? The code environment associated with an endpoint can be configured in the “Settings” tab of the endpoint. This is the recommended way of adding packages to API endpoint functions. If you are not using a code environment, create one and set it on the endpoint as described above. Then redeploy the API service and Dataiku will deploy the code environment to the API node for you.
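While checking the code environment, it may also help to confirm what the API node process can actually see on its PATH. Both pdf2image and pytesseract call external executables (poppler's `pdftoppm` and the `tesseract` binary), so the Python packages alone are not enough. The following is a small diagnostic sketch, not an official Dataiku recipe. The function names are hypothetical, and you would run it temporarily inside the endpoint function or in a test query.

```python
import os
import shutil


def locate_ocr_binaries():
    """Return the resolved path (or None) of each external executable."""
    return {
        "pdftoppm": shutil.which("pdftoppm"),    # part of poppler-utils, used by pdf2image
        "tesseract": shutil.which("tesseract"),  # OCR engine, used by pytesseract
    }


def missing_binaries(located):
    """Return the sorted names of executables that were not found on PATH."""
    return sorted(name for name, path in located.items() if path is None)


if __name__ == "__main__":
    print("PATH seen by this process:", os.environ.get("PATH", ""))
    located = locate_ocr_binaries()
    for name, path in located.items():
        print(name, "->", path)
    print("Missing:", missing_binaries(located))
```

If a binary is installed but lives outside PATH, pdf2image accepts a `poppler_path` argument on `convert_from_bytes`, and pytesseract reads the executable location from `pytesseract.pytesseract.tesseract_cmd`. If it is not installed at all, it has to be installed on the API node itself.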