Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Minor ocr_interface.py Error Handling Improvement #3276

Open
AscendingGrass opened this issue Jun 22, 2024 · 1 comment
Open

Minor ocr_interface.py Error Handling Improvement #3276

AscendingGrass opened this issue Jun 22, 2024 · 1 comment
Labels
enhancement New feature or request ocr Related to optical character recognition (OCR).

Comments

@AscendingGrass
Copy link

In the ocr_interface.py file, it would be nice if the code handles the importlib.import_module(module_name) in the get_instance(...) function

@staticmethod
@functools.lru_cache(maxsize=None)
def get_instance(ocr_agent_module: str) -> "OCRAgent":
    module_name, class_name = ocr_agent_module.rsplit(".", 1)
    if module_name in OCR_AGENT_MODULES_WHITELIST:
        module = importlib.import_module(module_name)
        loaded_class = getattr(module, class_name)
        return loaded_class()
    else:
        raise ValueError(
            f"Environment variable OCR_AGENT module name {module_name}, must be set to a"
            f" whitelisted module part of {OCR_AGENT_MODULES_WHITELIST}.",
        )

I was so confused when I keep getting this error from the get_agent(...) function

ValueError: Environment variable OCR_AGENT must be set to an existing OCR agent module, not unstructured.partition.utils.ocr_models.tesseract_ocr.OCRAgentTesseract.

when after hours of digging it turns out I just haven't installed pandas lol🗿

@AscendingGrass AscendingGrass added the enhancement New feature or request label Jun 22, 2024
@scanny scanny added the ocr Related to optical character recognition (OCR). label Jun 25, 2024
@scanny
Copy link
Collaborator

scanny commented Jun 25, 2024

@AscendingGrass the code producing the error message you were getting is actually here:
https://github.com/Unstructured-IO/unstructured/blob/main/unstructured/partition/utils/ocr_models/ocr_interface.py#L38

class OCRAgent(ABC):
    """Defines the interface for an Optical Character Recognition (OCR) service."""

    @classmethod
    def get_agent(cls) -> OCRAgent:
        """Get the configured OCRAgent instance.

        The OCR package used by the agent is determined by the `OCR_AGENT` environment variable.
        """
        ocr_agent_cls_qname = cls._get_ocr_agent_cls_qname()
        try:
            return cls.get_instance(ocr_agent_cls_qname)
        except (ImportError, AttributeError):
            raise ValueError(
                f"Environment variable OCR_AGENT must be set to an existing OCR agent module,"
                f" not {ocr_agent_cls_qname}."
            )

The wording starts out very similarly :)

I'm thinking the right solution for this case is to print the original error messages as well as the "Environment variable ..." message.

I'm guessing the triggering error in your case was something like ImportError: cannot import package 'pandas' or whatever, and something like that might have been just the pointer you would need in that situation :)

What do you think about that idea?

Implementation would be something like:

        try:
            return cls.get_instance(ocr_agent_cls_qname)
        except (ImportError, AttributeError) as e:
            raise ValueError(
                f"Environment variable OCR_AGENT must be set to an existing OCR agent module,"
                f" not {ocr_agent_cls_qname}: {str(e)}"
            )

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request ocr Related to optical character recognition (OCR).
Projects
None yet
Development

No branches or pull requests

2 participants