Why use a dedicated loader for the PDF type instead of UnstructuredFileLoader? #680

Open
Kain-90 opened this issue Aug 12, 2024 · 5 comments

@Kain-90

Kain-90 commented Aug 12, 2024

The code below:

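from pathlib import Path

from langchain_community.document_loaders import PyMuPDFLoader, UnstructuredFileLoader


def load_document_content(file_path):
    # PDFs go through PyMuPDFLoader; everything else through UnstructuredFileLoader.
    if Path(file_path).suffix.lower() == ".pdf":
        print("in if")
        return PyMuPDFLoader(file_path)
    else:
        print("in else")
        return UnstructuredFileLoader(
            file_path, mode="elements", autodetect_encoding=True
        )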
@kartikpersistent kartikpersistent added the question Further information is requested label Aug 12, 2024
@aashipandya

Hi @Kain-90
We use a dedicated loader for PDFs because, for PDF documents, PyMuPDFLoader gives us better metadata (file size, page number, etc.) than UnstructuredFileLoader.

@Kain-90

Kain-90 commented Aug 13, 2024

Hi, thanks for your reply.

I tried it: the metadata does not include filesize, but it does include page_number.

{'source': '~/Downloads/invalid-pdf-structure-pdfminer-one-page.pdf', 'coordinates': {'points': ((25.0, 666.827), (25.0, 756.042), (451.07179999999994, 756.042), (451.07179999999994, 666.827)), 'system': 'PixelSpace', 'layout_width': 612.0, 'layout_height': 792.0}, 'file_directory': '~/Downloads', 'filename': 'invalid-pdf-structure-pdfminer-one-page.pdf', 'languages': ['eng'], 'last_modified': '2024-08-12T17:31:09', 'page_number': 1, 'filetype': 'application/pdf', 'parent_id': 'c1757fc3797d44f224a3cb1e57864016'}
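To make the comparison concrete, a small check like the one below can list which expected fields are absent from a metadata dict. This is only a sketch: the dict is an abbreviated copy of the dump above, and `missing_fields` is a hypothetical helper, not part of LangChain or this repository.

```python
# Abbreviated copy of the metadata dump above (coordinates and a few keys omitted).
metadata = {
    "source": "~/Downloads/invalid-pdf-structure-pdfminer-one-page.pdf",
    "filename": "invalid-pdf-structure-pdfminer-one-page.pdf",
    "last_modified": "2024-08-12T17:31:09",
    "page_number": 1,
    "filetype": "application/pdf",
}


def missing_fields(metadata, wanted):
    """Return the names in `wanted` that are not keys of `metadata`."""
    return [name for name in wanted if name not in metadata]


# Hypothetical list of fields one might expect from a PDF loader.
missing = missing_fields(metadata, ["filesize", "page_number", "total_pages"])
print(missing)  # ['filesize', 'total_pages']
```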

@Kain-90 Kain-90 closed this as completed Aug 13, 2024
@Kain-90

Kain-90 commented Aug 13, 2024

I just realized that the filesize attribute is not fetched via PyMuPDFLoader; it is returned by the method below when the file is uploaded. So which other, richer metadata is actually being used? It does not seem to be used anywhere, or am I looking in the wrong place?

def merge_chunks_local(file_name, total_chunks, chunk_dir, merged_dir):

    if not os.path.exists(merged_dir):
        os.mkdir(merged_dir)
    logging.info(f"Merged File Path: {merged_dir}")
    merged_file_path = os.path.join(merged_dir, file_name)
    with open(merged_file_path, "wb") as write_stream:
        for i in range(1, total_chunks + 1):
            chunk_file_path = os.path.join(chunk_dir, f"{file_name}_part_{i}")
            logging.info(f"Chunk File Path While Merging Parts:{chunk_file_path}")
            with open(chunk_file_path, "rb") as chunk_file:
                shutil.copyfileobj(chunk_file, write_stream)
            os.unlink(chunk_file_path)  # Delete the individual chunk file after merging
    logging.info("Chunks merged successfully and return file size")
    file_name, pages, file_extension = get_documents_from_file_by_path(
        merged_file_path, file_name
    )
    pdf_total_pages = pages[0].metadata["total_pages"]
    file_size = os.path.getsize(merged_file_path)
    return pdf_total_pages, file_size
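The point about file size can be checked in isolation. The sketch below is a stripped-down, standard-library-only version of the merge step, with no loader involved; `merge_parts` and `demo` are hypothetical names used for illustration, and the part-file naming follows the `{file_name}_part_{i}` pattern from the function above.

```python
import os
import shutil
import tempfile


def merge_parts(file_name, total_chunks, chunk_dir, merged_dir):
    """Concatenate <file_name>_part_1..N into one file and return its size in bytes."""
    os.makedirs(merged_dir, exist_ok=True)
    merged_path = os.path.join(merged_dir, file_name)
    with open(merged_path, "wb") as out:
        for i in range(1, total_chunks + 1):
            part_path = os.path.join(chunk_dir, f"{file_name}_part_{i}")
            with open(part_path, "rb") as part:
                shutil.copyfileobj(part, out)
    # The size is read from the filesystem, not from any loader metadata.
    return os.path.getsize(merged_path)


def demo():
    """Write two small part files into a temp directory, merge them, return the size."""
    with tempfile.TemporaryDirectory() as tmp:
        chunk_dir = os.path.join(tmp, "chunks")
        merged_dir = os.path.join(tmp, "merged")
        os.makedirs(chunk_dir)
        for i, payload in enumerate([b"abc", b"defg"], start=1):
            with open(os.path.join(chunk_dir, f"sample.pdf_part_{i}"), "wb") as f:
                f.write(payload)
        return merge_parts("sample.pdf", 2, chunk_dir, merged_dir)


print(demo())  # 7 (3 bytes + 4 bytes)
```

Here the size comes entirely from `os.path.getsize`, so it would be the same whichever loader later parses the merged file.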

@Kain-90 Kain-90 reopened this Aug 13, 2024
@Kain-90

Kain-90 commented Aug 13, 2024

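And for the PDF file type, the following code makes it simpler to obtain the number of pages: in "paged" mode the loader returns one document per page, so the length of the result is the page count.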

def load_document_content(file_path):
    if Path(file_path).suffix.lower() == ".pdf":
        return UnstructuredFileLoader(
            file_path, mode="paged", autodetect_encoding=True
        )
    ...

loader = load_document_content(...)
page_number = len(loader.load())
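If the loader stays in mode="elements", the page count could probably also be derived from the page_number field seen in the metadata dump earlier in this thread, without reloading in "paged" mode. A minimal sketch, assuming each element's metadata is a plain dict; `count_pages` is a hypothetical helper, not a LangChain or repository function.

```python
def count_pages(metadata_list):
    """Return the highest page_number found, or 0 if no element carries one."""
    return max((m.get("page_number") or 0 for m in metadata_list), default=0)


# Hypothetical element metadata, shaped like the dump shown earlier.
elements = [
    {"page_number": 1, "filetype": "application/pdf"},
    {"page_number": 1, "filetype": "application/pdf"},
    {"page_number": 2, "filetype": "application/pdf"},
    {"filetype": "application/pdf"},  # element without a page number
]

page_count = count_pages(elements)
print(page_count)  # 2
```

One caveat of this approach: trailing pages with no extracted elements would not be counted, so it gives a lower bound rather than a guaranteed total.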

@kartikpersistent

@aashipandya
