Why use a loader specifically for the pdf type instead of using unstructuredFileLoader? #680

Kain-90 · 2024-08-12T05:12:51Z

The code below:

def load_document_content(file_path):
    if Path(file_path).suffix.lower() == ".pdf":
        print("in if")
        return PyMuPDFLoader(file_path)
    else:
        print("in else")
        return UnstructuredFileLoader(
            file_path, mode="elements", autodetect_encoding=True
        )

aashipandya · 2024-08-12T11:08:12Z

Hi @Kain-90
We use a specific loader for PDFs because, for PDF documents, PyMuPDFLoader gives us better metadata (file size, page number, etc.) than UnstructuredFileLoader.

Kain-90 · 2024-08-13T01:14:24Z

Hi, thanks for your reply.

I tried it: the metadata does not include the file size, but it does include page_number.

{'source': '~/Downloads/invalid-pdf-structure-pdfminer-one-page.pdf',
 'coordinates': {'points': ((25.0, 666.827),
                            (25.0, 756.042),
                            (451.07179999999994, 756.042),
                            (451.07179999999994, 666.827)),
                 'system': 'PixelSpace',
                 'layout_width': 612.0,
                 'layout_height': 792.0},
 'file_directory': '~/Downloads',
 'filename': 'invalid-pdf-structure-pdfminer-one-page.pdf',
 'languages': ['eng'],
 'last_modified': '2024-08-12T17:31:09',
 'page_number': 1,
 'filetype': 'application/pdf',
 'parent_id': 'c1757fc3797d44f224a3cb1e57864016'}
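
One way to settle which keys each loader really returns is to diff the metadata keys of the documents they produce. The following is only a minimal sketch, not code from the repository: `metadata_keys` and `compare_metadata` are hypothetical helpers, the stand-in documents use made-up keys, and the only assumption is that each loaded document exposes a `.metadata` dict, as LangChain `Document` objects do.

```python
from types import SimpleNamespace


def metadata_keys(docs):
    """Return the union of metadata keys across all loaded documents."""
    keys = set()
    for doc in docs:
        keys.update(doc.metadata.keys())
    return keys


def compare_metadata(docs_a, docs_b):
    """Return (only_in_a, only_in_b, shared) as sorted lists of keys."""
    a, b = metadata_keys(docs_a), metadata_keys(docs_b)
    return sorted(a - b), sorted(b - a), sorted(a & b)


# Stand-in documents (hypothetical keys); swap in real loader output to compare.
docs_a = [SimpleNamespace(metadata={"source": "x.pdf", "total_pages": 1})]
docs_b = [SimpleNamespace(metadata={"source": "x.pdf", "page_number": 1})]
only_a, only_b, shared = compare_metadata(docs_a, docs_b)
```

In practice, `docs_a` and `docs_b` would be the results of calling `.load()` on the two loaders for the same PDF.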

Kain-90 · 2024-08-13T01:40:19Z

I just realized that the filesize attribute is not fetched via PyMuPDFLoader; it is returned by the method below when the file is uploaded. So what other, richer metadata is there? It does not seem to be used. Or am I looking in the wrong place?

def merge_chunks_local(file_name, total_chunks, chunk_dir, merged_dir):

    if not os.path.exists(merged_dir):
        os.mkdir(merged_dir)
    logging.info(f"Merged File Path: {merged_dir}")
    merged_file_path = os.path.join(merged_dir, file_name)
    with open(merged_file_path, "wb") as write_stream:
        for i in range(1, total_chunks + 1):
            chunk_file_path = os.path.join(chunk_dir, f"{file_name}_part_{i}")
            logging.info(f"Chunk File Path While Merging Parts:{chunk_file_path}")
            with open(chunk_file_path, "rb") as chunk_file:
                shutil.copyfileobj(chunk_file, write_stream)
            os.unlink(chunk_file_path)  # Delete the individual chunk file after merging
    logging.info("Chunks merged successfully and return file size")
    file_name, pages, file_extension = get_documents_from_file_by_path(
        merged_file_path, file_name
    )
    pdf_total_pages = pages[0].metadata["total_pages"]
    file_size = os.path.getsize(merged_file_path)
    return pdf_total_pages, file_size
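
To make the point above concrete: the file size comes from the filesystem, and only the page count depends on loader metadata. The sketch below is hypothetical (`file_stats` is not a function in the repository) and assumes the page count, when available, sits under a `total_pages` key as in the function above. It uses `.get` so that metadata without that key yields `None` instead of raising `KeyError`.

```python
import os
import tempfile


def file_stats(path, first_page_metadata):
    """Return (total_pages, file_size) for an already merged file.

    file_size is read from the filesystem, independent of any loader.
    total_pages is taken from the metadata if present, otherwise None.
    """
    total_pages = first_page_metadata.get("total_pages")
    file_size = os.path.getsize(path)
    return total_pages, file_size


# Demonstration with a temporary file and hand-written metadata dicts.
payload = b"%PDF-1.4 placeholder bytes"
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(payload)
    tmp_path = tmp.name

stats_with_key = file_stats(tmp_path, {"total_pages": 3})
stats_without_key = file_stats(tmp_path, {"page_number": 1})
os.unlink(tmp_path)  # Clean up the temporary file
```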

Kain-90 · 2024-08-13T02:01:05Z

And for the PDF file type, the following code makes it simpler to obtain the number of pages.

def load_document_content(file_path):
    if Path(file_path).suffix.lower() == ".pdf":
        return UnstructuredFileLoader(
            file_path, mode="paged", autodetect_encoding=True
        )
    ...

loader = load_document_content(...)
page_number = len(loader.load())
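
If the loader stays in mode="elements", a page count could also be derived from the page_number key that appears in the metadata dump earlier in this thread. This is a rough sketch under assumptions: `count_pages` is a hypothetical helper, documents are assumed to expose a `.metadata` dict, and a page that produces no elements would not be counted.

```python
from types import SimpleNamespace


def count_pages(docs):
    """Count distinct page_number values found in the documents' metadata.

    Documents whose metadata has no page_number key are ignored.
    """
    pages = {
        doc.metadata["page_number"]
        for doc in docs
        if "page_number" in doc.metadata
    }
    return len(pages)


# Stand-in elements: two on page 1, one on page 2, one without a page number.
elements = [
    SimpleNamespace(metadata={"page_number": 1}),
    SimpleNamespace(metadata={"page_number": 1}),
    SimpleNamespace(metadata={"page_number": 2}),
    SimpleNamespace(metadata={"filename": "example.pdf"}),
]
page_count = count_pages(elements)
```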

kartikpersistent · 2024-09-02T12:06:49Z

@aashipandya

kartikpersistent added the question Further information is requested label Aug 12, 2024

kartikpersistent assigned aashipandya Aug 12, 2024

Kain-90 closed this as completed Aug 13, 2024

Kain-90 reopened this Aug 13, 2024
