Commit
#13
Filimoa committed Apr 9, 2024
1 parent 1e6c6c8 commit 838ce27
Showing 3 changed files with 49 additions and 42 deletions.
6 changes: 4 additions & 2 deletions ARCHITECTURE.md
@@ -4,7 +4,7 @@

#### /PDF

-This is really just a wrapper class around a pdfminer, pymupdf and pypdf. We implement some basic visualization / export methods. Would like to migrate away from pymupdf for converting pdfs to images due to its licensing.
+This is really just a wrapper class around a `pdfminer`, `pymupdf` and `pypdf`. We implement some basic visualization / export methods. Would like to migrate away from `pymupdf` for converting pdfs to images due to its licensing.

#### /Schemas

@@ -18,7 +18,7 @@ This module implements basic text parsing along with basic markdown support.

We parse text into markdown by looking at the font size and style character by character. Consecutive characters with the same styling get combined into a span.

-Spans get combined into lines and lines get combined into elements. Elements are the basic building blocks of the document. They can be headings, paragraphs, bullets, etc.
+Spans get combined into lines and lines get combined into elements. Elements are the basic building blocks of the document. They can be headings, paragraphs, lists of bullets, etc.

Optionally we can use PyMuPDF to OCR the document. This is not recommended as a default due to the additional computational cost and inherent inaccuracies of OCR. We're looking at integrating [doctr](https://github.com/mindee/doctr).
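
The character-to-span step described in this hunk can be illustrated with a minimal sketch. It assumes each character arrives with a font size and bold/italic flags; `Char`, `Span`, and `group_into_spans` are hypothetical names made up for this example and are not taken from the openparse source.

```python
from dataclasses import dataclass
from itertools import groupby
from typing import Iterable, List


@dataclass(frozen=True)
class Char:
    """One character plus its styling (hypothetical type for illustration)."""
    text: str
    size: float
    bold: bool = False
    italic: bool = False


@dataclass(frozen=True)
class Span:
    """A run of consecutive characters that share the same styling."""
    text: str
    size: float
    bold: bool
    italic: bool

    def to_markdown(self) -> str:
        body = self.text.strip()
        if not body:
            return self.text  # whitespace-only spans are left untouched
        if self.bold and self.italic:
            marker = "***"
        elif self.bold:
            marker = "**"
        elif self.italic:
            marker = "*"
        else:
            return self.text  # unstyled text needs no markers
        # Keep surrounding whitespace outside the emphasis markers.
        lead = self.text[: len(self.text) - len(self.text.lstrip())]
        trail = self.text[len(self.text.rstrip()):]
        return f"{lead}{marker}{body}{marker}{trail}"


def group_into_spans(chars: Iterable[Char]) -> List[Span]:
    """Merge adjacent characters with identical (size, bold, italic) styling."""
    def style(c: Char):
        return (c.size, c.bold, c.italic)

    return [
        Span("".join(c.text for c in run), *key)
        for key, run in groupby(chars, key=style)
    ]


# Usage: a bold label followed by a regular-weight value.
chars = [Char(c, 12.0, bold=True) for c in "Total: "] + [Char(c, 12.0) for c in "42"]
spans = group_into_spans(chars)
print("".join(s.to_markdown() for s in spans))  # prints **Total:** 42
```

`itertools.groupby` only merges adjacent items, which matches the idea of a span as a contiguous run. Grouping spans into lines and lines into elements would need position information that this sketch leaves out.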

@@ -36,6 +36,8 @@ Lastly unitable is our recommended approach for table extraction. It is a transf

We're also looking at speeding unitable up. This can be done either by quantizing the model or by using the smaller, 70M parameter model they released. Unfortunately, the smaller model was not fine-tuned, which is holding us back from implementing it. You can see the published paper [here](https://arxiv.org/abs/2403.04822).

+A ton of credit goes to the unitable team - they've done an amazing job making their research reproducible. You can find the original repository with full training code [here](https://github.com/poloclub/unitable).

## Processing Pipeline

#### /Processing
84 changes: 45 additions & 39 deletions src/cookbooks/unitable.ipynb

Large diffs are not rendered by default.

1 change: 0 additions & 1 deletion src/openparse/consts.py
@@ -1,6 +1,5 @@
from typing import Literal

MAX_EMBEDDING_TOKENS = 8000
TOKENIZATION_LOWER_LIMIT = 256
TOKENIZATION_UPPER_LIMIT = 1024
COORDINATE_SYSTEM: Literal["top-left", "bottom-left"] = "bottom-left"