improved docs
Filimoa committed Apr 10, 2024
1 parent b8b5bcf commit bdf2055
Showing 7 changed files with 50 additions and 48 deletions.
22 changes: 18 additions & 4 deletions docs/parsing-tables/overview.md
@@ -1,9 +1,23 @@
Automatically identifying and extracting tables from PDF documents is a highly sought-after capability. It's an active area of research, and our goal is to give the community access to the most effective tools available.

**By default, table parsing is turned off.** It adds significant computational overhead, so we've made it optional.

We expose both cutting-edge deep learning techniques and traditional bounding box-based methods. Our approach is parsing-algorithm agnostic, allowing you to select the method that best suits your needs.

At the moment, we offer three options for extracting tables from PDFs: `unitable`, `pymupdf`, and `table-transformers`. Each method has its own advantages and limitations, so you can choose the one that best fits your documents.

```python hl_lines="2"
parser = openparse.DocumentParser(
    table_args={...}
)

# ingesting the document
parsed_10k = parser.parse(meta10k_path)
```
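As a concrete sketch, `table_args` is a dict that selects the parsing algorithm and its options. The values below use the `pymupdf` algorithm documented on the following pages; the file path is just a placeholder:

```python
import openparse

parser = openparse.DocumentParser(
    table_args={
        "parsing_algorithm": "pymupdf",
        "table_output_format": "markdown",  # pymupdf also supports "html"
    }
)

# "document.pdf" is a placeholder path
parsed_doc = parser.parse("document.pdf")
```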


!!! note "Become a Contributor"

    - If you have experience with quantizing models or optimizing them for inference, we would love to hear from you! Unitable achieves **state-of-the-art performance** on table extraction, but it is computationally expensive. We are looking to optimize the model for inference and reduce its size.
16 changes: 6 additions & 10 deletions docs/parsing-tables/pymupdf.md
@@ -4,17 +4,13 @@ With version 1.23.0, PyMuPDF has added table recognition and extraction facilities

We find it tends to work well on dense tables with a relatively simple structure. It's also very fast.

```python
from typing import Literal, TypedDict

# Table-parsing arguments follow this schema
class PyMuPDFArgsDict(TypedDict, total=False):
    parsing_algorithm: Literal["pymupdf"]
    table_output_format: Literal["markdown", "html"]
```

## Parameters

| Name | Type | Description | Default |
|------------------------|--------------------------------|------------------------------------------------------------------------|---------|
| `parsing_algorithm`    | `Literal["pymupdf"]`           | The library used for parsing, in this case, pymupdf.                    | None    |
| `table_output_format`  | `Literal["markdown", "html"]`  | The format of the extracted tables. Supports both markdown and html.    | 'html'  |

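For instance, a minimal sketch of passing these arguments (the Example section below shows full usage):

```python
import openparse

parser = openparse.DocumentParser(
    table_args={
        "parsing_algorithm": "pymupdf",
        "table_output_format": "markdown",  # or "html"
    }
)
```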

### Example

21 changes: 7 additions & 14 deletions docs/parsing-tables/table-transformers.md
@@ -5,22 +5,15 @@ Table Transformers is a deep learning approach to table detection and extraction

We find it works well on tables with more complex structures and significant whitespace.

```python
from typing import Literal, TypedDict

# Table-parsing arguments follow this schema
class TableTransformersArgsDict(TypedDict, total=False):
    parsing_algorithm: Literal["table-transformers"]
    min_table_confidence: float
    min_cell_confidence: float
    table_output_format: Literal["markdown", "html"]
```
## Parameters

| Name | Type | Description | Default |
|----------------------|--------------------------------------|--------------------------------------------------------------------------------------------------------------|---------|
| `parsing_algorithm` | `Literal["table-transformers"]` | The library used for parsing, in this case, table-transformers. | None |
| `min_table_confidence` | `float` | The minimum confidence score for a table to be extracted. | None |
| `min_cell_confidence` | `float` | The minimum confidence score for a cell to be extracted. | None |
| `table_output_format` | `Literal["markdown", "html"]` | The format of the extracted tables. Supports both markdown and html. | None |

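For instance, a minimal sketch of passing these arguments (the confidence thresholds here are illustrative values, not documented defaults; the Example section below shows full usage):

```python
import openparse

parser = openparse.DocumentParser(
    table_args={
        "parsing_algorithm": "table-transformers",
        "min_table_confidence": 0.75,   # illustrative threshold
        "min_cell_confidence": 0.95,    # illustrative threshold
        "table_output_format": "html",  # or "markdown"
    }
)
```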

### Example

20 changes: 6 additions & 14 deletions docs/parsing-tables/unitable.md
@@ -16,22 +16,14 @@ $ openparse-download
This will download the weights. They're about 1.5GB in size.


## Parameters

```python
from typing import Literal, TypedDict

# Table-parsing arguments follow this schema
class UnitableArgsDict(TypedDict, total=False):
    parsing_algorithm: Literal["unitable"]
    min_table_confidence: float
    table_output_format: Literal["html"]
```

| Name | Type | Description | Default |
|----------------------|-------------------------|-----------------------------------------------------------------------------|---------|
| `parsing_algorithm` | `Literal["unitable"]` | The library used for parsing, in this case, unitable. | None |
| `min_table_confidence` | `float` | The minimum confidence score for a table to be extracted. | 0.75 |
| `table_output_format` | `Literal["html"]` | The format of the extracted tables. Currently only supports html. | None |

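For instance, a minimal sketch of passing these arguments (the Example section below shows full usage):

```python
import openparse

parser = openparse.DocumentParser(
    table_args={
        "parsing_algorithm": "unitable",
        "min_table_confidence": 0.75,   # defaults to 0.75
        "table_output_format": "html",  # unitable currently only outputs html
    }
)
```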


### Example
2 changes: 0 additions & 2 deletions docs/parsing-text/overview.md → docs/processing/ocr.md
@@ -1,5 +1,3 @@
Text processing is how we extract textual elements from within a doc and convert them to Markdown. The output is a set of Nodes that represent distinct parts of the layout - like a heading or paragraph.

### 1. Default Text Processing with PdfMiner
Use PdfMiner if your documents are text-heavy, well-structured, and do not contain non-text elements that require OCR.

12 changes: 9 additions & 3 deletions docs/processing/overview.md
@@ -2,10 +2,14 @@

Processing is how we group related elements together to form a coherent structure. The output is a set of Nodes that represent distinct sections of the document.

<img src="https://sergey-filimonov.nyc3.digitaloceanspaces.com/open-parse/marketing/open-parse-architecture.png">

## 1. Default Processing

By default, we use a simple heuristic to group elements together. This works well for many documents.

These are mostly just common-sense transforms - a heading should be grouped with the text that follows it, a bullet list should be kept together, etc.

```python
from openparse import DocumentParser

parser = DocumentParser()
```

## 2. Semantic Processing (Recommended)

Chunking documents is fundamentally about grouping similar semantic nodes together. Perhaps the most powerful way to do this is to use embeddings. **By embedding the text of each node, we can then cluster them together based on their similarity.**
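To illustrate the idea, here is a conceptual sketch (not Open Parse's actual implementation) of greedily merging consecutive nodes whose embeddings are similar; the `embed` argument is a stand-in for any text-embedding function:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def group_nodes(node_texts: list[str], embed, threshold: float = 0.8) -> list[list[str]]:
    """Greedily merge consecutive nodes whose embeddings are similar.

    `embed` is a placeholder for any embedding call (e.g. the OpenAI API).
    """
    if not node_texts:
        return []
    groups = [[node_texts[0]]]
    prev_vec = embed(node_texts[0])
    for text in node_texts[1:]:
        vec = embed(text)
        if cosine_similarity(prev_vec, vec) >= threshold:
            groups[-1].append(text)  # similar enough: keep in the same chunk
        else:
            groups.append([text])    # dissimilar: start a new chunk
        prev_vec = vec
    return groups
```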

We currently only support the OpenAI API for generating embeddings, but we plan on adding more options soon.

```python
from openparse import processing, DocumentParser
# ... (DocumentParser is configured with a semantic processing pipeline here; see the cookbook linked below) ...
parsed_content = parser.parse(basic_doc_path)
```

If you're interested in understanding how this works, you can see a demo notebook [here](https://github.com/Filimoa/open-parse/blob/main/src/cookbooks/semantic_processing.ipynb).

#### Notes on Node Size:

We have a bias towards chunking that results in larger nodes - models have increasingly large context windows and we find larger nodes perform better.

A more thorough discussion can be found [here](https://www.llamaindex.ai/blog/evaluating-the-ideal-chunk-size-for-a-rag-system-using-llamaindex-6207e5d3fec5).

5 changes: 4 additions & 1 deletion docs/visualization.md
@@ -12,7 +12,10 @@ parsed_basic_doc = parser.parse(basic_doc_path)

```python
for node in parsed_basic_doc.nodes:
    display(node)
```

<br/>
<p align="center">
<img src="https://sergey-filimonov.nyc3.digitaloceanspaces.com/open-parse/marketing/pretty-markdown-nodes.webp" width="650" />
</p>
You can also display the results directly overlaid on the original PDF.

```python
