diff --git a/docs/parsing-tables/overview.md b/docs/parsing-tables/overview.md
index 62957c3..9113973 100644
--- a/docs/parsing-tables/overview.md
+++ b/docs/parsing-tables/overview.md
@@ -1,9 +1,23 @@
-The ability to automatically identify and extract tables from PDF documents is a highly sought-after feature.
+Automatically identifying and extracting tables from PDF documents is a highly sought-after feature. It's an active area of research, and our goal is to give the community access to the most effective tools available.
-This is an active area of research and we aim to expose the best available tools to the community. This is a blend of newer deep learning approaches and traditional bounding box-based methods. We aim to be parsing algorithm agnostic and allow users to choose the method that best suits their needs.
+
+**By default this is turned off.** Parsing tables adds significant computational overhead, so we've made it optional.
+
+We expose both cutting-edge deep learning techniques and traditional bounding box-based methods. Our approach is designed to be flexible, allowing users to select the parsing algorithm that best suits their needs.
+
+At the moment, we offer three options for extracting tables from PDFs: `unitable`, `pymupdf`, and `table-transformer`. Each method has its own advantages and limitations, so you can choose the one that fits your requirements.
+
+```python hl_lines="2"
+parser = openparse.DocumentParser(
+    table_args={...}
+)
+
+# ingesting the document
+parsed_10k = parser.parse(meta10k_path)
+```
-Currently, we support three methods for extracting tables from PDFs.
 
 ## Become a Contributor?
 
-- If you have experience with quantizing models or optimizing them for inference, we would love to hear from you! Unitable achieves **state-of-the-art performance** on table extraction, but it is computationally expensive. We are looking to optimize the model for inference and reduce the size of the model.
+!!! note "Become a Contributor"
+
+    - If you have experience with quantizing models or optimizing them for inference, we would love to hear from you! Unitable achieves **state-of-the-art performance** on table extraction, but it is computationally expensive. We are looking to optimize the model for inference and reduce the size of the model.
diff --git a/docs/parsing-tables/pymupdf.md b/docs/parsing-tables/pymupdf.md
index 36c11f8..86d0a3b 100644
--- a/docs/parsing-tables/pymupdf.md
+++ b/docs/parsing-tables/pymupdf.md
@@ -4,17 +4,13 @@ With version 1.23.0, PyMuPDF has added table recognition and extraction facilities
 
 We find it tends to work well on dense tables, with a relatively simple structure. It's also very fast.
 
-```python
-# Arguments follow the following schema
-class PyMuPDFArgsDict(TypedDict, total=False):
-    parsing_algorithm: Literal["pymupdf"]
-    table_output_format: Literal["markdown", "html"]
-```
-
-The following arguments are supported:
+### Parameters:
+
+| Name                  | Type                          | Description                                                     | Default |
+|-----------------------|-------------------------------|-----------------------------------------------------------------|---------|
+| `parsing_algorithm`   | `Literal['pymupdf']`          | The library used for parsing, in this case, pymupdf.            | None    |
+| `table_output_format` | `Literal['markdown', 'html']` | The format of the extracted tables. Supports markdown and html. | None    |
-
-- `parsing_algorithm` specifies the library used for parsing, in this case, `pymupdf`.
-- `table_output_format` specifies the format of the extracted tables.
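To make the parameters above concrete, a minimal sketch of a `table_args` dict for `pymupdf` (illustrative only, not part of the diff): the parser calls are commented out because they assume `openparse` is installed and a PDF path exists.

```python
# Illustrative sketch: a table_args dict mirroring the pymupdf parameters above.
table_args = {
    "parsing_algorithm": "pymupdf",
    "table_output_format": "markdown",  # or "html"
}

# Hypothetical usage, requires openparse and a PDF on disk:
# parser = openparse.DocumentParser(table_args=table_args)
# parsed_doc = parser.parse("path/to/document.pdf")
```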
 ### Example
diff --git a/docs/parsing-tables/table-transformers.md b/docs/parsing-tables/table-transformers.md
index 907c071..20ee2c5 100644
--- a/docs/parsing-tables/table-transformers.md
+++ b/docs/parsing-tables/table-transformers.md
@@ -5,22 +5,15 @@ Table Transformers is a deep learning approach to table detection and extraction
 
 We find it works well on tables with more complex structures and significant whitespace.
 
-```python
-# Arguments follow the following schema
-class TableTransformersArgsDict(TypedDict, total=False):
-    parsing_algorithm: Literal["table-transformers"]
-    min_table_confidence: float
-    min_cell_confidence: float
-    table_output_format: Literal["markdown", "html"]
-
-```
+## Parameters
 
-The following arguments are supported:
+| Name                   | Type                            | Description                                                          | Default |
+|------------------------|---------------------------------|----------------------------------------------------------------------|---------|
+| `parsing_algorithm`    | `Literal["table-transformers"]` | The library used for parsing, in this case, table-transformers.      | None    |
+| `min_table_confidence` | `float`                         | The minimum confidence score for a table to be extracted.            | None    |
+| `min_cell_confidence`  | `float`                         | The minimum confidence score for a cell to be extracted.             | None    |
+| `table_output_format`  | `Literal["markdown", "html"]`   | The format of the extracted tables. Supports both markdown and html. | None    |
-
-- `parsing_algorithm` specifies the library used for parsing, in this case, `table-transformers`.
-- `min_table_confidence` specifies the minimum confidence score for a table to be extracted.
-- `min_cell_confidence` specifies the minimum confidence score for a cell to be extracted.
-- `table_output_format` specifies the format of the extracted tables.
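A hedged sketch of a `table-transformers` configuration (illustrative only, not part of the diff): the confidence thresholds shown are hypothetical values, and the parser calls are commented out since they require `openparse` and a real PDF.

```python
# Illustrative sketch: args mirroring the table-transformers parameters above.
# The 0.75 / 0.95 thresholds are hypothetical, not documented defaults.
table_args = {
    "parsing_algorithm": "table-transformers",
    "min_table_confidence": 0.75,
    "min_cell_confidence": 0.95,
    "table_output_format": "html",  # or "markdown"
}

# Hypothetical usage, requires openparse and a PDF on disk:
# parser = openparse.DocumentParser(table_args=table_args)
# parsed_doc = parser.parse("path/to/document.pdf")
```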
 ### Example
diff --git a/docs/parsing-tables/unitable.md b/docs/parsing-tables/unitable.md
index fae6e9c..a12e92f 100644
--- a/docs/parsing-tables/unitable.md
+++ b/docs/parsing-tables/unitable.md
@@ -16,22 +16,14 @@ Which will download the weights.
 They're about 1.5GB in size.
 
-## Usage
+## Parameters
-
-```python
-# Arguments follow the following schema
-class UnitableArgsDict(TypedDict, total=False):
-    parsing_algorithm: Literal["unitable"]
-    min_table_confidence: float
-    table_output_format: Literal["html"]
-
-```
-
-The following arguments are supported:
+
+| Name                   | Type                  | Description                                                       | Default |
+|------------------------|-----------------------|-------------------------------------------------------------------|---------|
+| `parsing_algorithm`    | `Literal["unitable"]` | The library used for parsing, in this case, unitable.             | None    |
+| `min_table_confidence` | `float`               | The minimum confidence score for a table to be extracted.         | 0.75    |
+| `table_output_format`  | `Literal["html"]`     | The format of the extracted tables. Currently only supports html. | None    |
-
-- `parsing_algorithm` specifies the library used for parsing, in this case, `unitable`.
-- `min_table_confidence` specifies the minimum confidence score for a table to be extracted. Default to 0.75.
-- `table_output_format` specifies the format of the extracted tables. Currently only suport html.
 
 ### Example
diff --git a/docs/parsing-text/overview.md b/docs/processing/ocr.md
similarity index 79%
rename from docs/parsing-text/overview.md
rename to docs/processing/ocr.md
index 666786b..442cb62 100644
--- a/docs/parsing-text/overview.md
+++ b/docs/processing/ocr.md
@@ -1,5 +1,3 @@
-Text processing is how we extract textual elements from within a doc and convert it to Markdown. The output are Nodes that represent distinct parts of the layout - like a heading or paragraph.
-
 ### 1. Default Text Processing with PdfMiner
 
 Use PdfMiner if your documents are text-heavy, well-structured, and do not contain non-text elements that require OCR.
diff --git a/docs/processing/overview.md b/docs/processing/overview.md
index 72dc5d0..cec5505 100644
--- a/docs/processing/overview.md
+++ b/docs/processing/overview.md
@@ -2,10 +2,14 @@ Processing is how we group related elements together to form a coherent structure. The output are Nodes that represent distinct sections of the document.
 
+
+
 ## 1. Default Processing
 
 By default, we use a simple heuristic to group elements together. This works well for many documents.
 
+These are mostly just common sense transforms - a heading should be grouped with the text that follows it, a bullet list should be grouped together, etc.
+
 ```python
 from openparse import DocumentParser
@@ -14,9 +18,9 @@ parser = DocumentParser()
 
 ## 2. Semantic Processing (Recommended)
 
-Chunking documents is fundamentally about grouping similar semantic nodes together. Perhaps the most powerful way to do this is to use embeddings. **By embedding the text of each node, we can then cluster them together based on their similarity.** 
+Chunking documents is fundamentally about grouping similar semantic nodes together. Perhaps the most powerful way to do this is to use embeddings. **By embedding the text of each node, we can then cluster them together based on their similarity.**
 
-We currently only support the OpenAI API to generate embeddings.
+We currently only support the OpenAI API to generate embeddings but plan on adding more options soon.
 
 ```python
 from openparse import processing, DocumentParser
@@ -33,9 +37,11 @@ parser = DocumentParser(
 
 parsed_content = parser.parse(basic_doc_path)
 ```
+If you're interested in understanding how this works, you can see a demo notebook [here](https://github.com/Filimoa/open-parse/blob/main/src/cookbooks/semantic_processing.ipynb).
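As an aside, the clustering idea described above (embed each node, then group by similarity) can be sketched with plain Python. This is a toy illustration only: the hypothetical three-dimensional vectors stand in for real embeddings from an API, and the greedy merge is a simplification of an actual pipeline.

```python
import math

def cosine(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Hypothetical node embeddings; a real pipeline would call an embeddings API.
nodes = {
    "heading":   [1.0, 0.1, 0.0],
    "paragraph": [0.9, 0.2, 0.1],  # similar to the heading -> same chunk
    "table":     [0.0, 0.1, 1.0],  # dissimilar -> its own chunk
}

# Greedy grouping: merge adjacent nodes whose similarity clears a threshold.
THRESHOLD = 0.8
names = list(nodes)
chunks = [[names[0]]]
for prev, cur in zip(names, names[1:]):
    if cosine(nodes[prev], nodes[cur]) >= THRESHOLD:
        chunks[-1].append(cur)
    else:
        chunks.append([cur])
```

Here the heading and its following paragraph end up in one chunk while the table stays separate, which is the behavior the semantic pipeline aims for.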
+
 #### Notes on Node Size:
 
-We have a bias towards chunking that results in larger nodes - models have increasingly large context windows and we find large nodes to provider bettter context for the model.
+We have a bias towards chunking that results in larger nodes - models have increasingly large context windows and we find large nodes perform better.
 
 A more thorough discussion can be found [here](https://www.llamaindex.ai/blog/evaluating-the-ideal-chunk-size-for-a-rag-system-using-llamaindex-6207e5d3fec5).
diff --git a/docs/visualization.md b/docs/visualization.md
index ee87da7..7b3bfc9 100644
--- a/docs/visualization.md
+++ b/docs/visualization.md
@@ -12,7 +12,10 @@ parsed_basic_doc = parser.parse(basic_doc_path)
 
 for node in parsed_basic_doc.nodes:
     display(node)
 ```
-
+
+
+
 You can also display the results directly overlayed on the original pdf.
 
 ```python