improved docs
Filimoa committed Apr 10, 2024
1 parent b8b5bcf commit bdf2055
Showing 7 changed files with 50 additions and 48 deletions.
22 changes: 18 additions & 4 deletions docs/parsing-tables/overview.md
@@ -1,9 +1,23 @@
Automatically identifying and extracting tables from PDF documents is a highly sought-after capability. It's an active area of research, and our goal is to give the community access to the most effective tools available.

**By default, table parsing is turned off.** It adds significant computational overhead, so we've made it optional.

We expose both cutting-edge deep learning techniques and traditional bounding box-based methods. Our approach is parsing-algorithm agnostic, allowing you to select the method that best suits your needs.

At the moment, we offer three options for extracting tables from PDFs: `unitable`, `pymupdf`, and `table-transformers`. Each method has its own advantages and limitations, so you can choose the one that best fits your documents.

```python hl_lines="2"
parser = openparse.DocumentParser(
    table_args={...}
)

# ingesting the document
parsed_10k = parser.parse(meta10k_path)
```
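As a concrete sketch, `table_args` is a dict that selects the parsing algorithm and its options. The values below use the `pymupdf` algorithm documented on the following pages; the file path is just a placeholder:

```python
import openparse

parser = openparse.DocumentParser(
    table_args={
        "parsing_algorithm": "pymupdf",
        "table_output_format": "markdown",  # pymupdf also supports "html"
    }
)

# "document.pdf" is a placeholder path
parsed_doc = parser.parse("document.pdf")
```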


!!! note "Become a Contributor"

    - If you have experience with quantizing models or optimizing them for inference, we would love to hear from you! Unitable achieves **state-of-the-art performance** on table extraction, but it is computationally expensive. We are looking to optimize the model for inference and reduce its size.
16 changes: 6 additions & 10 deletions docs/parsing-tables/pymupdf.md
@@ -4,17 +4,13 @@ With version 1.23.0, PyMuPDF has added table recognition and extraction facilities

We find it tends to work well on dense tables with a relatively simple structure. It's also very fast.

```python
from typing import Literal, TypedDict

# Table-parsing arguments follow this schema
class PyMuPDFArgsDict(TypedDict, total=False):
    parsing_algorithm: Literal["pymupdf"]
    table_output_format: Literal["markdown", "html"]
```

## Parameters

| Name | Type | Description | Default |
|------------------------|--------------------------------|------------------------------------------------------------------------|---------|
| `parsing_algorithm`    | `Literal["pymupdf"]`           | The library used for parsing, in this case, pymupdf.                    | None    |
| `table_output_format`  | `Literal["markdown", "html"]`  | The format of the extracted tables. Supports both markdown and html.    | 'html'  |

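For instance, a minimal sketch of passing these arguments (the Example section below shows full usage):

```python
import openparse

parser = openparse.DocumentParser(
    table_args={
        "parsing_algorithm": "pymupdf",
        "table_output_format": "markdown",  # or "html"
    }
)
```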

### Example

21 changes: 7 additions & 14 deletions docs/parsing-tables/table-transformers.md
@@ -5,22 +5,15 @@ Table Transformers is a deep learning approach to table detection and extraction

We find it works well on tables with more complex structures and significant whitespace.

```python
from typing import Literal, TypedDict

# Table-parsing arguments follow this schema
class TableTransformersArgsDict(TypedDict, total=False):
    parsing_algorithm: Literal["table-transformers"]
    min_table_confidence: float
    min_cell_confidence: float
    table_output_format: Literal["markdown", "html"]
```
## Parameters

| Name | Type | Description | Default |
|----------------------|--------------------------------------|--------------------------------------------------------------------------------------------------------------|---------|
| `parsing_algorithm` | `Literal["table-transformers"]` | The library used for parsing, in this case, table-transformers. | None |
| `min_table_confidence` | `float` | The minimum confidence score for a table to be extracted. | None |
| `min_cell_confidence` | `float` | The minimum confidence score for a cell to be extracted. | None |
| `table_output_format` | `Literal["markdown", "html"]` | The format of the extracted tables. Supports both markdown and html. | None |

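For instance, a minimal sketch of passing these arguments (the confidence thresholds here are illustrative values, not documented defaults; the Example section below shows full usage):

```python
import openparse

parser = openparse.DocumentParser(
    table_args={
        "parsing_algorithm": "table-transformers",
        "min_table_confidence": 0.75,   # illustrative threshold
        "min_cell_confidence": 0.95,    # illustrative threshold
        "table_output_format": "html",  # or "markdown"
    }
)
```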

### Example

20 changes: 6 additions & 14 deletions docs/parsing-tables/unitable.md
@@ -16,22 +16,14 @@ $ openparse-download
This will download the weights. They're about 1.5GB in size.


## Parameters

```python
from typing import Literal, TypedDict

# Table-parsing arguments follow this schema
class UnitableArgsDict(TypedDict, total=False):
    parsing_algorithm: Literal["unitable"]
    min_table_confidence: float
    table_output_format: Literal["html"]
```

| Name | Type | Description | Default |
|----------------------|-------------------------|-----------------------------------------------------------------------------|---------|
| `parsing_algorithm` | `Literal["unitable"]` | The library used for parsing, in this case, unitable. | None |
| `min_table_confidence` | `float` | The minimum confidence score for a table to be extracted. | 0.75 |
| `table_output_format` | `Literal["html"]` | The format of the extracted tables. Currently only supports html. | None |

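For instance, a minimal sketch of passing these arguments (the Example section below shows full usage):

```python
import openparse

parser = openparse.DocumentParser(
    table_args={
        "parsing_algorithm": "unitable",
        "min_table_confidence": 0.75,   # defaults to 0.75
        "table_output_format": "html",  # unitable currently only outputs html
    }
)
```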


### Example
2 changes: 0 additions & 2 deletions docs/parsing-text/overview.md → docs/processing/ocr.md
@@ -1,5 +1,3 @@
Text processing is how we extract textual elements from within a doc and convert them to Markdown. The output is a set of Nodes that represent distinct parts of the layout - like a heading or paragraph.

### 1. Default Text Processing with PdfMiner
Use PdfMiner if your documents are text-heavy, well-structured, and do not contain non-text elements that require OCR.

12 changes: 9 additions & 3 deletions docs/processing/overview.md
@@ -2,10 +2,14 @@

Processing is how we group related elements together to form a coherent structure. The output is a set of Nodes that represent distinct sections of the document.

<img src="https://sergey-filimonov.nyc3.digitaloceanspaces.com/open-parse/marketing/open-parse-architecture.png">

## 1. Default Processing

By default, we use a simple heuristic to group elements together. This works well for many documents.

These are mostly just common-sense transforms - a heading should be grouped with the text that follows it, a bullet list should be kept together, etc.

```python
from openparse import DocumentParser

parser = DocumentParser()
```

## 2. Semantic Processing (Recommended)

Chunking documents is fundamentally about grouping similar semantic nodes together. Perhaps the most powerful way to do this is to use embeddings. **By embedding the text of each node, we can then cluster them together based on their similarity.**
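To illustrate the idea, here is a conceptual sketch (not Open Parse's actual implementation) of greedily merging consecutive nodes whose embeddings are similar; the `embed` argument is a stand-in for any text-embedding function:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def group_nodes(node_texts: list[str], embed, threshold: float = 0.8) -> list[list[str]]:
    """Greedily merge consecutive nodes whose embeddings are similar.

    `embed` is a placeholder for any embedding call (e.g. the OpenAI API).
    """
    if not node_texts:
        return []
    groups = [[node_texts[0]]]
    prev_vec = embed(node_texts[0])
    for text in node_texts[1:]:
        vec = embed(text)
        if cosine_similarity(prev_vec, vec) >= threshold:
            groups[-1].append(text)  # similar enough: keep in the same chunk
        else:
            groups.append([text])    # dissimilar: start a new chunk
        prev_vec = vec
    return groups
```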

We currently only support the OpenAI API for generating embeddings, but we plan on adding more options soon.

```python
from openparse import processing, DocumentParser
# ... (DocumentParser is configured with a semantic processing pipeline here; see the cookbook linked below) ...
parsed_content = parser.parse(basic_doc_path)
```

If you're interested in understanding how this works, you can see a demo notebook [here](https://github.com/Filimoa/open-parse/blob/main/src/cookbooks/semantic_processing.ipynb).

#### Notes on Node Size:

We have a bias towards chunking that results in larger nodes - models have increasingly large context windows and we find larger nodes perform better.

A more thorough discussion can be found [here](https://www.llamaindex.ai/blog/evaluating-the-ideal-chunk-size-for-a-rag-system-using-llamaindex-6207e5d3fec5).

5 changes: 4 additions & 1 deletion docs/visualization.md
@@ -12,7 +12,10 @@ parsed_basic_doc = parser.parse(basic_doc_path)

```python
for node in parsed_basic_doc.nodes:
    display(node)
```

<br/>
<p align="center">
<img src="https://sergey-filimonov.nyc3.digitaloceanspaces.com/open-parse/marketing/pretty-markdown-nodes.webp" width="650" />
</p>
You can also display the results directly overlaid on the original PDF.

```python
