Repeating characters #71

samkit-jain · 2018-07-31T05:17:38Z

I'm facing a weird problem in which characters are repeated when using extract_text() or extract_tables(). For example, SSttaatteemmeenntt ooff AAccccoouunnttss is printed instead of Statement of Accounts.

Sometimes it happens in a portion of the PDF and sometimes in the whole PDF. When it happens in a portion of the PDF, it can be partially fixed via extract_text(x_tolerance=0, y_tolerance=0), but not when the issue affects the whole PDF. Also, note that I face this issue only in some PDFs, not all.

Lines are also repeated. For example:

Year-to-date totals do not reflect any fee or interest refunds
Year-to-date totals do not reflect any fee or interest refunds
you may have received.
you may have received.

samkit-jain · 2018-07-31T06:50:11Z

On doing first_page.extract_words(x_tolerance=0, y_tolerance=0), there are two instances of a single word

{'x0': Decimal('231.532'), 'x1': Decimal('252.251'), 'top': Decimal('916.343'), 'bottom': Decimal('925.422'), 'text': 'reflect'}
{'x0': Decimal('231.533'), 'x1': Decimal('252.252'), 'top': Decimal('916.383'), 'bottom': Decimal('925.462'), 'text': 'reflect'}

And repeating characters are still present for some words,

{'x0': Decimal('489.040'), 'x1': Decimal('506.160'), 'top': Decimal('269.320'), 'bottom': Decimal('277.480'), 'text': 'ttooddaayy'}
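
For words like 'ttooddaayy' that survive the tolerance settings, one stopgap is to clean up the extracted string itself. The sketch below is only a heuristic and not part of pdfplumber: the helper names `is_doubled` and `collapse_doubled` are hypothetical, and the approach assumes a word is affected only when every one of its characters appears exactly twice in a row. Genuinely doubled tokens (such as "aa" or "11") would be collapsed by mistake, so it is probably best limited to documents already known to be affected.

```python
def is_doubled(text: str) -> bool:
    """Return True if `text` consists entirely of back-to-back character pairs."""
    # Even-indexed and odd-indexed characters must spell the same string.
    return len(text) >= 2 and len(text) % 2 == 0 and text[0::2] == text[1::2]


def collapse_doubled(text: str) -> str:
    """Collapse a fully doubled word; return anything else unchanged."""
    return text[0::2] if is_doubled(text) else text


if __name__ == "__main__":
    for word in ["ttooddaayy", "reflect"]:
        print(word, "->", collapse_doubled(word))
```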

jsvine · 2018-08-01T00:37:00Z

That's strange, indeed. My hunch is that there really are two copies of each letter in the PDF. (One set of letters might be transparent, perhaps?) What happens if you try extracting the text with another tool, such as poppler-utils's pdftotext? (https://en.wikipedia.org/wiki/Pdftotext)

samkit-jain · 2018-08-01T05:03:05Z

No such problem with pdftotext. This is the output,

No repeating lines

Year-to-date totals do not reflect any fee or interest refunds
you may have received.

No repeating characters

today

Statement of Accounts

jsfenfen · 2018-08-01T16:46:53Z

I've encountered this problem as well. In my case it was cropping up in fillable pdfs, and I theorized that the folks filling out the pdf were somehow resaving it on top of the original text. I found it was easier to just remove duplicate characters via script than make sense of the pdf. I dunno for sure, I suspect that other pdf output tools are removing duplicate characters.

I'm not really sure what the right solution is, but possibly adding a 'remove duplicate characters' option would make this more manageable? My case involved exact matches--characters occurring at exactly the same spot--so a fix was easy... I suppose if they were slightly offset it would be more challenging.
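
A minimal sketch of the exact-match cleanup described above, assuming each character is a dict with `text`, `x0`, and `top` keys (the shape pdfplumber exposes through `page.chars`). The function name `drop_exact_duplicates` is hypothetical, and this only handles characters drawn at exactly the same spot; slightly offset copies would pass through untouched.

```python
def drop_exact_duplicates(chars):
    """Keep the first character seen at each (text, x0, top) position."""
    seen = set()
    unique = []
    for char in chars:
        key = (char["text"], char["x0"], char["top"])
        if key not in seen:
            seen.add(key)
            unique.append(char)
    return unique


if __name__ == "__main__":
    # Made-up sample data: "S" is drawn twice at the same position.
    sample = [
        {"text": "S", "x0": 10.0, "top": 5.0},
        {"text": "S", "x0": 10.0, "top": 5.0},
        {"text": "t", "x0": 16.0, "top": 5.0},
    ]
    print("".join(c["text"] for c in drop_exact_duplicates(sample)))
```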

NaveenBandi · 2019-04-11T05:53:09Z

Getting same issue, please pass some resolution

BryanKoo · 2020-08-21T18:38:38Z

AFAIK, duplicated characters are also used to render bold text, and in those cases the copies can be slightly offset.
Deduplication is possible by checking the overlap ratio of all characters using their coordinates.
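
One way the overlap-ratio idea could look, as a sketch under stated assumptions: characters are dicts with `text`, `x0`, `x1`, `top`, and `bottom` keys, the ratio is defined here as intersection area divided by the smaller box's area, and the 0.8 threshold is an arbitrary illustrative value. The names `overlap_ratio` and `dedupe_by_overlap` are hypothetical. The comparison is quadratic in the number of characters, which may matter on dense pages.

```python
def _area(char):
    return (char["x1"] - char["x0"]) * (char["bottom"] - char["top"])


def overlap_ratio(a, b):
    """Intersection area of two character boxes over the smaller box's area."""
    width = min(a["x1"], b["x1"]) - max(a["x0"], b["x0"])
    height = min(a["bottom"], b["bottom"]) - max(a["top"], b["top"])
    if width <= 0 or height <= 0:
        return 0.0  # boxes do not intersect
    smaller = min(_area(a), _area(b))
    return (width * height) / smaller if smaller > 0 else 0.0


def dedupe_by_overlap(chars, threshold=0.8):
    """Drop a character if it heavily overlaps an already kept one with the same text."""
    kept = []
    for char in chars:
        duplicate = any(
            k["text"] == char["text"] and overlap_ratio(k, char) >= threshold
            for k in kept
        )
        if not duplicate:
            kept.append(char)
    return kept
```

Because only same-text characters are compared, a genuine double letter (two adjacent "t" glyphs whose boxes barely touch) is kept, while a bold-effect copy shifted by a fraction of a point is dropped.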

tiagosamaha · 2020-08-24T11:01:03Z

Any solution to it? I have the same issue.

hannylicious · 2020-08-31T17:36:01Z

I recently stumbled across this issue - just tossing it out there to let folks know it's a continuing thing.

samkit-jain · 2020-08-31T18:23:51Z

@hannylicious and other watchers of this issue, if you have a PDF with this issue that you can share publicly, please do so that this issue can be investigated in further detail.

I am pretty sure I have a PDF with this issue but it will take me some time to find it.

hannylicious · 2020-08-31T19:17:38Z

Unfortunately - I dabble with PDF's very infrequently. I just happened across it this time because another library (pyPDF2) didn't see any text at all - whereas pdfplumber saw the text, but it was duplicated. The PDF I'm working with at this time has some information that I can't publicly display so I won't be of much assistance I'm afraid.

I resolved my use case simply by grabbing the first of the results and using that.

Pdfplumber is a great tool - I will most likely be using this from now on! If I run across this issue on a PDF that I can link up - I definitely will!

jsvine · 2020-09-01T02:04:37Z

Thanks, @hannylicious! If you have the time, you could try using https://github.com/JoshData/pdf-redactor to remove the sensitive information without altering the PDF structure. If the result still produces the same character-duplication, then it could be very helpful for resolving this issue.

hannylicious · 2020-09-01T20:23:35Z

Thanks @jsvine - I will definitely have a look at that pdf-redactor library. If it works - I'll be sure and post that PDF here!

tiagosamaha · 2020-09-01T20:31:15Z

I would like to help, but my file has confidential content. Does anyone have a file that shows the issue?

pajaskowiak · 2020-09-11T20:08:24Z

Same issue here.

jsvine · 2020-09-26T17:05:02Z

@pajaskowiak Can you share a PDF that demonstrates the issue?

xv44586 · 2020-09-28T03:19:43Z

repeat.pdf
Getting the same issue; the PDF is repeat.pdf

mkl-public · 2020-09-28T07:43:20Z

The duplicate text indeed is drawn twice in the PDF, the second time with a small horizontal offset to create the appearance of a bold font.
Actually, though, this PDF gives a hint that the second copy shall be ignored by marking it with an empty ActualText property. By evaluating that property, therefore, pdfplumber could correctly extract this PDF.

jsvine · 2020-09-29T02:06:55Z

Many thanks @xv44586 and @mkl-public. This is helpful. Given the way pdfminer.six parses PDFs, I'm not sure we can easily pick up on that ActualText property. And I'm not certain all PDFs would provide the same hinting. But one way we could potentially handle this is by adding an option to remove all characters that are, effectively, duplicates — perhaps those with same text, fontname, and size and within a few x/y points of the "original" character.
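
The option floated above could be sketched roughly as follows. This is an illustration of the idea, not the code that later landed in pdfplumber: the function name `remove_near_duplicates` is hypothetical, characters are assumed to be dicts with `text`, `fontname`, `size`, `x0`, and `top` keys, and "a few x/y points" is represented by a single `tolerance` value.

```python
def remove_near_duplicates(chars, tolerance=1.0):
    """Drop characters that match an earlier one in text, fontname, and size
    and sit within `tolerance` points of it on both axes."""
    kept = []
    for char in chars:
        duplicate = any(
            k["text"] == char["text"]
            and k["fontname"] == char["fontname"]
            and k["size"] == char["size"]
            and abs(k["x0"] - char["x0"]) <= tolerance
            and abs(k["top"] - char["top"]) <= tolerance
            for k in kept
        )
        if not duplicate:
            kept.append(char)
    return kept
```

The first copy of each character is treated as the "original", so the reading order of the remaining characters is preserved.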

mkl-public · 2020-09-29T07:00:33Z

Indeed, there are many PDFs out there drawing text twice for some visual effect (bold, shadow, ...) but by far not all of them use ActualText to mark one copy as ignorable like @xv44586's example file does. Thus, finding duplicates explicitly will help more often in this regard than checking the ActualText.

pajaskowiak · 2020-09-29T12:42:41Z

> @pajaskowiak Can you share a PDF that demonstrates the issue?

I'm really sorry but I can't. It contains sensitive information.

pajaskowiak · 2020-09-29T12:44:33Z

> Many thanks @xv44586 and @mkl-public. This is helpful. Given the way pdfminer.six parses PDFs, I'm not sure we can easily pick up on that ActualText property. And I'm not certain all PDFs would provide the same hinting. But one way we could potentially handle this is by adding an option to remove all characters that are, effectively, duplicates — perhaps those with same text, fontname, and size and within a few x/y points of the "original" character.

I did something similar to this. Anyways, I could fix the duplicates in my own code. Having the text from the pdf, even with eventual duplicates is a big help already! Thank you for the project!

@xv44586

h/t @xv44586 for the initial inspiration 👍 These new methods return a version of the chars/page with duplicate chars — those sharing the same text, fontname, size, and positioning (within `tolerance` x/y) as other characters — removed.

jsvine · 2020-10-03T16:17:28Z

Commit 04fd56a (available in develop and in the next release) provides a Page.dedupe_chars(...) method that should address this general type of character duplication. (Thanks to @xv44586 for the PDF and test!) I'm closing this issue for now, but if anyone encounters character-duplication issues that the new method does not solve, feel free to comment on this thread. Priority will be given to comments containing a specific PDF and code that demonstrate the problem.
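
For readers landing here, a short usage sketch of the new method. It assumes a pdfplumber release that includes `Page.dedupe_chars`; the wrapper name `extract_deduped_text` and the file name `statement.pdf` are hypothetical, and `tolerance=1` is simply an illustrative value.

```python
def extract_deduped_text(pdf_path, page_number=0, tolerance=1):
    """Extract text from one page after removing duplicated characters."""
    # Third-party dependency, imported here so this module still loads
    # in environments where pdfplumber is not installed.
    import pdfplumber

    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[page_number]
        # dedupe_chars returns a filtered version of the page, so the
        # usual extraction methods can be chained onto it.
        return page.dedupe_chars(tolerance=tolerance).extract_text()


if __name__ == "__main__":
    # "statement.pdf" is a placeholder path for an affected document.
    print(extract_deduped_text("statement.pdf"))
```

If duplicates remain, increasing `tolerance` slightly may help when the second copy is offset by more than a point, at the risk of dropping genuinely adjacent identical characters.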

When using pdfplumber, some documents may be parsed incorrectly, resulting in duplicated characters. Add a `dedupe` parameter to remove duplicated characters. See issue #71 of pdfplumber: jsvine/pdfplumber#71

…ader` (#10165)

(Reopen PR #7706, hope this problem can fix.)

When using `pdfplumber`, some documents may be parsed incorrectly, resulting in **duplicated characters**.

Taking the [linked](https://bruusgaard.no/wp-content/uploads/2021/05/Datasheet1000-series.pdf) document as an example:

## Before

```python
from langchain.document_loaders import PDFPlumberLoader

pdf_file = 'file.pdf'
loader = PDFPlumberLoader(pdf_file)
docs = loader.load()
print(docs[0].page_content)
```

Results:

```
11000000 SSeerriieess PPoorrttaabbllee ssiinnggllee ggaass ddeetteeccttoorrss ffoorr HHyyddrrooggeenn aanndd CCoommbbuussttiibbllee ggaasseess TThhee RRiikkeenn KKeeiikkii GGPP--11000000 iiss aa ccoommppaacctt aanndd lliigghhttwweeiigghhtt ggaass ddeetteeccttoorr wwiitthh hhiigghh sseennssiittiivviittyy ffoorr tthhee ddeetteeccttiioonn ooff hhyyddrrooccaarrbboonnss.. TThhee mmeeaassuurreemmeenntt iiss ppeerrffoorrmmeedd ffoorr tthhiiss ppuurrppoossee bbyy mmeeaannss ooff ccaattaallyyttiicc sseennssoorr.. TThhee GGPP--11000000 hhaass aa bbuuiilltt--iinn ppuummpp wwiitthh ppuummpp bboooosstteerr ffuunnccttiioonn aanndd aa ddiirreecctt sseelleeccttiioonn ffrroomm aa lliisstt ooff 2255 hhyyddrrooccaarrbboonnss ffoorr eexxaacctt aalliiggnnmmeenntt ooff tthhee ttaarrggeett ggaass -- OOnnllyy ccaalliibbrraattiioonn oonn CCHH iiss nneecceessssaarryy.. 44 FFeeaattuurreess TThhee RRiikkeenn KKeeiikkii 110000vvvvttaabbllee ssiinnggllee HHyyddrrooggeenn aanndd CCoommbbuussttiibbllee ggaass ddeetteeccttoorrss.. TThheerree aarree 33 ssttaannddaarrdd mmooddeellss:: GGPP--11000000:: 00--1100%%LLEELL // 00--110000%%LLEELL ›› LLEELL ddeetteeccttoorr NNCC--11000000:: 00--11000000ppppmm // 00--1100000000ppppmm ›› PPPPMM ddeetteeccttoorr DDiirreecctt rreeaaddiinngg ooff tthhee ccoonncceennttrraattiioonn vvaalluueess ooff ccoommbbuussttiibbllee ggaasseess ooff 2255 ggaasseess ((55 NNPP--11000000))..
EEaassyy ooppeerraattiioonn ffeeaattuurree ooff cchhaannggiinngg tthhee ggaass nnaammee ddiissppllaayy wwiitthh 11 sswwiittcchh bbuuttttoonn.. LLoonngg ddiissttaannccee ddrraawwiinngg ppoossssiibbllee wwiitthh tthhee ppuummpp bboooosstteerr ffuunnccttiioonn.. VVaarriioouuss ccoommbbuussttiibbllee ggaasseess ccaann bbee mmeeaassuurreedd bbyy tthhee ppppmm oorrddeerr wwiitthh NNCC--11000000.. www.bruusgaard.no [email protected] +47 67 54 93 30 Rev: 446-2
```

We can see that there are a large number of duplicated characters in the text, which can cause issues in subsequent applications.

## After

Therefore, based on the [solution](jsvine/pdfplumber#71) provided by the `pdfplumber` source project. I added the `"dedupe_chars()"` method to address this problem. (Just pass the parameter `dedupe` to `True`)

```python
from langchain.document_loaders import PDFPlumberLoader

pdf_file = 'file.pdf'
loader = PDFPlumberLoader(pdf_file, dedupe=True)
docs = loader.load()
print(docs[0].page_content)
```

Results:

```
1000 Series Portable single gas detectors for Hydrogen and Combustible gases The Riken Keiki GP-1000 is a compact and lightweight gas detector with high sensitivity for the detection of hydrocarbons. The measurement is performed for this purpose by means of catalytic sensor. The GP-1000 has a built-in pump with pump booster function and a direct selection from a list of 25 hydrocarbons for exact alignment of the target gas - Only calibration on CH is necessary. 4 Features The Riken Keiki 100vvtable single Hydrogen and Combustible gas detectors. There are 3 standard models: GP-1000: 0-10%LEL / 0-100%LEL › LEL detector NC-1000: 0-1000ppm / 0-10000ppm › PPM detector Direct reading of the concentration values of combustible gases of 25 gases (5 NP-1000). Easy operation feature of changing the gas name display with 1 switch button. Long distance drawing possible with the pump booster function. Various combustible gases can be measured by the ppm order with NC-1000.
www.bruusgaard.no [email protected] +47 67 54 93 30 Rev: 446-2
```

---------

Co-authored-by: Bagatur <[email protected]>

samkit-jain added the awaiting-code-or-pdf Issues and PRs awaiting code and/or a PDF from issue/PR-author label Aug 21, 2020

xv44586 mentioned this issue Sep 29, 2020

fix duplicates in extract_text/extract_words/extract_tables #280

Closed

jsvine closed this as completed Oct 3, 2020

samkit-jain mentioned this issue Oct 20, 2020

Troubles with subscripts #292

Closed

jsvine removed the awaiting-code-or-pdf Issues and PRs awaiting code and/or a PDF from issue/PR-author label Apr 27, 2022

Lin-jun-xiang mentioned this issue Jul 14, 2023

Fix the duplicate characters when using pdfplumber loader langchain-ai/langchain#7706

Closed

Lin-jun-xiang mentioned this issue Sep 4, 2023

Fix: the duplicate characters wrong results when using pdfplumber loader langchain-ai/langchain#10165

Merged

felix-hh mentioned this issue Mar 14, 2024

Custom deduppe_chars char properties #1114

Open
