[Bug]: When the layout model recognizes boxes with overlapping parts, it can lead to a large number of pdf text loss #1328

Open
1 task done
papandadj opened this issue Jul 1, 2024 · 4 comments
Labels
bug Something isn't working

Comments

@papandadj

Is there an existing issue for the same bug?

  • I have checked the existing issues.

Branch name

any

Commit ID

any

Other environment information

No response

Actual behavior

Recently, I discovered a significant data loss issue when parsing PDF files. After debugging the related code, I found that when the layout model recognizes many boxes, if one of two boxes a and b contains the other and the pair does not satisfy the condition

if Recognizer.overlapped_area(layouts[i], layouts[j]) < thr \
       and Recognizer.overlapped_area(layouts[j], layouts[i]) < thr

it leads to a large amount of data being deleted.

For example, the layout model recognizes the following boxes:
image

The two red boxes overlap, but they do not meet the above condition, which will result in the box with the lower score being deleted.
image
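To make the condition concrete, here is a minimal sketch of the two ratios for a containment case. It assumes the ratio is the intersection area divided by the area of the first box. The helper `overlap_ratio`, the box keys, and `thr = 0.7` are illustrative assumptions, not the project's actual code or values.

```python
def overlap_ratio(a, b):
    """Share of box a's area that is covered by box b (assumed definition)."""
    w = min(a["x1"], b["x1"]) - max(a["x0"], b["x0"])
    h = min(a["bottom"], b["bottom"]) - max(a["top"], b["top"])
    if w <= 0 or h <= 0:
        return 0.0  # the boxes do not intersect
    area_a = (a["x1"] - a["x0"]) * (a["bottom"] - a["top"])
    return (w * h) / area_a if area_a > 0 else 0.0


# Hypothetical boxes: a small box fully inside a large one.
large = {"x0": 0, "x1": 100, "top": 0, "bottom": 200}
small = {"x0": 10, "x1": 60, "top": 20, "bottom": 60}
thr = 0.7  # assumed threshold, for illustration only

small_in_large = overlap_ratio(small, large)  # the small box is fully covered
large_in_small = overlap_ratio(large, small)  # only a fraction of the large box

# The pair is left alone only when BOTH ratios are below thr.
skipped = small_in_large < thr and large_in_small < thr
print(small_in_large, large_in_small, skipped)
```

Because the small box is fully covered, its ratio is 1.0, so the pair is not skipped and one of the two boxes is removed.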

I found that other people seem to have encountered this problem as well, but I’m not sure if it’s the same issue.
#1057

The overlap ratio of the small red box in the image above is very close to 1.

Would changing the less-than sign to a greater-than sign here be better? That is to say, if the overlapping part occupies the vast majority of both boxes’ areas, then one of them needs to be deleted.

Like this:

  if Recognizer.overlapped_area(layouts[i], layouts[j]) > thr \
         or Recognizer.overlapped_area(layouts[j], layouts[i]) > thr:
      pass  # delete the box with the smaller area
  else:
      i += 1
      continue

But would this cause a lot of data duplication? I’m not sure whether the layout model filters out such cases.
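A minimal sketch of the proposed rule, under stated assumptions: the helpers `overlap_ratio`, `area`, and `dedup_by_area`, the box keys, and the default `thr` are all hypothetical and not taken from the project. Note that skipping a pair when both ratios are below `thr` already means the pair is processed when either ratio is at least `thr`, so apart from the equality case the practical difference in this proposal is that the smaller-area box is removed rather than the lower-scoring one.

```python
def area(box):
    return (box["x1"] - box["x0"]) * (box["bottom"] - box["top"])


def overlap_ratio(a, b):
    """Share of box a's area that is covered by box b (assumed definition)."""
    w = min(a["x1"], b["x1"]) - max(a["x0"], b["x0"])
    h = min(a["bottom"], b["bottom"]) - max(a["top"], b["top"])
    if w <= 0 or h <= 0:
        return 0.0
    return (w * h) / area(a) if area(a) > 0 else 0.0


def dedup_by_area(boxes, thr=0.7):
    """Drop the smaller box of any pair where either overlap ratio exceeds thr."""
    boxes = list(boxes)  # work on a copy
    i = 0
    while i < len(boxes):
        j = i + 1
        removed_i = False
        while j < len(boxes):
            a, b = boxes[i], boxes[j]
            if overlap_ratio(a, b) > thr or overlap_ratio(b, a) > thr:
                if area(a) >= area(b):
                    boxes.pop(j)  # b is smaller: drop it, keep comparing a
                    continue
                boxes.pop(i)  # a is smaller: drop it, restart at the same index
                removed_i = True
                break
            j += 1
        if not removed_i:
            i += 1
    return boxes


large = {"x0": 0, "x1": 100, "top": 0, "bottom": 200, "score": 0.6}
small = {"x0": 10, "x1": 60, "top": 20, "bottom": 60, "score": 0.9}
other = {"x0": 0, "x1": 100, "top": 300, "bottom": 400, "score": 0.8}

kept = dedup_by_area([large, small, other])
print([area(b) for b in kept])
```

With these toy boxes the large box survives even though its score is lower. Whether this introduces duplicated text, as asked above, depends on how text is later assigned to the surviving boxes, which this sketch does not model.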

Expected behavior

No response

Steps to reproduce

null

Additional information

No response

@papandadj papandadj added the bug Something isn't working label Jul 1, 2024
@cyhasuka

cyhasuka commented Jul 1, 2024

I've got the same issue.

@awesomeboy2

Same issue.

@KevinHuSh
Collaborator

It will not lead to the loss of text boxes. Text boxes are removed only when they are identified as reference/header/footer.

@papandadj
Author

I don’t think that’s the case. In the code, when two boxes overlap, with a large box covering a small one, and the score of the small box is higher than that of the large box, the large box is deleted.

test2.pdf

image

Could you debug this file and set a breakpoint at the line marked with a red line to test it?
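The case described in this comment can also be illustrated without the PDF. The following toy sketch keeps the higher-scoring box of an overlapping pair, which is the behavior reported in this issue. `overlap_ratio`, `resolve_pair_by_score`, the box keys, the scores, and `thr` are hypothetical and only approximate the logic discussed here. It is not the project's implementation.

```python
def overlap_ratio(a, b):
    """Share of box a's area that is covered by box b (assumed definition)."""
    w = min(a["x1"], b["x1"]) - max(a["x0"], b["x0"])
    h = min(a["bottom"], b["bottom"]) - max(a["top"], b["top"])
    if w <= 0 or h <= 0:
        return 0.0
    area_a = (a["x1"] - a["x0"]) * (a["bottom"] - a["top"])
    return (w * h) / area_a if area_a > 0 else 0.0


def resolve_pair_by_score(a, b, thr=0.7):
    """Return the boxes that survive: both if barely overlapping, else the higher score."""
    if overlap_ratio(a, b) < thr and overlap_ratio(b, a) < thr:
        return [a, b]
    return [a] if a["score"] > b["score"] else [b]


# A large box covering a small one, where the small box has the higher score.
large = {"x0": 0, "x1": 100, "top": 0, "bottom": 200, "score": 0.6}
small = {"x0": 10, "x1": 60, "top": 20, "bottom": 60, "score": 0.9}

survivors = resolve_pair_by_score(large, small)
print(survivors)
```

Only the small box survives here, so any text that was covered by the large box alone would no longer belong to a layout region.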
