Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MRG] Utils: optimise get_page_layout #5

Merged
merged 1 commit into from
Feb 28, 2024

Conversation

karlowich
Copy link

Since the existing code overwrites layout and dim in each iteration, it is much more efficient to simply return the layout and dim of the first page.

I have tested the difference with a 455 page pdf and the optimisation reduces the time spent from 50 to 5 seconds.

Since the existing code overwrites `layout` and `dim` in each iteration,
it is much more efficient to simply return the `layout` and `dim` of the
first page.

I have tested the difference with a 455 page pdf and the optimisation
reduces the time spent from 50 to 5 seconds.

Signed-off-by: Karl Bonde Torp <[email protected]>
@foarsitter
Copy link
Collaborator

foarsitter commented Feb 27, 2024

The current behavior indeed does not make any sense.

If it only returns the layout of the first page then I would suggest to rename it to get_layout_from_first_page or something similar.

The function is called in the _save_page method from PDFHandler. _save_page is called for each page. So I suggest adding cache for PDFHandler to.

Example:

from functools import lru_cache

class PDFHandler:
   ...
    @lru_cache
    def get_layout_from_first_page(self):
        return get_page_layout(self.filepath)

What do you think?

An extra bonus would be to add type hinting for the method. But that is something we didn't discus yet with the team.

@karlowich
Copy link
Author

@foarsitter I looked at this code 2 years ago. To be honest I can't recall much about the code base, feel free to modify this PR in any way you see fitting :)

Copy link
Collaborator

@foarsitter foarsitter left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In that case I approve this PR

@MartinThoma
Copy link
Member

@foarsitter @bosd How do you want to handle merging?

I would suggest that one of you takes the lead and merges (in general, not only for this PR). People need to see that this repo is active and changes get merged.

From my experience with taking over the maintenance of pypdf, my priorities would be:

  1. Ensure the test suite covers the important parts + CI runs / blocks merging if it fails
  2. Merge many things rather quickly in the beginning to show people that this repo is active again.
  3. Once the first batch of changes get lower, apply higher standards and potentially take more time discussing PRs.

One thing that you might want to use is the fact that pypdf-table-extraction is still in 0.x version. So a bit of instability might be ok.

Having said that: That's your project :-) I'm here to support, but it's fine for me if you develop/run pypdf-table-extraction in a different way :-)

@bosd
Copy link
Collaborator

bosd commented Feb 27, 2024

How do you want to handle merging?
Good question. Ideally I would like to see multiple approvals on a PR.
Yet, even in some bigger projects who enforce a 2 approval system, I sadly see PR's go stale.
So, my take on this would be rule of at least 1 approving review and passing tests.

I'm with you on all 3 points.

In regards to this PR. I have looked at the code changes, LGTM. Did'nt press approve of merge button due to no test mechanisms in place.
As you mentiond, it is 0 version So preferring speed of implementation..

Implementing the (cookiecutter) / repo setup is a priority to me.
Ironically, the earlier mentioned awesome-python cookiecutter template is in need of some maintainer ❤️ as well.
(Support for recent python versions have not been added.)

@bosd
Copy link
Collaborator

bosd commented Feb 27, 2024

@karlowich Thanks for this PR. Speed results are impressive.

Would this change, alter the functionality?
E.g. A PDF with an A4 size cover page and with follow up pages on A3 size which contain the tables.
Will the pages after the cover page still be parsed at the A3 size?

@karlowich
Copy link
Author

@bosd Yes this will change the functionality slightly. Before, the code would overwrite the dimensions for every page and ultimately return the dimensions of the last page. My code returns the dimensions of the first page.

Neither can correctly handle PDFs with different size pages.

@foarsitter
Copy link
Collaborator

@bosd this PR uses the layout of the first page instead of the last page. Either way it will be a problem if a pdf contains various sizes.

To be backwards compatible we need to use the last item of the collection I guess.

@bosd
Copy link
Collaborator

bosd commented Feb 28, 2024

Neither can correctly handle PDFs with different size pages.

Thanks for the clarification. The support for handling multiple pages sizes might be something for a future PR.
This one is ok for me now.

To be backwards compatible we need to use the last item of the collection I guess.

I don't see why this would cause backward compatability options. LGTM now.

@foarsitter
Copy link
Collaborator

I don't see why this would cause backward compatibility options. LGTM now.

Because we are moving from last page to first page, but that IMHO is a minor detail.

I will merge it :)

Thanks @karlowich for submitting your PR and your quick responses!

@foarsitter foarsitter merged commit 706ea1a into py-pdf:main Feb 28, 2024
@bosd bosd added the performance Performance label Aug 26, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance Performance
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants