[MRG] Utils: optimise get_page_layout #5
Since the existing code overwrites `layout` and `dim` in each iteration, it is much more efficient to simply return the `layout` and `dim` of the first page. I have tested the difference with a 455-page PDF, and the optimisation reduces the time spent from 50 to 5 seconds.

Signed-off-by: Karl Bonde Torp <[email protected]>
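To make the difference concrete without pulling in a PDF library, here is a minimal sketch. The names `fake_pages`, `layout_last_page`, and `layout_first_page` are hypothetical stand-ins and not the project's actual code; the generator only records how many pages get processed, which is where the time goes.

```python
def fake_pages(n, processed):
    """Hypothetical stand-in for a PDF page iterator; records each page it yields."""
    for i in range(n):
        processed.append(i)  # simulates the expensive per-page layout analysis
        yield {"layout": f"layout-{i}", "dim": (612, 792)}


def layout_last_page(pages):
    """Old behaviour: overwrite on every iteration, return the last page's values."""
    layout, dim = None, None
    for page in pages:
        layout, dim = page["layout"], page["dim"]
    return layout, dim


def layout_first_page(pages):
    """New behaviour: return as soon as the first page has been processed."""
    for page in pages:
        return page["layout"], page["dim"]
    return None, None  # empty document


seen_old, seen_new = [], []
layout_last_page(fake_pages(455, seen_old))
layout_first_page(fake_pages(455, seen_new))
print(len(seen_old), len(seen_new))  # 455 1
```

Because the page source is lazy, returning early means the remaining 454 pages are never analysed at all, which is consistent with the roughly tenfold speed-up reported above.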
The current behavior indeed does not make any sense. If it only returns the layout of the first page then I would suggest to rename it to The function is called in the Example:

```python
from functools import lru_cache


class PDFHandler:
    ...

    @lru_cache
    def get_layout_from_first_page(self):
        # Cache the result so the first page is only analysed once per handler.
        return get_page_layout(self.filepath)
```

What do you think? An extra bonus would be to add type hinting for the method. But that is something we haven't discussed with the team yet.
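For anyone unsure how `lru_cache` behaves on a method, here is a small self-contained sketch. `Handler` and its `calls` counter are hypothetical and exist only to show that the body runs once per instance; the returned tuple is a placeholder, not real layout data. Note that the bare `@lru_cache` form needs Python 3.8+, and that the cache holds a reference to `self`, so instances stay alive as long as the cache does.

```python
from functools import lru_cache


class Handler:
    """Hypothetical handler used only to demonstrate method caching."""

    def __init__(self, filepath):
        self.filepath = filepath
        self.calls = 0  # counts how often the expensive body actually runs

    @lru_cache
    def get_layout_from_first_page(self):
        self.calls += 1
        # Placeholder for the real layout analysis of the first page.
        return ("layout", (612, 792))


handler = Handler("example.pdf")
first = handler.get_layout_from_first_page()
second = handler.get_layout_from_first_page()
print(handler.calls)  # 1
```

The cache key is the instance itself, so a second `Handler` for a different file gets its own cached result.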
@foarsitter I looked at this code 2 years ago. To be honest, I can't recall much about the code base, so feel free to modify this PR in any way you see fit :)
In that case I approve this PR
@foarsitter @bosd How do you want to handle merging? I would suggest that one of you takes the lead and merges (in general, not only for this PR). People need to see that this repo is active and changes get merged. From my experience with taking over the maintenance of pypdf, my priorities would be:
One thing that you might want to use is the fact that pypdf-table-extraction is still at a 0.x version, so a bit of instability might be OK. Having said that: that's your project :-) I'm here to support, but it's fine for me if you develop/run pypdf-table-extraction in a different way :-)
I'm with you on all 3 points. With regard to this PR: I have looked at the code changes, LGTM. I didn't press the approve or merge button because there are no test mechanisms in place. Implementing the (cookiecutter) repo setup is a priority for me.
@karlowich Thanks for this PR. The speed results are impressive. Would this change alter the functionality?
@bosd Yes, this will change the functionality slightly. Before, the code would overwrite the dimensions for every page and ultimately return the dimensions of the last page. My code returns the dimensions of the first page. Neither can correctly handle PDFs with pages of different sizes.
@bosd this PR uses the layout of the first page instead of the last page. Either way it will be a problem if a PDF contains pages of various sizes. To be backwards compatible we would need to use the last item of the collection, I guess.
Thanks for the clarification. Support for handling multiple page sizes might be something for a future PR.
I don't see why this would cause backward compatibility issues. LGTM now.
Because we are moving from the last page to the first page, but IMHO that is a minor detail. I will merge it :) Thanks @karlowich for submitting your PR and for your quick responses!