[MRG] Utils: optimise get_page_layout #5

karlowich · 2024-02-27T11:32:24Z

Since the existing code overwrites layout and dim in each iteration, it is much more efficient to simply return the layout and dim of the first page.

I have tested the difference with a 455 page pdf and the optimisation reduces the time spent from 50 to 5 seconds.

Since the existing code overwrites `layout` and `dim` in each iteration, it is much more efficient to simply return the `layout` and `dim` of the first page. I have tested the difference with a 455 page pdf and the optimisation reduces the time spent from 50 to 5 seconds. Signed-off-by: Karl Bonde Torp <[email protected]>

foarsitter · 2024-02-27T12:45:54Z

The current behavior indeed does not make any sense.

If it only returns the layout of the first page then I would suggest to rename it to get_layout_from_first_page or something similar.

The function is called in the _save_page method from PDFHandler. _save_page is called for each page. So I suggest adding cache for PDFHandler to.

Example:

from functools import lru_cache

class PDFHandler:
   ...
    @lru_cache
    def get_layout_from_first_page(self):
        return get_page_layout(self.filepath)

What do you think?

An extra bonus would be to add type hinting for the method. But that is something we didn't discus yet with the team.

karlowich · 2024-02-27T12:57:21Z

@foarsitter I looked at this code 2 years ago. To be honest I can't recall much about the code base, feel free to modify this PR in any way you see fitting :)

foarsitter

In that case I approve this PR

MartinThoma · 2024-02-27T20:00:41Z

@foarsitter @bosd How do you want to handle merging?

I would suggest that one of you takes the lead and merges (in general, not only for this PR). People need to see that this repo is active and changes get merged.

From my experience with taking over the maintenance of pypdf, my priorities would be:

Ensure the test suite covers the important parts + CI runs / blocks merging if it fails
Merge many things rather quickly in the beginning to show people that this repo is active again.
Once the first batch of changes get lower, apply higher standards and potentially take more time discussing PRs.

One thing that you might want to use is the fact that pypdf-table-extraction is still in 0.x version. So a bit of instability might be ok.

Having said that: That's your project :-) I'm here to support, but it's fine for me if you develop/run pypdf-table-extraction in a different way :-)

bosd · 2024-02-27T20:24:50Z

How do you want to handle merging?
Good question. Ideally I would like to see multiple approvals on a PR.
Yet, even in some bigger projects who enforce a 2 approval system, I sadly see PR's go stale.
So, my take on this would be rule of at least 1 approving review and passing tests.

I'm with you on all 3 points.

In regards to this PR. I have looked at the code changes, LGTM. Did'nt press approve of merge button due to no test mechanisms in place.
As you mentiond, it is 0 version So preferring speed of implementation..

Implementing the (cookiecutter) / repo setup is a priority to me.
Ironically, the earlier mentioned awesome-python cookiecutter template is in need of some maintainer ❤️ as well.
(Support for recent python versions have not been added.)

bosd · 2024-02-27T20:33:20Z

@karlowich Thanks for this PR. Speed results are impressive.

Would this change, alter the functionality?
E.g. A PDF with an A4 size cover page and with follow up pages on A3 size which contain the tables.
Will the pages after the cover page still be parsed at the A3 size?

karlowich · 2024-02-28T07:17:39Z

@bosd Yes this will change the functionality slightly. Before, the code would overwrite the dimensions for every page and ultimately return the dimensions of the last page. My code returns the dimensions of the first page.

Neither can correctly handle PDFs with different size pages.

foarsitter · 2024-02-28T07:19:16Z

@bosd this PR uses the layout of the first page instead of the last page. Either way it will be a problem if a pdf contains various sizes.

To be backwards compatible we need to use the last item of the collection I guess.

bosd · 2024-02-28T07:35:08Z

Neither can correctly handle PDFs with different size pages.

Thanks for the clarification. The support for handling multiple pages sizes might be something for a future PR.
This one is ok for me now.

To be backwards compatible we need to use the last item of the collection I guess.

I don't see why this would cause backward compatability options. LGTM now.

foarsitter · 2024-02-28T07:38:42Z

I don't see why this would cause backward compatibility options. LGTM now.

Because we are moving from last page to first page, but that IMHO is a minor detail.

I will merge it :)

Thanks @karlowich for submitting your PR and your quick responses!

karlowich mentioned this pull request Feb 27, 2024

[MRG] Utils: optimise get_page_layout camelot-dev/camelot#297

Open

foarsitter approved these changes Feb 27, 2024

View reviewed changes

bosd approved these changes Feb 28, 2024

View reviewed changes

foarsitter merged commit 706ea1a into py-pdf:main Feb 28, 2024

bosd added the performance Performance label Aug 26, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[MRG] Utils: optimise get_page_layout #5

[MRG] Utils: optimise get_page_layout #5

karlowich commented Feb 27, 2024

foarsitter commented Feb 27, 2024 •

edited

Loading

karlowich commented Feb 27, 2024

foarsitter left a comment

MartinThoma commented Feb 27, 2024

bosd commented Feb 27, 2024

bosd commented Feb 27, 2024

karlowich commented Feb 28, 2024

foarsitter commented Feb 28, 2024

bosd commented Feb 28, 2024

foarsitter commented Feb 28, 2024

[MRG] Utils: optimise get_page_layout #5

[MRG] Utils: optimise get_page_layout #5

Conversation

karlowich commented Feb 27, 2024

foarsitter commented Feb 27, 2024 • edited Loading

karlowich commented Feb 27, 2024

foarsitter left a comment

Choose a reason for hiding this comment

MartinThoma commented Feb 27, 2024

bosd commented Feb 27, 2024

bosd commented Feb 27, 2024

karlowich commented Feb 28, 2024

foarsitter commented Feb 28, 2024

bosd commented Feb 28, 2024

foarsitter commented Feb 28, 2024

foarsitter commented Feb 27, 2024 •

edited

Loading