Retain text when importing PDF's #668

Open
chron-isch opened this issue May 11, 2023 · 6 comments
Labels
enhancement New feature or request

Comments

@chron-isch

Hey, I just stumbled upon rnote a couple of weeks ago and it's an amazing project. Thank you for the work!

Is your feature request related to a problem? Please describe.
I went through great pain to scan and OCR almost every document/script/lecture note I own to make them searchable.
I usually take notes or highlight parts of those documents for later, but the moment I import them, they lose all OCR information/text and become nothing more than a fancy picture.
This is especially annoying with long lecture notes, since I can't search them anymore.

Describe the solution you'd like
Do not rasterize PDFs. Alternatively, use the PDF as a background or just reference the file, as xournalpp does, and merge the notes and the PDF on export. Any approach that preserves the original information contained in the file is fine with me.

Describe alternatives you've considered
I considered just sending the exported PDFs through OCR again, but my handwriting/highlighting/doodling makes OCR more difficult and error-prone.

Thank you!

@LeSnake04
Contributor

LeSnake04 commented May 12, 2023

related to #153

I think it makes sense to discuss the feature here instead, since #153 is already pretty bloated.

@Kneemund Kneemund added the enhancement New feature or request label May 31, 2023
@flxzt flxzt changed the title Do not rasterize PDFs Retain text when importing PDF's Jun 30, 2023
@bamonroe

This just bit me today. I presumed that importing a PDF would keep it as a vector graphic. However, the PDF was rasterized, and a 2MB document became a 50MB document after exporting. Until this is fixed, I have to go back to Xournalpp, because no one wants to email 50MB PDFs. I think there are lots of good suggestions in the other issue, such as using the imported PDF as a background. It would be great to see some traction on this.

@flxzt
Owner

flxzt commented Aug 1, 2023

#761 restores exporting Pdf pages in a vectorized format (for Pdf and Svg export). Retaining Pdf text is a bit more complicated, but I'll look into whether it can be done.

A dedicated Pdf annotation mode is something that I would like as well, but I will track the progress for that feature in the other issue.

@flxzt
Owner

flxzt commented Dec 7, 2023

Retaining text is currently not supported because the pdf page content is converted to Svg and simplified when it is imported as a vector image.

There are two main reasons for this. First, simplifying reduces the render workload in some cases. Second, and more importantly, simplifying converts the glyphs to Svg paths. If the glyphs were retained, their IDs would commonly clash when the page images are combined on document export, which results in nonsensical text.

Another solution could be to parse the Svg but, instead of simplifying it, only prefix all matching IDs with a random string. This way there wouldn't be any clashes, but the original glyphs/text would still be retained. We'd need to test if the image rendering workload would still be acceptable, of course.

EDIT: it looks like there is progress towards writing text back with usvg (resvg #682), so that would be a major step towards being able to simplify the Svg, resolve clashing element IDs and retain text.
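
For illustration, here is a minimal sketch of the ID-prefixing idea from this comment. It is not rnote's code: it is written in Python, it uses regular expressions instead of a real Svg parser, and the function names (`prefix_svg_ids`, `combine_pages`) and the two sample page fragments are hypothetical. The fragments only imitate the general shape of glyph symbols referenced through `xlink:href`, with both pages reusing the same ID for different glyph outlines.

```python
import re
import secrets

# Matches an `id="..."` attribute, but not e.g. `data-id="..."`.
ID_DEF = re.compile(r'(?<![\w:-])id="([^"]+)"')
# Matches `href="#name"` and `xlink:href="#name"`.
ID_HREF = re.compile(r'((?:xlink:)?href="#)([^"]+)"')
# Matches `url(#name)` references, as used by clip paths, masks, gradients.
ID_URL = re.compile(r'url\(#([^)]+)\)')


def prefix_svg_ids(svg: str, prefix: str) -> str:
    """Prefix every ID defined in `svg` and every reference to such an ID."""
    defined = set(ID_DEF.findall(svg))

    def rename_href(m: re.Match) -> str:
        name = m.group(2)
        # Leave references to IDs that are not defined in this fragment alone.
        return f'{m.group(1)}{prefix}{name}"' if name in defined else m.group(0)

    def rename_url(m: re.Match) -> str:
        name = m.group(1)
        return f"url(#{prefix}{name})" if name in defined else m.group(0)

    svg = ID_DEF.sub(lambda m: f'id="{prefix}{m.group(1)}"', svg)
    svg = ID_HREF.sub(rename_href, svg)
    svg = ID_URL.sub(rename_url, svg)
    return svg


def combine_pages(pages: list[str]) -> str:
    """Give each page its own random prefix, then join the pages into one Svg."""
    parts = []
    for page in pages:
        # The leading letter keeps the prefixed ID a valid XML name.
        prefix = f"p{secrets.token_hex(4)}-"
        parts.append(prefix_svg_ids(page, prefix))
    return (
        '<svg xmlns="http://www.w3.org/2000/svg" '
        'xmlns:xlink="http://www.w3.org/1999/xlink">'
        + "".join(parts)
        + "</svg>"
    )


# Hypothetical page fragments: the same ID names a different glyph on each page.
PAGE_1 = (
    '<g><symbol id="glyph0-1"><path d="M 1 0 L 2 3"/></symbol>'
    '<use xlink:href="#glyph0-1" x="10" y="20"/></g>'
)
PAGE_2 = (
    '<g><symbol id="glyph0-1"><path d="M 5 5 L 9 9"/></symbol>'
    '<use xlink:href="#glyph0-1" x="30" y="40"/></g>'
)

combined = combine_pages([PAGE_1, PAGE_2])
print(combined)
```

Without the prefixes, both `use` elements in the combined document would point at the same `glyph0-1`, so one page would be drawn with the other page's glyph outline. A real implementation would rewrite the IDs on the parsed tree rather than on the raw text, and whether the render workload stays acceptable would still need to be measured.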

@flxzt
Owner

flxzt commented Feb 19, 2024

With usvg v0.40 text is now retained. However, poppler still draws the glyphs as paths when rendering pdf pages to cairo. That's the only blocker left for retaining text in the export.

@lokman2k5
Contributor

With usvg v0.40 text is now retained. However, poppler still draws the glyphs as paths when rendering pdf pages to cairo. That's the only blocker left for retaining text in the export.

Are there any changes regarding this matter? I'd like to be able to select text in a PDF, as in xournalpp.

6 participants