The URLs with non-ASCII characters don't work in PDFs #1011

Closed
oscaretu opened this issue Mar 5, 2023 · 7 comments

@oscaretu

oscaretu commented Mar 5, 2023

If I create a PDF from a document that contains a URL with a non-ASCII character, such as ó, the generated URL is wrong. In the HTML output, the URL works properly.

For example:

https://es.wikipedia.org/wiki/On_bullshit:_sobre_la_manipulación_de_la_verdad

or, if the URL contains the URL-encoded value for "ó", that is, %C3%B3:

https://es.wikipedia.org/wiki/On_bullshit:_sobre_la_manipulaci%C3%B3n_de_la_verdad

the URL in the PDF doesn't work. In both cases, the generated URL is:

https://es.wikipedia.org/wiki/On_bullshit:_sobre_la_manipulaci%25C3%25B3n_de_la_verdad

which contains %25C3%25B3 instead of the correct %C3%B3. In the generated PDF, every % in %C3%B3 is replaced by %25, which is the percent-encoding of the % character itself.
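The double encoding described above can be reproduced in a few lines of C#. This is only an illustrative sketch: it uses `Uri.EscapeDataString` as a stand-in for whatever encoder each tool actually uses, and the variable names are made up for the example.

```csharp
using System;

// Example path segment taken from the URL in this report.
string raw = "manipulación";

// First pass: the non-ASCII character is percent-encoded as UTF-8 bytes.
string encodedOnce = Uri.EscapeDataString(raw);

// Second pass: an encoder that runs again treats '%' as a literal
// character and escapes it as %25, which breaks the link.
string encodedTwice = Uri.EscapeDataString(encodedOnce);

Console.WriteLine(encodedOnce);   // manipulaci%C3%B3n
Console.WriteLine(encodedTwice);  // manipulaci%25C3%25B3n
```

The second line of output matches the broken form that ends up in the PDF.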

@RickStrahl
Owner

RickStrahl commented Mar 12, 2023

I took a look at this, and unfortunately it looks like it can't be fixed for the PDF output generation that goes through wkhtmltopdf.

The way the URL is formatted is completely valid going into the PDF converter: you can see it working in the preview by opening the links and navigating. Extended characters don't actually need to be URL-encoded in valid links.

Unfortunately, the wkhtmltopdf converter takes it upon itself to encode the link even if the link is already explicitly URL-encoded, which results in double encoding and ultimately a bad link. We can't control wkhtmltopdf's link encoding, so there's nothing we can do to fix this for the PDF export.

However, as an alternative you can use Print to PDF (use the Print Icon and then Save As PDF). That output will generate the correct link formatting and navigate.

@RickStrahl
Owner

So I've been screwing around with this today, and I think I have a solution.

The problem is that wkhtmltopdf encodes links even if they are already encoded. MM's Markdown parsers (Markdig, but also Pandoc) automatically URL-encode links that are not already URL-encoded. Essentially, every link that comes out of the Markdown parser is URL-encoded.

wkhtmltopdf then blindly URL-encodes everything again, which is why the links fail.

So the solution I came up with is to take the rendered HTML file output, strip out the URL encoding in all links, and then pass the updated rendered file to wkhtmltopdf.

That appears to work, and I now get links that navigate correctly.

// HACK: wkhtmltopdf fucks up UrlEncoded links or links with extended characters

// Needs: using System.IO; using System.Net; plus the HtmlAgilityPack package

/// <summary>
/// Rewrites the HTML by URL decoding all links
/// letting wkhtmltopdf do the encoding to ensure that links work.
/// </summary>
/// <param name="htmlFile">Path of the rendered HTML file, rewritten in place</param>
public void UrlDecodeLinks(string htmlFile)
{
    // Load the rendered HTML
    var htmlDoc = new HtmlAgilityPack.HtmlDocument();
    using (var fs = new FileStream(htmlFile, FileMode.Open, FileAccess.Read))
    {
        htmlDoc.Load(fs, System.Text.Encoding.UTF8);
    }

    // SelectNodes() returns null (not an empty list) when nothing matches
    var hrefs = htmlDoc.DocumentNode.SelectNodes("//a");
    if (hrefs == null)
        return;

    // URL Decode all links
    var docChanged = false;
    foreach (var href in hrefs)
    {
        var link = href.Attributes["href"]?.Value;
        if (!string.IsNullOrEmpty(link))
        {
            var newLink = WebUtility.UrlDecode(link);
            if (newLink != link)
            {
                href.Attributes["href"].Value = newLink;
                docChanged = true;
            }
        }
    }

    if (docChanged)
    {
        // FileMode.Create truncates the file first; FileMode.Open would leave
        // stale trailing bytes because the decoded output is shorter
        using (var fs = new FileStream(htmlFile, FileMode.Create, FileAccess.Write))
        {
            htmlDoc.Save(fs, System.Text.Encoding.UTF8);
        }
    }
}

This code is only applied to the output generation with wkhtmltopdf (i.e. the PDF options on the menu). The Print functionality is not affected by this, so that also continues to work as expected.
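To see why decoding first avoids the problem, here is a small sketch that models the converter's encoding pass. `ConverterEncode` is a hypothetical stand-in built on `Uri.EscapeDataString`; wkhtmltopdf's real encoder is not shown in this thread and may differ in detail.

```csharp
using System;
using System.Net;

// Hypothetical stand-in for the converter's own encoding pass.
// Assumption: it behaves like a plain percent-encoder.
string ConverterEncode(string segment) => Uri.EscapeDataString(segment);

// Path segment as it appears in the rendered HTML href.
string fromMarkdown = "manipulaci%C3%B3n";

// Without the fix: the already encoded value is encoded again.
string withoutFix = ConverterEncode(fromMarkdown);

// With the fix: decode first, so the converter encodes exactly once.
string decoded = WebUtility.UrlDecode(fromMarkdown);
string withFix = ConverterEncode(decoded);

Console.WriteLine(withoutFix); // manipulaci%25C3%25B3n
Console.WriteLine(decoded);    // manipulación
Console.WriteLine(withFix);    // manipulaci%C3%B3n

// Side effect to be aware of: WebUtility.UrlDecode also turns '+' into a
// space, whereas Uri.UnescapeDataString leaves '+' untouched.
string plusDecoded = WebUtility.UrlDecode("a+b");
string plusKept = Uri.UnescapeDataString("a+b");
Console.WriteLine(plusDecoded); // a b
Console.WriteLine(plusKept);    // a+b
```

The last two lines are a possible edge case rather than something reported here: a link that contains a literal `+` would be changed by `WebUtility.UrlDecode`, so `Uri.UnescapeDataString` might be worth considering if such links show up.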

This will be updated in v2.8.8.4 and forward.

@oscaretu
Author

Hi Rick.

As a proof of concept, I created a Markdown file with Markdown Monster that contains two similar URLs:

  • The V1 URL contains an ó character. When I export the file as HTML, Markdown Monster converts that ó to %C3%B3. If I edit the HTML file and restore the original ó, the link in the PDF generated with wkhtmltopdf works as expected.
  • The V2 URL contains %C3%B3. When I export the file as HTML, it remains as %C3%B3. In the PDF generated with wkhtmltopdf, the link fails.

As the error occurs in wkhtmltopdf, I decided to open an issue in their GitHub repository, but unfortunately the project has not been maintained for more than a year, so the best solution, a fix in wkhtmltopdf itself, is not feasible.

@RickStrahl
Owner

Did you try the latest version as mentioned? This should now work as the URLs are decoded before wkhtmltopdf runs against the HTML, so there's no double encoding.

This works for me (v2.8.9):

image

@oscaretu
Author

Hello, Rick.

I just installed version 2.8.9 and PDF link generation is working fine. I had already assumed that a later version would fix the problem, although I hadn't installed it yet. But my aim was more than just making the problem go away in Markdown Monster. What I wanted was for the problem to be fixed in the optimal way, i.e. for the bug to be removed from wkhtmltopdf itself, so that all users of the program could enjoy a version without the bug.

@RickStrahl
Owner

OK, thanks. As far as MM is concerned, though, this issue is complete, since it now works for you, right?

@oscaretu
Author

Yes, Rick. In MM the behaviour is the expected one.
