The URLs with non-ASCII characters don't work in PDFs #1011

oscaretu · 2023-03-05T00:08:34Z

If I create a PDF from a document that contains a URL with a non-ASCII character, such as ó, the generated URL is wrong. In the HTML output, the URL works properly.

For example:

On bullshit: sobre la manipulación de la verdad

https://es.wikipedia.org/wiki/On_bullshit:_sobre_la_manipulación_de_la_verdad

or, if the URL contains the URL-encoded value for "ó" (that is, %C3%B3):

https://es.wikipedia.org/wiki/On_bullshit:_sobre_la_manipulaci%C3%B3n_de_la_verdad

the URL in the PDF doesn't work. In both cases, the generated URL is:

https://es.wikipedia.org/wiki/On_bullshit:_sobre_la_manipulaci%25C3%25B3n_de_la_verdad

which contains %25C3%25B3 instead of the correct %C3%B3. In the generated PDF, every % in %C3%B3 is replaced by %25.
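The double encoding described here can be reproduced outside the PDF pipeline. The following is a minimal C# sketch (top-level statements, C# 9 or later), not the converter's actual code: it assumes the PDF step simply percent-encodes text that is already encoded, and uses `Uri.EscapeDataString` as a stand-in for whatever encoder the converter really applies.

```csharp
using System;

// The affected path segment from the example URL, in raw form.
string raw = "manipulación";

// A single encoding pass turns "ó" into its UTF-8 bytes, percent-encoded.
string encodedOnce = Uri.EscapeDataString(raw);

// A second pass over already-encoded text escapes every '%' as "%25".
string encodedTwice = Uri.EscapeDataString(encodedOnce);

Console.WriteLine(encodedOnce);   // manipulaci%C3%B3n
Console.WriteLine(encodedTwice);  // manipulaci%25C3%25B3n

// Decoding the broken value once only gets back to the single-encoded form,
// which is why the server sees a literal "%C3%B3" in the page title.
Console.WriteLine(Uri.UnescapeDataString(encodedTwice));  // manipulaci%C3%B3n
```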

RickStrahl · 2023-03-12T23:53:18Z

Took a look at this and unfortunately it looks like this can't be fixed for the PDF output generation that goes through wkhtmltopdf.

The way the URL is formatted is actually completely valid going into the PDF converter: you can see it working in the preview, where the links open and navigate. Extended characters don't actually need to be URL-encoded for a link to work in the browser.

But unfortunately the wkhtmltopdf converter takes it upon itself to encode the link even if the link is already explicitly URL-encoded, resulting in double encoding and ultimately a bad link. We can't control wkhtmltopdf's link encoding, so there's nothing we can do to fix this for the exported PDF.

However, as an alternative you can use Print to PDF (use the Print icon and then Save As PDF). That output has the correct link formatting, and the links navigate.

RickStrahl · 2023-03-13T07:28:53Z

So I've been screwing around with this today, and I think I have a solution.

The problem is that wkhtmltopdf encodes links even if they are already encoded. MM's Markdown parser (Markdig, but also Pandoc) automatically URL-encodes links that are not already URL-encoded. Essentially, every link that comes out of the Markdown parser is URL-encoded.

wkhtmltopdf then blindly URL-encodes everything again, which is why the links fail.

So the solution I came up with is to take the rendered HTML file output, strip out the URL encoding in all links, and then pass the updated file to wkhtmltopdf.

That appears to work, and I now get working links.

// HACK: wkhtmltopdf breaks links that are already UrlEncoded by encoding them again

/// <summary>
/// Rewrites the HTML by URL decoding all links,
/// letting wkhtmltopdf do the encoding to ensure that links work.
/// </summary>
/// <param name="htmlFile">Path of the rendered HTML file, which is updated in place</param>
public void UrlDecodeLinks(string htmlFile)
{
    // Requires: using System.IO; using System.Net; and the HtmlAgilityPack package

    // Load the rendered HTML
    var htmlDoc = new HtmlAgilityPack.HtmlDocument();
    using (var fs = new FileStream(htmlFile, FileMode.Open, FileAccess.Read))
    {
        htmlDoc.Load(fs, System.Text.Encoding.UTF8);
    }

    // SelectNodes() returns null (not an empty list) when nothing matches
    var hrefs = htmlDoc.DocumentNode.SelectNodes("//a[@href]");
    if (hrefs == null)
        return;

    // URL Decode all links
    var docChanged = false;
    foreach (var href in hrefs)
    {
        var link = href.Attributes["href"]?.Value;
        if (!string.IsNullOrEmpty(link))
        {
            var newLink = WebUtility.UrlDecode(link);
            if (newLink != link)
            {
                href.Attributes["href"].Value = newLink;
                docChanged = true;
            }
        }
    }

    if (docChanged)
    {
        // FileMode.Create truncates the existing file. The decoded HTML is shorter
        // than the original, so FileMode.Open would leave stale bytes at the end.
        using (var fs = new FileStream(htmlFile, FileMode.Create, FileAccess.Write))
        {
            htmlDoc.Save(fs, System.Text.Encoding.UTF8);
        }
    }
}

This code is only applied to the output generation with wkhtmltopdf (i.e. the PDF options on the menu). The Print functionality is not affected by this, so it continues to work as expected.

This will be updated in v2.8.8.4 and forward.
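As a rough illustration of why decoding first helps, here is a small C# sketch (top-level statements, C# 9 or later). It is not Markdown Monster's or wkhtmltopdf's code: `SimulateConverterEncoding` is a hypothetical stand-in that assumes the converter's encoding behaves like `Uri.EscapeDataString` on a path segment. The last two lines show a side effect worth knowing about when decoding whole links with `WebUtility.UrlDecode`: it also turns `+` into a space, which `Uri.UnescapeDataString` does not.

```csharp
using System;
using System.Net;

// Hypothetical stand-in for the converter's encoding step (an assumption).
static string SimulateConverterEncoding(string segment) => Uri.EscapeDataString(segment);

// What the Markdown parser emits for the affected path segment.
string fromParser = "manipulaci%C3%B3n";

// Without the workaround: the already-encoded text is encoded a second time.
string broken = SimulateConverterEncoding(fromParser);

// With the workaround: decode first, so the converter's pass is the only encoding.
string decoded = WebUtility.UrlDecode(fromParser);
string fixedLink = SimulateConverterEncoding(decoded);

Console.WriteLine(broken);     // manipulaci%25C3%25B3n
Console.WriteLine(fixedLink);  // manipulaci%C3%B3n

// Side effect to keep in mind: '+' handling differs between the two decoders.
Console.WriteLine(WebUtility.UrlDecode("wiki/C++"));      // "wiki/C" followed by two spaces
Console.WriteLine(Uri.UnescapeDataString("wiki/C++"));    // wiki/C++
```

If links containing a literal `+` matter for a given document, `Uri.UnescapeDataString` may be the safer decoder to experiment with, although whether the converter then handles such links correctly would need to be tested.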

oscaretu · 2023-03-15T18:46:31Z

Hi Rick.

I have created a Markdown file in Markdown Monster as a proof of concept, with two similar URLs:

The V1 URL contains an ó character. When I export the file as HTML, Markdown Monster converts that ó to %C3%B3. If I edit the HTML file and restore the original ó, the link in the PDF generated with wkhtmltopdf works as expected.
The V2 URL contains %C3%B3. When I export the file as HTML, it remains %C3%B3. In the PDF generated with wkhtmltopdf, the link fails.

As the error occurs in wkhtmltopdf, I decided to open an issue in their GitHub repository, but unfortunately the project has not been supported for more than a year, so the best solution is not feasible.

RickStrahl · 2023-03-15T19:39:11Z

Did you try the latest version as mentioned? This should now work as the URLs are decoded before wkhtmltopdf runs against the HTML, so there's no double encoding.

This works for me (v2.8.9):

oscaretu · 2023-03-15T20:15:50Z

Hello, Rick.

I just installed version 2.8.9 and the PDF link generation is working fine. I had already assumed that a later version would fix the problem, although I hadn't installed it yet. But my aim went beyond keeping the problem from showing up in Markdown Monster. What I wanted was for the problem to be fixed in the optimal way, i.e. for the bug to be removed from wkhtmltopdf itself, so that all users of the program could enjoy a version without the bug.

RickStrahl · 2023-03-15T21:12:21Z

Ok, thanks. As for MM, though, this issue is complete since it now works for you, right?

oscaretu · 2023-03-16T06:58:59Z

Yes, Rick. In MM the behaviour is the expected one.

RickStrahl self-assigned this Mar 12, 2023

RickStrahl added bug can't fix labels Mar 12, 2023

RickStrahl removed the can't fix label Mar 13, 2023

RickStrahl closed this as completed Mar 15, 2023
