The URLs with non-ASCII characters don't work in PDFs #1011
Comments
Took a look at this and unfortunately it looks like this can't be fixed for the PDF output generation that goes through wkhtmltopdf. The URL is completely valid going into the PDF converter: you can see it working in the preview, where you can open the links and navigate. Extended characters don't actually need to be URL encoded in valid links. But the wkhtmltopdf converter takes it upon itself to encode the link even if the link is already explicitly URL encoded, which results in double encoding and ultimately a bad link. We can't control wkhtmltopdf's link encoding, so there's nothing we can do to fix this for the PDF export. As an alternative, you can use Print to PDF (use the Print icon and then Save As PDF). That output generates the correct link formatting and the links navigate.
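The double encoding described above can be reproduced in a few lines. This is only a sketch: `DoubleEncodingDemo` and `EncodeOnce` are hypothetical names, and `Uri.EscapeDataString` is used here as a stand-in for whatever encoder wkhtmltopdf applies internally, which is an assumption rather than something taken from its source.

```csharp
using System;

public static class DoubleEncodingDemo
{
    // Hypothetical helper: percent-encodes a path segment.
    // A '%' that is already part of an escape sequence is escaped again to "%25".
    public static string EncodeOnce(string segment) => Uri.EscapeDataString(segment);

    public static void Main()
    {
        string raw = "manipulación";

        // First pass (what the Markdown parser emits): ó becomes %C3%B3
        string once = EncodeOnce(raw);

        // Second pass (stand-in for the PDF converter): % becomes %25
        string twice = EncodeOnce(once);

        Console.WriteLine(once);
        Console.WriteLine(twice);
    }
}
```

A server receiving the second form decodes it only once and ends up looking for a page whose name literally contains `%C3%B3`, which is why the link fails.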
So I've been playing around with this today, and I think I have a solution. The problem is that wkhtmltopdf encodes links even if they are already encoded. MM's Markdown parser (Markdig, but also Pandoc) automatically URL encodes links that are not already URL encoded, so essentially every link that comes out of the Markdown parser is URL encoded. wkhtmltopdf then blindly URL encodes everything again, which is why the links fail. The solution I came up with is to take the rendered HTML file output, strip out the URL encoding in all links, and then pass the updated file to wkhtmltopdf. That appears to work, and I now get working links in the PDF.
// HACK: wkhtmltopdf breaks UrlEncoded links or links with extended characters
/// <summary>
/// Rewrites the HTML by URL decoding all links,
/// letting wkhtmltopdf do the encoding to ensure that links work.
/// </summary>
/// <param name="htmlFile">Path of the rendered HTML file, rewritten in place</param>
public void UrlDecodeLinks(string htmlFile)
{
    // URL decode all links
    var htmlDoc = new HtmlAgilityPack.HtmlDocument();
    using (var fs = new FileStream(htmlFile, FileMode.Open, FileAccess.Read))
    {
        htmlDoc.Load(fs, System.Text.Encoding.UTF8);
    }

    // SelectNodes() returns null (not an empty list) when nothing matches
    var hrefs = htmlDoc.DocumentNode.SelectNodes("//a");
    if (hrefs == null)
        return;

    var docChanged = false;
    foreach (var href in hrefs)
    {
        var link = href.Attributes["href"]?.Value;
        if (!string.IsNullOrEmpty(link))
        {
            var newLink = WebUtility.UrlDecode(link);
            if (newLink != link)
            {
                href.Attributes["href"].Value = newLink;
                docChanged = true;
            }
        }
    }

    if (docChanged)
    {
        // FileMode.Create truncates the file first. FileMode.Open would leave
        // stale bytes at the end, since the decoded document is shorter.
        using (var fs = new FileStream(htmlFile, FileMode.Create, FileAccess.Write))
        {
            htmlDoc.Save(fs, System.Text.Encoding.UTF8);
        }
    }
}
This code is only applied to the output generation with wkhtmltopdf (i.e. the PDF options on the menu). The Print functionality is not affected by this, so it continues to work as expected. This will be updated in
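One caveat that may be worth checking with this approach: `WebUtility.UrlDecode` follows form-encoding rules and turns a literal `+` into a space, while `Uri.UnescapeDataString` only resolves `%XX` sequences. The sketch below contrasts the two; `LinkDecodingDemo` and its method names are hypothetical, and whether the difference matters depends on the links in the document.

```csharp
using System;
using System.Net;

public static class LinkDecodingDemo
{
    // Same call as the workaround above: decodes %XX and also maps '+' to ' '.
    public static string DecodeFormStyle(string href) => WebUtility.UrlDecode(href);

    // Alternative: decodes %XX sequences only and leaves '+' untouched.
    public static string DecodePercentOnly(string href) => Uri.UnescapeDataString(href);

    public static void Main()
    {
        string href = "caf%C3%A9+menu";
        Console.WriteLine(DecodeFormStyle(href));   // café menu
        Console.WriteLine(DecodePercentOnly(href)); // café+menu
    }
}
```

If links containing a literal `+` show up in practice, swapping in the second call would be one option to try.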
Hi Rick. As a proof of concept, I created a Markdown file with Markdown Monster containing two similar URLs:
As the error occurs in wkhtmltopdf, I decided to open an issue in their GitHub repository, but unfortunately the project has not been supported for more than a year, so the best solution is not feasible.
Hello, Rick. I just installed version 2.8.9 and the PDF link generation is working fine. I had already assumed that a later version would fix the problem, although I hadn't installed it yet. But my intention went beyond making the problem disappear in Markdown Monster. What I wanted was for the problem to be fixed in the optimal way, i.e. for the bug in wkhtmltopdf itself to be removed, so that all users of the program could enjoy a version without the bug.
Ok, thanks. As for MM, though, this issue is complete, since it now works for you, right?
Yes, Rick. In MM the behaviour is the expected one.
If I create a PDF from a document that contains a URL with a non-ASCII character such as ó, the generated URL is wrong. In the HTML format, the URL works properly. For example:
https://es.wikipedia.org/wiki/On_bullshit:_sobre_la_manipulación_de_la_verdad
or, if the URL contains the URL-encoded value for "ó", that is %C3%B3:
https://es.wikipedia.org/wiki/On_bullshit:_sobre_la_manipulaci%C3%B3n_de_la_verdad
the URL in the PDF doesn't work. In both cases, the generated URL is:
https://es.wikipedia.org/wiki/On_bullshit:_sobre_la_manipulaci%25C3%25B3n_de_la_verdad
which contains %25C3%25B3 instead of the correct %C3%B3. In the generated PDF, every % of %C3%B3 is replaced by %25.