Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use StringDecoder for Buffers in WritableStream #128

Merged
merged 1 commit into from
Apr 1, 2016
Merged

Use StringDecoder for Buffers in WritableStream #128

merged 1 commit into from
Apr 1, 2016

Conversation

ajafff
Copy link
Contributor

@ajafff ajafff commented Apr 19, 2015

@fb55
Copy link
Owner

fb55 commented Apr 19, 2015

Could you add a test case that fails without this?

@ajafff
Copy link
Contributor Author

ajafff commented Apr 21, 2015

"use strict";

var ParserStream = require('htmlparser2').WritableStream;
var Buffer = require('buffer').Buffer;
var assert = require('assert');

var parser = new ParserStream({
    ontext:function(text){
        assert.equal(text, '€');
    }
});

parser.write(new Buffer([0xE2, 0x82]));
parser.write(new Buffer([0xAC]));

Without the fix this should fail AssertionError: "��" == "€"
After applying the fix those 2 Buffers will be concatenated and result in '€'

@fb55
Copy link
Owner

fb55 commented Apr 21, 2015

Hm, AFAICT, this should be fixed when concatinating text events as discussed with @jails in #124.

@ajafff
Copy link
Contributor Author

ajafff commented Apr 22, 2015

I get your point, but this problem is not inherent to text nodes. It could affect everything which contains characters that are not specified in ASCII (attributes, CDATA, comments, ...)
Also I looked at the implementation of high5. The concatenating of text events you mentioned is done by concatenating multiple string chunks to one string "buffer" (this._buffer)
Unfortunately it will not solve this problem, because

new Buffer([0xE2, 0x82]).toString() + new Buffer([0xAC]).toString() !== '€' //results in '���' instead

When working with Buffers in a streaming fashion you have to use StringDecoder to get utf8 right

@philiiiiiipp
Copy link

I think @ajafff is very right. I ran into trouble when parsing web pages, especially since the behaviour is quite unpredictable because a cut right between the two bytes happens quite rarely.

I think it would be beneficial to add this information to the /wiki/Parser-options. Would have saved me some troubles at least.

EDIT:
Ok, I am not sure if this is how it is supposed to go, but I just went ahead and wrote the note myself :-).

@fb55
Copy link
Owner

fb55 commented Mar 18, 2016

I forgot about this, sorry. This needs a test case in the test dir (as a new file – have a look at api.js) and quotes have to be double quotes (that's why the tests fail). Looks good otherwise.

@coveralls
Copy link

Coverage Status

Coverage increased (+0.01%) to 95.673% when pulling d79e36d on ajafff:master into 9770f24 on fb55:master.

@ajafff
Copy link
Contributor Author

ajafff commented Apr 1, 2016

@fb55 changed single quotes to double quotes, added test
PTAL

@fb55 fb55 merged commit 2aad069 into fb55:master Apr 1, 2016
@fb55
Copy link
Owner

fb55 commented Apr 1, 2016

Awesome, thanks!

@fb55 fb55 mentioned this pull request Oct 24, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants