Use StringDecoder for Buffers in WritableStream #128

ajafff · 2015-04-19T15:00:39Z

http://stackoverflow.com/questions/12121775/convert-streamed-buffers-to-utf8-string

fb55 · 2015-04-19T20:37:44Z

Could you add a test case that fails without this?

ajafff · 2015-04-21T17:46:58Z

"use strict";

var ParserStream = require('htmlparser2').WritableStream;
var Buffer = require('buffer').Buffer;
var assert = require('assert');

var parser = new ParserStream({
    ontext:function(text){
        assert.equal(text, '€');
    }
});

parser.write(new Buffer([0xE2, 0x82]));
parser.write(new Buffer([0xAC]));

Without the fix, this fails with AssertionError: "��" == "€".
With the fix, the bytes of those two Buffers are combined before decoding and result in '€'.
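For illustration, here is a minimal sketch of the general idea: a Writable stream that runs incoming Buffers through a StringDecoder before handing text on. It is not htmlparser2's actual code. `DecodingStream` and its `onText` callback are hypothetical names made up for this example, and `Buffer.from` is used in place of the deprecated `new Buffer`.

```javascript
"use strict";

var Writable = require("stream").Writable;
var StringDecoder = require("string_decoder").StringDecoder;

// Hypothetical stream for illustration only; `onText` is a made-up
// callback and not part of htmlparser2's API.
class DecodingStream extends Writable {
    constructor(onText) {
        super();
        this._decoder = new StringDecoder("utf8");
        this._onText = onText;
    }

    _write(chunk, encoding, cb) {
        // Buffers go through the decoder, which holds back incomplete
        // multi-byte sequences until the remaining bytes arrive.
        var text = Buffer.isBuffer(chunk) ? this._decoder.write(chunk) : chunk;
        if (text) this._onText(text);
        cb();
    }

    _final(cb) {
        // Flush whatever the decoder is still holding when the stream ends.
        var rest = this._decoder.end();
        if (rest) this._onText(rest);
        cb();
    }
}

var received = [];
var stream = new DecodingStream(function (text) {
    received.push(text);
});

stream.write(Buffer.from([0xE2, 0x82])); // incomplete character: nothing emitted
stream.write(Buffer.from([0xAC]));       // completes the character
stream.end();
```

With this input, `onText` is only called once the third byte has arrived, so the callback never sees a partial character.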

fb55 · 2015-04-21T19:42:21Z

Hm, AFAICT, this should be fixed when concatenating text events as discussed with @jails in #124.

ajafff · 2015-04-22T17:32:45Z

I get your point, but this problem is not specific to text nodes. It can affect anything that contains non-ASCII characters (attributes, CDATA, comments, ...).
Also, I looked at the implementation of high5. The concatenation of text events you mentioned is done by appending multiple string chunks to one string "buffer" (this._buffer).
Unfortunately, that will not solve this problem, because

new Buffer([0xE2, 0x82]).toString() + new Buffer([0xAC]).toString() !== '€' //results in '���' instead

When working with Buffers in a streaming fashion, you have to use StringDecoder to decode UTF-8 correctly.
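A short sketch of the difference, using only Node's built-in `string_decoder` module. The variable names are made up for this example, and `Buffer.from` is used in place of the deprecated `new Buffer`:

```javascript
"use strict";

var StringDecoder = require("string_decoder").StringDecoder;

// The UTF-8 encoding of "€" (U+20AC) is the three bytes E2 82 AC,
// split here across two chunks.
var first = Buffer.from([0xE2, 0x82]);
var second = Buffer.from([0xAC]);

// Naive approach: each chunk is decoded in isolation, so the incomplete
// byte sequences turn into U+FFFD replacement characters.
var naive = first.toString("utf8") + second.toString("utf8");

// StringDecoder keeps the incomplete bytes until the rest arrives.
var decoder = new StringDecoder("utf8");
var decoded = decoder.write(first) + decoder.write(second) + decoder.end();

console.log(naive === "€");   // false
console.log(decoded === "€"); // true
```

The first `decoder.write` call returns an empty string because the character is not complete yet; the second call returns the whole character.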

philiiiiiipp · 2016-03-18T14:05:24Z

I think @ajafff is right. I ran into trouble when parsing web pages; the behaviour is quite unpredictable because a chunk boundary only rarely falls in the middle of a multi-byte character.

I think it would be beneficial to add this information to /wiki/Parser-options. It would have saved me some trouble, at least.

EDIT:
Ok, I am not sure if this is how it is supposed to go, but I just went ahead and wrote the note myself :-).

fb55 · 2016-03-18T16:51:11Z

I forgot about this, sorry. This needs a test case in the test dir (as a new file – have a look at api.js) and quotes have to be double quotes (that's why the tests fail). Looks good otherwise.


coveralls · 2016-04-01T18:30:12Z

Coverage increased (+0.01%) to 95.673% when pulling d79e36d on ajafff:master into 9770f24 on fb55:master.

ajafff · 2016-04-01T18:32:34Z

@fb55 changed single quotes to double quotes, added test
PTAL

fb55 · 2016-04-01T18:34:42Z

Awesome, thanks!

Use StringDecoder for Buffers in WritableStream

d79e36d

http://stackoverflow.com/questions/12121775/convert-streamed-buffers-to-utf8-string

fb55 merged commit 2aad069 into fb55:master Apr 1, 2016

fb55 mentioned this pull request Oct 24, 2021

Regression in #914 #991

Closed
