Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Double encoding/decoding UTF8 #315

Closed
calzoneman opened this issue Mar 16, 2015 · 8 comments · Fixed by socketio/engine.io-parser#81
Closed

Double encoding/decoding UTF8 #315

calzoneman opened this issue Mar 16, 2015 · 8 comments · Fixed by socketio/engine.io-parser#81

Comments

@calzoneman
Copy link

It appears that engine.io is double-decoding (and double-encoding) UTF8 strings for polling clients. In particular, if polling clients specify any content-type besides 'application/octet-stream', engine.io calls req.setEncoding('utf8'), the request data is decoded as UTF8, and the resulting string is passed on to engine.io-parser, which then attempts to call utf8.decode(). Since the string has already been decoded from UTF8, this fails*. This is what I've observed from tinkering with the server side, and from talking to @nuclearace about his issues I suspect messages being sent to polling clients are double-encoded too.

The following demonstrates the issue (note that I added a console.log(e) in the exception handler where utf8.decode() is called):

> p.decodePayload('20:42["stringTest","π"]')
[Error: Invalid continuation byte]

Interestingly, this alternative version works:

> b = new Buffer([0x32,0x31,0x3a,0x34,0x32,0x5b,0x22,0x73,0x74,0x72,0x69,0x6e,0x67,0x54,0x65,0x73,0x74,0x22,0x2c,0x22,0xcf,0x80,0x22,0x5d])
<Buffer 32 31 3a 34 32 5b 22 73 74 72 69 6e 67 54 65 73 74 22 2c 22 cf 80 22 5d>
> b.toString()
'21:42["stringTest","π"]'
> p.decodePayload(b.toString('binary'), function (x) { console.log(x) })
{ type: 'message', data: '2["stringTest","π"]' }

So the problem to me seems to be that the parser is expecting a raw string, but since the string is coming from node's HTTP server, it has already been decoded from UTF8, and this should only happen once.

*I assume the client is double encoding, otherwise polling would be completely broken. This issue affects 3rd-party clients that are single-encoding UTF8 strings.

EDIT: Cleaned up some terminology.

@mathiasbynens
Copy link

Some confusing use of incorrect terminology here. I’ll try to clear that up:

So the problem to me seems to be that the parser is expecting a raw string, but since the string is coming from node's HTTP server, it has already been decoded to UTF8 from UTF-8 to Unicode, and UTF8 decoding is not idempotent this should only happen once.

@calzoneman
Copy link
Author

Thanks for clarifying that.

On Mar 16, 2015, at 08:39, Mathias Bynens [email protected] wrote:

Some confusing use of terminology here. I’ll try to clear that up:

So the problem to me seems to be that the parser is expecting a raw string, but since the string is coming from node's HTTP server, it has already been decoded to UTF8 from UTF-8 to Unicode, and UTF8 decoding is not idempotent this should only happen once.


Reply to this email directly or view it on GitHub.

nuclearace added a commit to nuclearace/Socket.IO-Client-Swift that referenced this issue Mar 16, 2015
@miguelgrinberg
Copy link

I stumbled upon this issue in the reverse direction on my Python engine.io server (miguelgrinberg/Flask-SocketIO#246). The official JS engine.io client also sends strings with a double utf-8 encode to the server.

My solution is to check if a double utf-8 decode can be applied. If that succeeds, then I assume the packet must have been double encoded. If the second conversion fails due to invalid chars, then I assume the packet must have been single encoded.

@darrachequesne
Copy link
Member

@miguelgrinberg @calzoneman what do you suggest to fix that issue? should we remove utf8.encode/decode calls in engine.io-parser?

@miguelgrinberg
Copy link

@darrachequesne it's been a while since I looked at this. I don't remember the full details, but keep in mind that the websocket transport does not have this problem, I think the best place to address this is in the long-polling code, not on the parser.

@darrachequesne
Copy link
Member

What's weird is that, in engine.io-parser, both encodePacket and decodePacket methods (which are used by the polling transport) provide an utf8(en|de)code option, but encodePayload and decodePayload methods always set it to true:

exports.encodePacket = function (packet, supportsBinary, utf8encode, callback) {
  ...
  if (undefined !== packet.data) {
    encoded += utf8encode ? utf8.encode(String(packet.data)) : String(packet.data);
  }
  ...
}

exports.encodePayload = function (packets, supportsBinary, callback) {
  ...
    exports.encodePacket(packet, supportsBinary, true, function(message) {
      doneCallback(null, setLengthHeader(message));
    });
  ...
};

utf8encode was added in this commit: socketio/engine.io-parser@95840ca

@calzoneman
Copy link
Author

It's been a while since I dug into this issue, but I'll try to take some time this week to look at it again. I agree with Miguel; I think the problem is likely in the polling transport since the websocket transport does not appear to have this issue.

@darrachequesne
Copy link
Member

darrachequesne commented Dec 1, 2016

I think the difference between both transports is that polling uses parser.(en|de)codePayload whereas websocket uses parser.(en|de)codePacket:

WebSocket.prototype.send = function (packets) {
  packets.forEach(function (packet) {
    parser.encodePacket(packet, self.supportsBinary, function (data) {
  ... // utf8encode = undefined
};
// when
Polling.prototype.send = function (packets) {
  ...
  parser.encodePayload(packets, this.supportsBinary, function (data) {
  ... // encodePayload then calls encodePacket with utf8encode = true
};

And here:

exports.encodePacket = function (packet, supportsBinary, utf8encode, callback) {
  ...
  // data fragment is optional
  if (undefined !== packet.data) {
    encoded += utf8encode ? utf8.encode(String(packet.data)) : String(packet.data);
  }
  ..
};

The question being, is utf8encode flag even needed?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants