Implement new approach for URI normalization (fixes #287)
colinodell committed Aug 8, 2017
1 parent 1162ff7 commit 7d91ca0
Showing 2 changed files with 53 additions and 1 deletion.
49 changes: 48 additions & 1 deletion src/Util/UrlEncoder.php
@@ -48,6 +48,53 @@ public static function unescapeAndEncode($uri)
{
$decoded = html_entity_decode($uri);

- return strtr(rawurlencode(rawurldecode($decoded)), self::$dontEncode);
+ return self::encode(self::decode($decoded));
}

/**
* Decode a percent-encoded URI
*
* @param string $uri
*
* @return string
*/
private static function decode($uri)
{
return preg_replace_callback('/%([0-9a-f]{2})/iu', function($matches) {

glensc (Contributor) commented on Mar 12, 2019:

I'm wondering: if you removed /i and added A-F to the pattern instead, which micro-optimization would be "better"? Or, in other words, how long does the input string have to be for this to even start to matter?

colinodell (Author, Member) commented on Mar 12, 2019:

The input string would probably have to be very long. I used Blackfire to profile this change against our tests and found the difference was negligible for "normal" inputs.

glensc (Contributor) commented on Mar 13, 2019:

Thanks for satisfying my curiosity! :)
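
For anyone who wants to try the comparison discussed in this thread, here is a rough timing sketch. It is not the Blackfire profile mentioned above; the helper name time_pattern, the subject string, and the iteration count are all made up for illustration, and the numbers will vary by machine and PHP version (hrtime() needs PHP 7.3 or later).

```php
<?php
// Hypothetical micro-benchmark: case-insensitive flag vs. explicit A-F class.
// Timings are machine-dependent; treat them as a rough indication only.
function time_pattern(string $pattern, string $subject, int $iterations): float
{
    $start = hrtime(true);
    for ($i = 0; $i < $iterations; $i++) {
        preg_match_all($pattern, $subject);
    }

    // hrtime(true) returns nanoseconds; convert to milliseconds
    return (hrtime(true) - $start) / 1e6;
}

// Illustrative input: one short URI repeated many times
$subject = str_repeat('http://example.com/a%62%63%2fd%3Fe ', 1000);

$withFlag  = '/%([0-9a-f]{2})/iu';    // pattern used in the commit
$withClass = '/%([0-9a-fA-F]{2})/u';  // alternative raised in the thread

printf("with /i flag:    %.2f ms\n", time_pattern($withFlag, $subject, 200));
printf("with A-F class:  %.2f ms\n", time_pattern($withClass, $subject, 200));
```

Both patterns match exactly the same substrings, so only the timing differs; varying the str_repeat() count is one way to see at what input length a gap becomes visible, if it does at all.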

// Convert percent-encoded codes to uppercase
$upper = strtoupper($matches[0]);
// Keep excluded characters as-is
if (array_key_exists($upper, self::$dontEncode)) {
return $upper;
}

// Otherwise, return the raw byte for this hex value
return chr(hexdec($matches[1]));
}, $uri);
}

/**
* Encode a URI, preserving already-encoded and excluded characters
*
* @param string $uri
*
* @return string
*/
private static function encode($uri)
{
return preg_replace_callback('/(%[0-9a-f]{2})|./iu', function($matches){
// Keep already-encoded characters as-is
if (count($matches) > 1) {
return $matches[0];
}

// Keep excluded characters as-is
if (in_array($matches[0], self::$dontEncode)) {
return $matches[0];
}

// Otherwise, encode the character
return rawurlencode($matches[0]);
}, $uri);
}
}
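
The decode-then-encode flow above can be tried in isolation with a standalone sketch like the one below. The sketch_* function names are invented for illustration, and DONT_ENCODE is an assumed four-entry stand-in for self::$dontEncode, whose real contents are not part of this diff. The html_entity_decode() step from unescapeAndEncode() is left out.

```php
<?php
// Assumption: a small stand-in for self::$dontEncode, keyed by uppercase
// percent-code with the literal character as the value.
const DONT_ENCODE = [
    '%23' => '#',
    '%2F' => '/',
    '%3A' => ':',
    '%3F' => '?',
];

function sketch_decode(string $uri): string
{
    return preg_replace_callback('/%([0-9a-f]{2})/iu', function (array $m): string {
        $upper = strtoupper($m[0]);
        // Excluded codes stay percent-encoded, normalized to uppercase
        if (array_key_exists($upper, DONT_ENCODE)) {
            return $upper;
        }

        // Everything else becomes the raw byte
        return chr(hexdec($m[1]));
    }, $uri);
}

function sketch_encode(string $uri): string
{
    return preg_replace_callback('/(%[0-9a-f]{2})|./iu', function (array $m): string {
        // Group 1 is only present when an existing percent-code matched
        if (count($m) > 1) {
            return $m[0];
        }

        // Excluded literal characters pass through untouched
        if (in_array($m[0], DONT_ENCODE, true)) {
            return $m[0];
        }

        return rawurlencode($m[0]);
    }, $uri);
}

function sketch_normalize(string $uri): string
{
    return sketch_encode(sketch_decode($uri));
}

// prints http://example.com/abc%2Fd%3Fe
echo sketch_normalize('http://example.com/a%62%63%2fd%3Fe'), "\n";
```

In this sketch %62 and %63 are decoded to plain letters, while %2f and %3F stay encoded (with %2f uppercased) because they appear in the exclusion map. Running an already-normalized URI through it again should return it unchanged.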
5 changes: 5 additions & 0 deletions tests/unit/Util/UrlEncoderTest.php
@@ -51,6 +51,11 @@ public function unescapeAndEncodeTestProvider()
['<', '%3C'],
['>', '%3E'],
['?', '?'],
['https://en.wikipedia.org/wiki/Markdown#CommonMark', 'https://en.wikipedia.org/wiki/Markdown#CommonMark'],
['https://img.shields.io/badge/help-%23hoaproject-ff0066.svg', 'https://img.shields.io/badge/help-%23hoaproject-ff0066.svg'],
['http://example.com/a%62%63%2fd%3Fe', 'http://example.com/abc%2Fd%3Fe'],
['http://ko.wikipedia.org/wiki/위키백과:대문', 'http://ko.wikipedia.org/wiki/%EC%9C%84%ED%82%A4%EB%B0%B1%EA%B3%BC:%EB%8C%80%EB%AC%B8'],
['http://ko.wikipedia.org/wiki/%EC%9C%84%ED%82%A4%EB%B0%B1%EA%B3%BC:%EB%8C%80%EB%AC%B8', 'http://ko.wikipedia.org/wiki/%EC%9C%84%ED%82%A4%EB%B0%B1%EA%B3%BC:%EB%8C%80%EB%AC%B8'],
];
}
}
