URLSearchParams can't be used on a URL

0x709394

A pile of links 🔗. If you have the patience to read them, skip my rambling; the links are better.

https://github.com/whatwg/url/issues/491
https://github.com/sindresorhus/got/issues/1234
https://github.com/sindresorhus/got/pull/1246

https://url.spec.whatwg.org/#urlsearchparams
https://url.spec.whatwg.org/#urlencoded-serializing
https://tools.ietf.org/html/rfc1738
https://tools.ietf.org/html/rfc3986

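JavaScript provides standard Web APIs, and among them is a newer URL API for manipulating URLs. A URL object has two properties related to the query string: search and searchParams. searchParams is a URLSearchParams, and yet this URLSearchParams can't really be used on a URL.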
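To make the two properties concrete, here is a minimal sketch. The host example.com and the parameter names are placeholders of mine, not anything from the linked issues.

```javascript
// example.com and the parameter names are placeholders.
const demoUrl = new URL('https://example.com/path?q=hello&lang=en');

// search: the raw query string, leading "?" included.
console.log(demoUrl.search); // ?q=hello&lang=en

// searchParams: a URLSearchParams object backed by the same query.
console.log(demoUrl.searchParams instanceof URLSearchParams); // true
console.log(demoUrl.searchParams.get('lang')); // en

// Changes made through searchParams show up in search.
demoUrl.searchParams.append('page', '2');
console.log(demoUrl.search); // ?q=hello&lang=en&page=2
```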

The reason is that URLSearchParams escapes with x-www-form-urlencoded, which is the encoding used when a form is submitted through an HTML form. (An HTML form submission is also formatted as a query string.)

Take 学者网 as an example, which in 2020 still doesn't use HTTPS even for login.

The space gets escaped as +.

const u = new URLSearchParams()
u.set('a', ' ')
u.toString() // a=+
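The same thing happens when going through a URL object. A small sketch, with example.com and the parameter names as made-up placeholders: touching one parameter re-serializes the whole query as form data, including parameters that were never touched. The replace at the end is one possible workaround, not something the standard offers.

```javascript
// example.com and the parameter names are placeholders.
const pageUrl = new URL('https://example.com/?q=a%20b&keep=x%20y');
console.log(pageUrl.search); // ?q=a%20b&keep=x%20y

// Setting one parameter re-serializes every parameter as form data.
pageUrl.searchParams.set('q', 'c d');
console.log(pageUrl.search); // ?q=c+d&keep=x+y

// Possible workaround: a literal "+" is serialized as %2B, so every "+"
// left in the output stands for a space and can be swapped for %20.
const percentQuery = pageUrl.searchParams.toString().replace(/\+/g, '%20');
console.log(percentQuery); // q=c%20d&keep=x%20y
```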

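There are two URL standards: one from WHATWG and one from the IETF.
The IETF one again comes in two versions: RFC1738 from 1994 and RFC3986 from 2005. In Node.js the legacy url API, i.e. what you get from require('url'), roughly follows RFC3986. Newer versions of Node.js use the WHATWG URL standard in place of the legacy url API.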

RFC1738 Section 2.2

Usually a URL has the same interpretation when an octet is
represented by a character and when it encoded. However, this is not
true for reserved characters: encoding a character reserved for a
particular scheme may change the semantics of a URL.

Thus, only alphanumerics, the special characters "$-_.+!*'(),", and
reserved characters used for their reserved purposes may be used
unencoded within a URL.

On the other hand, characters that are not required to be encoded
(including alphanumerics) may be encoded within the scheme-specific
part of a URL, as long as they are not being used for a reserved
purpose.

RFC3986 Section 3.4 gives a more precise definition of the query in a URL

The query component is indicated by the first question
mark ("?") character and terminated by a number sign ("#") character
or by the end of the URI.

It also says

The characters slash ("/") and question mark ("?") may represent data
within the query component. Beware that some older, erroneous
implementations may not handle such data correctly when it is used as
the base URI for relative references (Section 5.1), apparently
because they fail to distinguish query data from path data when
looking for hierarchical separators. However, as query components
are often used to carry identifying information in the form of
"key=value" pairs and one frequently used value is a reference to
another URI, it is sometimes better for usability to avoid percent-
encoding those characters.

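So a query like ?url=https://tools.ietf.org/html/rfc3986 is legal even without escaping.
JavaScript provides encodeURI and encodeURIComponent.
encodeURI does not encode the colon, slashes, etc. of the URL inside the query.
encodeURIComponent does encode them.
In this case either one works.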
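A quick sketch of the difference between the two. The second half uses a made-up nested URL on example.com to show where "either one works" stops being true: once the inner URL carries its own & the two functions are no longer interchangeable.

```javascript
const target = 'https://tools.ietf.org/html/rfc3986';

const loose = 'url=' + encodeURI(target);
const strict = 'url=' + encodeURIComponent(target);
console.log(loose);  // url=https://tools.ietf.org/html/rfc3986
console.log(strict); // url=https%3A%2F%2Ftools.ietf.org%2Fhtml%2Frfc3986

// Both parse back to the same value.
console.log(new URLSearchParams(loose).get('url') === target);  // true
console.log(new URLSearchParams(strict).get('url') === target); // true

// Hypothetical nested URL with its own query: encodeURI leaves "&" alone,
// so the outer query gets split in the wrong place.
const nested = 'https://example.com/?a=1&b=2';
const broken = new URLSearchParams('url=' + encodeURI(nested));
console.log(broken.get('url')); // https://example.com/?a=1
const intact = new URLSearchParams('url=' + encodeURIComponent(nested));
console.log(intact.get('url') === nested); // true
```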

The ABNF definitions in RFC3986 that relate to query are as follows

   pchar         = unreserved / pct-encoded / sub-delims / ":" / "@"

   query         = *( pchar / "/" / "?" )

   fragment      = *( pchar / "/" / "?" )

   pct-encoded   = "%" HEXDIG HEXDIG

   unreserved    = ALPHA / DIGIT / "-" / "." / "_" / "~"
   
   sub-delims    = "!" / "$" / "&" / "'" / "(" / ")"
                 / "*" / "+" / "," / ";" / "="

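*() means the contents of the parentheses may repeat anywhere from zero to infinitely many times. You run into it a lot when reading RFCs.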
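The grammar above translates almost mechanically into a regular expression. This is my own transcription (the name RFC3986_QUERY is made up), and it only checks syntax of the part after the "?", nothing more.

```javascript
// unreserved:  A-Za-z0-9 - . _ ~
// sub-delims:  ! $ & ' ( ) * + , ; =
// pchar adds   : @      query adds   / ?
// pct-encoded: "%" HEXDIG HEXDIG
const RFC3986_QUERY = /^(?:[A-Za-z0-9\-._~!$&'()*+,;=:@\/?]|%[0-9A-Fa-f]{2})*$/;

console.log(RFC3986_QUERY.test('url=https://tools.ietf.org/html/rfc3986')); // true
console.log(RFC3986_QUERY.test('a=+'));   // true, "+" is a sub-delim
console.log(RFC3986_QUERY.test('a=%20')); // true
console.log(RFC3986_QUERY.test('a= '));   // false, a raw space is not allowed
console.log(RFC3986_QUERY.test('a=%2'));  // false, incomplete pct-encoded
```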

RFC3986 Section 2.3

URIs that differ in the replacement of an unreserved character with
its corresponding percent-encoded US-ASCII octet are equivalent: they
identify the same resource. However, URI comparison implementations
do not always perform normalization prior to comparison (see Section 6).
For consistency, percent-encoded octets in the ranges of ALPHA
(%41-%5A and %61-%7A), DIGIT (%30-%39), hyphen (%2D), period (%2E),
underscore (%5F), or tilde (%7E) should not be created by URI
producers and, when found in a URI, should be decoded to their
corresponding unreserved characters by URI normalizers.

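An unreserved character conforms to the standard whether it is escaped or not. The escaping meant here is percent-encoding, which is what we usually call URL encoding: a percent sign followed by the hexadecimal representation of a byte. So the space in the example above should be %20.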
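Percent-encoding is simple enough to do by hand. A rough sketch, assuming UTF-8 as the byte encoding (that is also what encodeURIComponent uses); percentEncodeChar is a made-up helper name.

```javascript
// Hypothetical helper: percent-encode every UTF-8 byte of the input,
// whether or not the character actually needs escaping.
function percentEncodeChar(text) {
  const bytes = new TextEncoder().encode(text); // UTF-8 bytes
  return Array.from(
    bytes,
    (byte) => '%' + byte.toString(16).toUpperCase().padStart(2, '0')
  ).join('');
}

console.log(percentEncodeChar(' '));  // %20
console.log(percentEncodeChar('~'));  // %7E
console.log(percentEncodeChar('中')); // %E4%B8%AD

// An escaped unreserved character decodes to the same thing as the raw one.
console.log(decodeURIComponent('%7E') === '~'); // true
```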

You can see that ~ belongs to unreserved. In x-www-form-urlencoded, however, the algorithm defined by the WHATWG standard does escape ~. That is another difference besides the space.
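Both differences show up side by side in a two-character value. A small sketch; the parameter name v is arbitrary.

```javascript
// Form serialization: "~" becomes %7E and the space becomes "+".
const formEncoded = new URLSearchParams({ v: '~ ' }).toString();
console.log(formEncoded); // v=%7E+

// Plain percent-encoding: "~" is left alone and the space becomes %20.
const percentEncoded = 'v=' + encodeURIComponent('~ ');
console.log(percentEncoded); // v=~%20

// A form decoder reads both back as the same value.
console.log(new URLSearchParams(formEncoded).get('v') === '~ ');    // true
console.log(new URLSearchParams(percentEncoded).get('v') === '~ '); // true
```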

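If things turn out the way it is envisioned here, we may end up with one more thing called RealURLSearchParams. They really go all out for compatibility and "Don't break the web".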

Tover

Off topic: this reminds me that orange said inconsistencies in URL parsing can (possibly) lead to SSRF

0x709394

Tover

https://www.blackhat.com/docs/us-17/thursday/us-17-Tsai-A-New-Era-Of-SSRF-Exploiting-URL-Parser-In-Trending-Programming-Languages.pdf found the slides

0x709394

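I only realized recently that when the url comes from user input there is quite a lot to think about, and the crux is encoding. First, you don't know whether what the user typed is an already urlencoded url. If it was copied straight from the browser address bar it is usually urlencoded automatically (it is displayed decoded, which is obvious with Chinese URLs). Second, you don't know the encoding of the content.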

For the first problem, what I do is just decode once no matter what. decode seems to be an idempotent operation.
For the second problem, I only recently found https://html.spec.whatwg.org/multipage/parsing.html#encoding-sniffing-algorithm
Roughly: BOM > MIME in the header > meta charset (in the current text of the spec, BOM sniffing comes first)
There are other formats too. For example, the XML standard requires the thing that declares the encoding to sit at the very beginning https://www.w3.org/TR/xml/#sec-TextDecl
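Of the sources listed above, the BOM is the easiest one to check by hand. A minimal sketch of only that step; sniffBom is a made-up name, and the header and meta charset steps are left out entirely.

```javascript
// Hypothetical helper: look at the first bytes of the content and report
// the encoding a byte order mark indicates, or null when there is none.
function sniffBom(bytes) {
  if (bytes.length >= 3 && bytes[0] === 0xef && bytes[1] === 0xbb && bytes[2] === 0xbf) {
    return 'UTF-8';
  }
  if (bytes.length >= 2 && bytes[0] === 0xfe && bytes[1] === 0xff) {
    return 'UTF-16BE';
  }
  if (bytes.length >= 2 && bytes[0] === 0xff && bytes[1] === 0xfe) {
    return 'UTF-16LE';
  }
  return null;
}

console.log(sniffBom(new Uint8Array([0xef, 0xbb, 0xbf, 0x3c]))); // UTF-8
console.log(sniffBom(new Uint8Array([0xff, 0xfe, 0x3c, 0x00]))); // UTF-16LE
console.log(sniffBom(new Uint8Array([0x3c, 0x68, 0x74, 0x6d]))); // null
```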

0x709394

0x709394 decode seems to be an idempotent operation.

No. Take %2525 for example

decodeURIComponent('%2525') // %25
decodeURIComponent(decodeURIComponent('%2525')) // %

https://tools.ietf.org/html/rfc3986#section-2.4

Implementations must not
percent-encode or decode the same string more than once, as decoding
an already decoded string might lead to misinterpreting a percent
data octet as the beginning of a percent-encoding, or vice versa in
the case of percent-encoding an already percent-encoded string.
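What the RFC warns about is easy to reproduce. In this sketch the "real" data is the three-character text %41, which is a value I made up for the demonstration.

```javascript
// The user really means these three characters: "%", "4", "1".
const original = '%41';

// Encoded once for transport: the "%" itself becomes %25.
const onTheWire = encodeURIComponent(original);
console.log(onTheWire); // %2541

// Decoding once recovers the data.
const once = decodeURIComponent(onTheWire);
console.log(once); // %41

// Decoding again misreads data as an escape sequence.
const twice = decodeURIComponent(once);
console.log(twice); // A
```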

0x709394

const isPercentEncoded = decodeURIComponent(url) !== url

But it falls flat when spaces have been turned into plus signs
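A sketch of where the check goes wrong, with the one-liner wrapped in a made-up function looksPercentEncoded. Besides the plus sign there is a second trap: decodeURIComponent throws on a stray %, so the try/catch is my addition.

```javascript
// Hypothetical wrapper around the check above. Malformed input such as a
// stray "%" makes decodeURIComponent throw, so that case is treated as
// "not encoded" here.
function looksPercentEncoded(input) {
  try {
    return decodeURIComponent(input) !== input;
  } catch (err) {
    if (err instanceof URIError) return false;
    throw err;
  }
}

console.log(looksPercentEncoded('a%20b')); // true
console.log(looksPercentEncoded('a b'));   // false
console.log(looksPercentEncoded('100%'));  // false, decoding throws URIError

// Form-encoded input slips through: decodeURIComponent leaves "+" alone.
console.log(looksPercentEncoded('a+b'));   // false
console.log(new URLSearchParams('q=a+b').get('q')); // a b
```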