A pile of links 🔗. If you have the patience to read them, skip my rambling; the links are the better read.
https://github.com/whatwg/url/issues/491
https://github.com/sindresorhus/got/issues/1234
https://github.com/sindresorhus/got/pull/1246
https://url.spec.whatwg.org/#urlsearchparams
https://url.spec.whatwg.org/#urlencoded-serializing
https://tools.ietf.org/html/rfc1738
https://tools.ietf.org/html/rfc3986
JavaScript comes with a standard Web API, the newer URL API, for working with URLs. A URL object has two properties related to the query string: search and searchParams. searchParams is a URLSearchParams, and yet this URLSearchParams is not really fit for use on a URL.
The reason is that URLSearchParams escapes with application/x-www-form-urlencoded, the encoding used when an HTML form is submitted. (An HTML form submission is also formatted as a query string.)
Take Scholat (学者网), which in 2020 still does not use HTTPS even for login, as an example: the space is escaped as +.
const u = new URLSearchParams()
u.set('a', ' ')
u.toString() // a=+
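The same thing happens through a URL object. The following is a minimal sketch, assuming a made-up example.com URL: reading searchParams leaves the query alone, but any write through it re-serializes every pair with form encoding.

```javascript
// Hypothetical URL, for illustration only.
const url = new URL('https://example.com/?a=%20');

// Reading does not touch the original query string.
const before = url.search;                 // '?a=%20'
const decoded = url.searchParams.get('a'); // ' '

// Any write through searchParams re-serializes the whole query
// with application/x-www-form-urlencoded.
url.searchParams.set('b', ' ');
const after = url.search;                  // '?a=+&b=+'

console.log(before, JSON.stringify(decoded), after);
```

Note that the existing %20 for a is rewritten to + as well, even though only b was set.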
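There are two sets of URL standards: one from WHATWG and one from IETF.
IETF in turn has two: RFC 1738 from 1994 and RFC 3986 from 2005. In Node.js, the legacy url API, which is what you get from require('url'), follows RFC 3986. Newer versions of Node.js replace the legacy url API with the WHATWG URL standard.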
RFC 1738, Section 2.2:
Usually a URL has the same interpretation when an octet is
represented by a character and when it encoded. However, this is not
true for reserved characters: encoding a character reserved for a
particular scheme may change the semantics of a URL.
Thus, only alphanumerics, the special characters "$-_.+!*'(),", and
reserved characters used for their reserved purposes may be used
unencoded within a URL.
On the other hand, characters that are not required to be encoded
(including alphanumerics) may be encoded within the scheme-specific
part of a URL, as long as they are not being used for a reserved
purpose.
RFC 3986, Section 3.4 defines the query of a URL more precisely:
The query component is indicated by the first question
mark ("?") character and terminated by a number sign ("#") character
or by the end of the URI.
It also says:
The characters slash ("/") and question mark ("?") may represent data
within the query component. Beware that some older, erroneous
implementations may not handle such data correctly when it is used as
the base URI for relative references (Section 5.1), apparently
because they fail to distinguish query data from path data when
looking for hierarchical separators. However, as query components
are often used to carry identifying information in the form of
"key=value" pairs and one frequently used value is a reference to
another URI, it is sometimes better for usability to avoid percent-
encoding those characters.
So a query such as ?url=https://tools.ietf.org/html/rfc3986 is legal even without escaping.
JavaScript provides encodeURI and encodeURIComponent. encodeURI does not encode the colon, slashes, and similar characters of a URL placed inside the query, while encodeURIComponent does. In this case either one works.
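A small sketch of the difference, using the RFC link from above as the embedded value (the example.com host used to decode it is made up):

```javascript
const target = 'https://tools.ietf.org/html/rfc3986';

// encodeURI leaves reserved characters such as ":" and "/" alone.
const loose = '?url=' + encodeURI(target);
// encodeURIComponent percent-encodes them.
const strict = '?url=' + encodeURIComponent(target);

console.log(loose);  // ?url=https://tools.ietf.org/html/rfc3986
console.log(strict); // ?url=https%3A%2F%2Ftools.ietf.org%2Fhtml%2Frfc3986

// Both forms decode back to the same value.
const sameValue =
  new URL('https://example.com/' + loose).searchParams.get('url') ===
  new URL('https://example.com/' + strict).searchParams.get('url');
console.log(sameValue); // true
```

If the embedded URL itself carried &, =, or #, only the encodeURIComponent form would keep it intact, since encodeURI leaves those characters unescaped.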
The ABNF definitions related to query in RFC 3986 are as follows:
pchar = unreserved / pct-encoded / sub-delims / ":" / "@"
query = *( pchar / "/" / "?" )
fragment = *( pchar / "/" / "?" )
pct-encoded = "%" HEXDIG HEXDIG
unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
sub-delims = "!" / "$" / "&" / "'" / "(" / ")"
/ "*" / "+" / "," / ";" / "="
*( ) means that whatever is inside the parentheses may repeat zero or more times; it comes up all the time when reading RFCs.
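The query rule translates almost directly into a regular expression. This is only a sketch: the helper name isRfc3986Query is made up here, and it expects the query without the leading "?".

```javascript
// One unreserved / sub-delims / ":" / "@" / "/" / "?" character,
// or one pct-encoded triplet, repeated zero or more times.
const QUERY_RE = /^(?:[A-Za-z0-9\-._~!$&'()*+,;=:@\/?]|%[0-9A-Fa-f]{2})*$/;

// Hypothetical helper, not part of any standard API.
function isRfc3986Query(query) {
  return QUERY_RE.test(query);
}

console.log(isRfc3986Query('url=https://tools.ietf.org/html/rfc3986')); // true
console.log(isRfc3986Query('a=%20')); // true
console.log(isRfc3986Query('a= '));   // false, a raw space is not allowed
console.log(isRfc3986Query('a=%2'));  // false, incomplete triplet
```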
RFC 3986, Section 2.3:
URIs that differ in the replacement of an unreserved character with
its corresponding percent-encoded US-ASCII octet are equivalent: they
identify the same resource. However, URI comparison implementations
do not always perform normalization prior to comparison (see Section 6).
For consistency, percent-encoded octets in the ranges of ALPHA
(%41-%5A and %61-%7A), DIGIT (%30-%39), hyphen (%2D), period (%2E),
underscore (%5F), or tilde (%7E) should not be created by URI
producers and, when found in a URI, should be decoded to their
corresponding unreserved characters by URI normalizers.
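An unreserved character conforms to the standard whether it is escaped or not. The escaping meant here is percent-encoding, what we usually call URL encoding: a percent sign followed by the hexadecimal representation of a byte. So the space in the example above should be %20.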
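The normalization the RFC describes can be sketched as follows. The helper name decodeUnreserved is made up, and it only handles the unreserved set quoted above; every other triplet is left as it is.

```javascript
const UNRESERVED = /^[A-Za-z0-9\-._~]$/;

// Hypothetical helper: decode percent-encoded unreserved characters
// and leave every other triplet untouched.
function decodeUnreserved(uri) {
  return uri.replace(/%([0-9A-Fa-f]{2})/g, (triplet, hex) => {
    const ch = String.fromCharCode(parseInt(hex, 16));
    return UNRESERVED.test(ch) ? ch : triplet;
  });
}

console.log(decodeUnreserved('?a=%7E%41%20')); // ?a=~A%20
```

%7E and %41 stand for the unreserved characters ~ and A, so they are decoded; %20 stands for a space, which must stay encoded.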
As you can see, ~ is unreserved. In application/x-www-form-urlencoded, however, the algorithm defined by the WHATWG standard does escape ~. This is another difference besides the space.
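Both differences show up when the same value goes through URLSearchParams and through plain percent-encoding. A minimal sketch, with a made-up key and value:

```javascript
const value = '~ ';

// application/x-www-form-urlencoded, as used by URLSearchParams.
const form = new URLSearchParams([['a', value]]).toString();

// Plain percent-encoding of the same pair.
const percent = `a=${encodeURIComponent(value)}`;

console.log(form);    // a=%7E+
console.log(percent); // a=~%20
```

URLSearchParams parses both strings back to the same value, but a consumer that follows RFC 3986 strictly would read the + as a literal plus sign.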
If things play out the way it is envisioned here, we may one day get an additional RealURLSearchParams. They really do go all out for compatibility: don't break the web.