Character class: [...], [^...]

A character class matches any character in or not in a custom set of characters. When the v flag is enabled, it can also be used to match finite-length strings.

Syntax

regex
[]
[abc]
[A-Z]

[^]
[^abc]
[^A-Z]

// `v` mode only
[operand1&&operand2]
[operand1--operand2]
[\q{substring}]

Parameters

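operand1, operand2

: Can be a single character, another square-bracket-enclosed character class, a character class escape, a Unicode character class escape, or a string using the \q syntax.

substring

: A literal string.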

Description

A character class specifies a list of characters between square brackets and matches any character in the list. The v flag drastically changes how character classes are parsed and interpreted. The following syntaxes are available in both v mode and non-v mode:

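  • A single character: matches the character itself.
  • A range of characters: matches any character in the inclusive range. The range is specified by two characters separated by a dash (-). The character value of the first character must not be greater than that of the second. The character value is the Unicode code point of the character. Because Unicode code points are usually assigned to alphabets in order, [a-z] specifies all lowercase Latin characters, while [α-ω] specifies all lowercase Greek characters. In Unicode-unaware mode, regexes are interpreted as a sequence of BMP characters. Therefore, surrogate pairs in character classes represent two characters instead of one; see below for details.
  • Escape sequences: \b, \-, character class escapes, Unicode character class escapes, and other character escapes.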

These syntaxes can occur any number of times, and the character sets they represent are unioned. For example, /[a-zA-Z0-9]/ matches any ASCII letter or digit.
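As a quick check of how ranges and unions combine, the following sketch tests a few characters against the classes mentioned above. The constant names and sample characters are chosen purely for illustration.

```javascript
// A union of three ranges: any ASCII letter or digit.
const alphanumeric = /[a-zA-Z0-9]/;
console.log(alphanumeric.test("q")); // true
console.log(alphanumeric.test("7")); // true
console.log(alphanumeric.test("_")); // false

// Ranges follow code point order, so this spans U+03B1 (α) to U+03C9 (ω).
const lowercaseGreek = /[α-ω]/;
console.log(lowercaseGreek.test("λ")); // true
console.log(lowercaseGreek.test("Λ")); // false (U+039B lies outside the range)
```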

The ^ prefix in a character class creates a complement class. For example, [^abc] matches any character except a, b, or c. The ^ character is a literal character when it appears in the middle of a character class. For example, [a^b] matches the characters a, ^, and b.
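The difference between a leading ^ and a ^ elsewhere in the class can be seen in a short sketch (the constant names are illustrative only):

```javascript
// A leading ^ complements the class.
const notAbc = /[^abc]/;
console.log(notAbc.test("a")); // false
console.log(notAbc.test("d")); // true

// Anywhere else, ^ is an ordinary character.
const caretLiteral = /[a^b]/;
console.log(caretLiteral.test("^")); // true
console.log(caretLiteral.test("c")); // false
```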

The lexical grammar does a very rough parse of regex literals, so that it does not end the regex literal at a / character which appears within a character class. This means /[/]/ is valid without needing to escape the /.
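A minimal sketch of this, with an illustrative constant name, shows that the unescaped and escaped forms behave the same:

```javascript
// The / inside the brackets does not terminate the literal.
const slashClass = /[/]/;
console.log(slashClass.test("a/b")); // true
console.log(slashClass.test("ab")); // false

// Escaping the slash is still allowed and matches the same character.
const escapedSlashClass = /[\/]/;
console.log(escapedSlashClass.test("a/b")); // true
```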

The boundaries of a character range must not specify more than one character, which happens if you use a character class escape. For example:

js
/[\s-9]/u; // SyntaxError: Invalid regular expression: Invalid character class

In Unicode-unaware mode, a character range in which one boundary is a character class escape makes the - a literal character. This is a deprecated syntax for web compatibility, and you should not rely on it.

js
/[\s-9]/.test("-"); // true

In Unicode-unaware mode, regexes are interpreted as a sequence of BMP characters. Therefore, surrogate pairs in character classes represent two characters instead of one.

js
/[😄]/.test("\ud83d"); // true
/[😄]/u.test("\ud83d"); // false

/[😄-😛]/.test("😑"); // SyntaxError: Invalid regular expression: /[😄-😛]/: Range out of order in character class
/[😄-😛]/u.test("😑"); // true

Even if the pattern ignores case, the case of the two ends of a range is significant in determining which characters belong to the range. For example, the pattern /[E-F]/i only matches E, F, e, and f, while the pattern /[E-f]/i matches all uppercase and lowercase ASCII letters (because it spans over E–Z and a–f), as well as [, \, ], ^, _, and `.
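The effect of the range endpoints under the i flag can be checked directly. This is a small sketch; the names narrowRange and wideRange are illustrative.

```javascript
const narrowRange = /[E-F]/i;
const wideRange = /[E-f]/i;

console.log(narrowRange.test("e")); // true
console.log(narrowRange.test("g")); // false

// E-f spans E–Z, the punctuation between Z and a, and a–f,
// so with case folding every ASCII letter matches.
console.log(wideRange.test("g")); // true
console.log(wideRange.test("B")); // true
console.log(wideRange.test("_")); // true
console.log(wideRange.test("0")); // false
```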

Non-v-mode character class

Non-v-mode character classes interpret most characters literally and have fewer restrictions on the characters they can contain. For example, . is the literal dot character, not the wildcard. The only characters that cannot appear literally are \, ], and -.

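  • In character classes, most escape sequences are supported, except \b, \B, and backreferences. \b indicates a backspace character instead of a word boundary, while the other two cause syntax errors. To use \ literally, escape it as \\.
  • The ] character indicates the end of the character class. To use it literally, escape it as \].
  • The dash (-) character indicates a range when used between two characters. When it appears at the start or end of a character class, it is a literal character. It is also a literal character when it is used as the boundary of a range. For example, [a-] matches the characters a and -, [!--] matches the characters ! to -, and [--9] matches the characters - to 9. You can also escape it as \- if you want to use it literally anywhere.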
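The rules in this list can be exercised with a few one-line checks. This is only a sketch; the helper name nonVModeChecks and the sample inputs are made up for illustration.

```javascript
// Collect the results in an object so each rule is easy to inspect.
const nonVModeChecks = {
  dotIsLiteral: /[.]/.test("x"), // false: . is not a wildcard here
  trailingDash: /[a-]/.test("-"), // true: a dash at the end is literal
  dashAsUpperBound: /[!--]/.test("&"), // true: & (U+0026) is between ! and -
  dashAsLowerBound: /[--9]/.test("5"), // true: 5 is between - and 9
  backspace: /[\b]/.test("\b"), // true: \b means U+0008 inside a class
  escapedBracket: /[\]]/.test("]"), // true
  escapedBackslash: /[\\]/.test("\\"), // true
};

console.log(nonVModeChecks);
```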

v-mode character class

The basic idea of character classes in v mode remains the same: you can still use most characters literally, use - to denote character ranges, and use escape sequences. One of the most important features of the v flag is set notation within character classes. As previously mentioned, normal character classes can express unions by concatenating two ranges, such as using [A-Z0-9] to mean "the union of the set [A-Z] and the set [0-9]". However, without the v flag there is no easy way to express other set operations, such as intersection and difference.

With the v flag, intersection is expressed with &&, and subtraction with --. The absence of both implies union. The two operands of && or -- can be a character, character escape, character class escape, or even another character class. For example, to express "a word character that's not an underscore", you can use [\w--_]. You cannot mix operators on the same level. For example, [\w&&[A-z]--_] is a syntax error. However, because you can nest character classes, you can be explicit by writing [\w&&[[A-z]--_]] or [[\w&&[A-z]]--_] (which both mean [A-Za-z]). Similarly, [AB--C] is invalid and you need to write [A[B--C]] (which just means [AB]).

In v mode, the Unicode character class escape \p can match finite-length strings, such as emojis. For symmetry, regular character classes can also match more than one character. To write a "string literal" in a character class, you wrap the string in \q{...}. The only regex syntax supported here is disjunction. Apart from this, \q must completely enclose literals (including escaped characters). This ensures that character classes can only match finite-length strings with finitely many possibilities.

Because the character class syntax is now more sophisticated, more characters are reserved and forbidden from appearing literally.

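  • In addition to ] and \, the following characters must be escaped in character classes if they represent literal characters: (, ), [, {, }, /, -, |. This list is somewhat similar to the list of syntax characters, except that ^, $, *, +, and ? are not reserved inside character classes, while / and - are not reserved outside character classes (although / may delimit a regex literal and therefore still needs to be escaped). All these characters may also be optionally escaped in u-mode character classes.
  • The following "double punctuator" sequences must be escaped too (but they don't make much sense without the v flag anyway): &&, !!, ##, $$, %%, **, ++, ,,, .., ::, ;;, <<, ==, >>, ??, @@, ^^, ``, ~~. In u mode, some of these characters can only appear literally within character classes and cause a syntax error when escaped. In v mode, they must be escaped when appearing in pairs, but can be optionally escaped when appearing alone. For example, /[\!]/u is invalid because it is an identity escape, but both /[\!]/v and /[!]/v are valid, while /[!!]/v is invalid. The literal character reference has a detailed table of which characters can appear escaped or unescaped.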

Complement character classes [^...] can never match strings longer than one character. For example, [\q{ab|c}] is valid and matches the string "ab", but [^\q{ab|c}] is invalid because it's unclear how many characters should be consumed. The check is done by verifying that every \q contains only single characters and every \p specifies a character property: for unions, all operands must be purely characters; for intersections, at least one operand must be purely characters; for subtraction, the leftmost operand must be purely characters. The check is syntactic and does not look at the actual character set being specified, which means that although /[^\q{ab|c}--\q{ab}]/v is equivalent to /[^c]/v, it's still rejected.

Complement classes and case-insensitive matching

In non-v mode, complement character classes [^...] are implemented by simply inverting the match result: [^...] matches whenever [...] doesn't match, and vice versa. However, the other complement classes, such as \P{...} and \W, work by eagerly constructing the set consisting of all characters without the specified property. The two approaches seem to produce the same behavior, but they diverge when combined with case-insensitive matching.

Consider the following two regexes:

js
const r1 = /\p{Lowercase_Letter}/iu;
const r2 = /[^\P{Lowercase_Letter}]/iu;

r2 is a double negation and seems to be equivalent to r1. But in fact, r1 matches all lowercase and uppercase ASCII letters, while r2 matches none. To illustrate how it works, pretend that we are only dealing with ASCII characters, not the entire Unicode character set, and that r1 and r2 are specified as below:

js
const r1 = /[a-z]/iu;
const r2 = /[^A-Z]/iu;

Recall that case-insensitive matching happens by folding both the pattern and the input to the same case (see ignoreCase for more details). For r1, the character class a-z stays the same after case folding, while both uppercase and lowercase ASCII string inputs are folded to lowercase, so r1 is able to match both "A" and "a". For r2, the character class A-Z is folded to a-z; however, ^ negates the match result, so that [^A-Z] in effect only matches uppercase strings. However, both uppercase and lowercase ASCII string inputs are still folded to lowercase, causing r2 to match nothing.

In v mode, this behavior is fixed: [^...] also eagerly constructs the complement class instead of negating the match result. This makes [^\P{Lowercase_Letter}] and \p{Lowercase_Letter} strictly equivalent.

Examples

Matching hexadecimal digits

The following function determines whether a string is a valid hexadecimal number:

js
function isHexadecimal(str) {
  return /^[0-9A-F]+$/i.test(str);
}

isHexadecimal("2F3"); // true
isHexadecimal("beef"); // true
isHexadecimal("undefined"); // false

Using intersection

The following function matches Greek letters.

js
function greekLetters(str) {
  return str.match(/[\p{Script_Extensions=Greek}&&\p{Letter}]/gv);
}

// 𐆊 is U+1018A GREEK ZERO SIGN
greekLetters("π𐆊P0零αAΣ"); // [ 'π', 'α', 'Σ' ]

Using subtraction

The following function matches all decimal digits other than the ASCII digits 0–9.

js
function nonASCIINumbers(str) {
  return str.match(/[\p{Decimal_Number}--[0-9]]/gv);
}

// 𑜹 is U+11739 AHOM DIGIT NINE
nonASCIINumbers("𐆊0零1𝟜𑜹a"); // [ '𝟜', '𑜹' ]

Matching strings

The following function matches all line terminator sequences, including the line terminator characters and the sequence \r\n (CRLF).

js
function getLineTerminators(str) {
  return str.match(/[\r\n\u2028\u2029\q{\r\n}]/gv);
}

getLineTerminators("A poem\rIs split\r\nInto many\nStanzas"); // [ '\r', '\r\n', '\n' ]

This example is equivalent to /(?:\r\n|\r|\n|\u2028|\u2029)/gu or /(?:\r\n|[\r\n\u2028\u2029])/gu, except shorter. The \r\n alternative has to come first in those forms, because a v-mode character class tries its longest strings first, whereas a disjunction tries its alternatives from left to right.

The most useful case of \q{} is when doing subtraction and intersection. Previously, this was possible with multiple lookaheads. The following function matches flags that are not one of the American, Chinese, Russian, British, and French flags.

js
function notUNSCPermanentMember(flag) {
  return /^[\p{RGI_Emoji_Flag_Sequence}--\q{🇺🇸|🇨🇳|🇷🇺|🇬🇧|🇫🇷}]$/v.test(flag);
}

notUNSCPermanentMember("🇺🇸"); // false
notUNSCPermanentMember("🇩🇪"); // true

This example is mostly equivalent to /^(?!🇺🇸|🇨🇳|🇷🇺|🇬🇧|🇫🇷)\p{RGI_Emoji_Flag_Sequence}$/v, except perhaps more performant.

Specifications

ECMAScript Language Specification, # prod-CharacterClass

Browser compatibility

See also