Unicode character class escape: \p{...}, \P{...}
A Unicode character class escape is a kind of character class escape that matches a set of characters specified by a Unicode property. It is only supported in Unicode-aware mode. When the v flag is enabled, it can also be used to match finite-length strings.
Syntax
\p{loneProperty}
\P{loneProperty}
\p{property=value}
\P{property=value}
Parameters
loneProperty
- A lone Unicode property name or value, following the same syntax as value. It specifies the value of the General_Category property, or a binary property name. In v mode, it can also be a binary Unicode property of strings.
Note: ICU syntax allows omitting the Script property name as well, but JavaScript does not support this, because most of the time Script_Extensions is more useful than Script.
property
- A Unicode property name. Must be made of ASCII letters (A–Z, a–z) and underscores (_), and must be one of the non-binary property names.
value
- A Unicode property value. Must be made of ASCII letters (A–Z, a–z), underscores (_), and digits (0–9), and must be one of the supported values listed in PropertyValueAliases.txt.
Description
\p and \P are only supported in Unicode-aware mode. In Unicode-unaware mode, they are identity escapes for the p or P character.
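The difference between the two modes can be checked directly. The following is a minimal sketch; the variable names are illustrative only.

```js
// Without the u or v flag, \p is an identity escape for "p",
// so this pattern matches the literal text "p{L}".
const unaware = /\p{L}/;
// With the u flag, \p{L} is a Unicode property escape for letters.
const aware = /\p{L}/u;

const unawareMatchesLiteral = unaware.test("p{L}");
const unawareMatchesLetter = unaware.test("a");
const awareMatchesLetter = aware.test("a");
const awareMatchesDigits = aware.test("123");

console.log(unawareMatchesLiteral); // true
console.log(unawareMatchesLetter); // false
console.log(awareMatchesLetter); // true
console.log(awareMatchesDigits); // false
```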
Every Unicode character has a set of properties that describe it. For example, the character a has the General_Category property with value Lowercase_Letter and the Script property with value Latn. The \p and \P escape sequences allow you to match a character based on its properties. For example, a can be matched by \p{Lowercase_Letter} (the General_Category property name is optional) as well as by \p{Script=Latn}. \P creates a complement class that consists of the code points without the specified property.
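As a small sketch of the behavior described above (the constant names are made up for illustration), each pattern is anchored so that it tests exactly one character:

```js
const isLowercaseLetter = /^\p{Lowercase_Letter}$/u;
const isLatin = /^\p{Script=Latn}$/u;
// \P is the complement: any code point that is NOT a lowercase letter.
const isNotLowercaseLetter = /^\P{Lowercase_Letter}$/u;

console.log(isLowercaseLetter.test("a")); // true
console.log(isLatin.test("a")); // true
console.log(isNotLowercaseLetter.test("a")); // false
console.log(isNotLowercaseLetter.test("A")); // true
console.log(isNotLowercaseLetter.test("1")); // true
```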
To compose multiple properties, use the character set intersection syntax enabled with the v flag, or see pattern subtraction and intersection.
In v mode, \p may match a sequence of code points, defined in Unicode as "properties of strings". This is most useful for emojis, which are often composed of multiple code points. However, \P can only complement character properties.
Note: There are plans to port the properties of strings feature to u mode as well.
Examples
General categories
General categories are used to classify Unicode characters, and subcategories are available to define a more precise categorization. Both short and long forms can be used in Unicode property escapes.
They can be used to match letters, numbers, symbols, punctuation, spaces, etc. For a more exhaustive list of general categories, please refer to the Unicode specification.
// finding all the letters of a text
const story = "It's the Cheshire Cat: now I shall have somebody to talk to.";
// Most explicit form
story.match(/\p{General_Category=Letter}/gu);
// It is not mandatory to use the property name for General categories
story.match(/\p{Letter}/gu);
// This is equivalent (short alias):
story.match(/\p{L}/gu);
// This is also equivalent (disjunction of all the subcategories using short aliases)
story.match(/\p{Lu}|\p{Ll}|\p{Lt}|\p{Lm}|\p{Lo}/gu);
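To check that these forms really are interchangeable, and to show a couple of other categories, here is a short sketch. The variable names other than story are illustrative.

```js
const story = "It's the Cheshire Cat: now I shall have somebody to talk to.";

const longForm = story.match(/\p{General_Category=Letter}/gu);
const shortForm = story.match(/\p{L}/gu);
const bySubcategory = story.match(/\p{Lu}|\p{Ll}|\p{Lt}|\p{Lm}|\p{Lo}/gu);

// All three patterns find the same letters in the same order.
const sameResult =
  longForm.join("") === shortForm.join("") &&
  shortForm.join("") === bySubcategory.join("");
console.log(sameResult); // true

// Subcategories narrow the match: Lu is Uppercase_Letter.
const uppercase = story.match(/\p{Lu}/gu);
console.log(uppercase); // ["I", "C", "C", "I"]

// P is the short alias for Punctuation.
const punctuation = story.match(/\p{P}/gu);
console.log(punctuation); // ["'", ":", "."]
```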
Scripts and script extensions
Some languages use different scripts for their writing system. For instance, English and Spanish are written using the Latin script, while Arabic and Russian are written with other scripts (Arabic and Cyrillic, respectively). The Script and Script_Extensions Unicode properties allow regular expressions to match characters according to the script they are mainly used with (Script) or according to the set of scripts they belong to (Script_Extensions).
For example, A belongs to the Latin script and ε to the Greek script.
const mixedCharacters = "aεЛ";
// Using the canonical "long" name of the script
mixedCharacters.match(/\p{Script=Latin}/u); // a
// Using a short alias (ISO 15924 code) for the script
mixedCharacters.match(/\p{Script=Grek}/u); // ε
// Using the short name sc for the Script property
mixedCharacters.match(/\p{sc=Cyrillic}/u); // Л
For more details, refer to the Unicode specification, the Scripts table in the ECMAScript specification, and the ISO 15924 list of script codes.
If a character is used in a limited set of scripts, the Script property will only match the "predominant" script. To match characters based on a "non-predominant" script, use the Script_Extensions property (Scx for short).
// ٢ is the digit 2 in Arabic-Indic notation
// while it is predominantly written within the Arabic script
// it can also be written in the Thaana script
"٢".match(/\p{Script=Thaana}/u);
// null as Thaana is not the predominant script
"٢".match(/\p{Script_Extensions=Thaana}/u);
// ["٢", index: 0, input: "٢", groups: undefined]
Unicode property escapes vs. character classes
With JavaScript regular expressions, it is also possible to use character class escapes, especially \w or \d, to match letters or digits. However, these forms only match Basic Latin (ASCII) characters: \w matches a to z, A to Z, 0 to 9, and _, while \d matches 0 to 9. As the example below shows, working with non-Latin text this way can be clumsy.
Unicode property escape categories encompass many more characters, and \p{Letter} or \p{Number} will work for any script.
// Trying to use ranges to avoid \w limitations:
const nonEnglishText = "Приключения Алисы в Стране чудес";
const regexpBMPWord = /([\u0000-\u001F\u0021-\uFFFF])+/gu;
// BMP goes through U+0000 to U+FFFF but space is U+0020
console.table(nonEnglishText.match(regexpBMPWord));
// Using Unicode property escapes instead
const regexpUPE = /\p{L}+/gu;
console.table(nonEnglishText.match(regexpUPE));
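The limitation of \w and \d can be seen side by side with their property-based counterparts. This is a minimal sketch; the class [\p{L}\p{N}_] is one possible Unicode-aware stand-in for \w, chosen here for illustration.

```js
const text = "Приключения Алисы в Стране чудес";

// \w only covers ASCII letters, digits, and underscore, so nothing matches.
const asciiWords = text.match(/\w+/g);
console.log(asciiWords); // null

// Letters, numbers, and underscore from any script.
const unicodeWords = text.match(/[\p{L}\p{N}_]+/gu);
console.log(unicodeWords.length); // 5

// U+0662 is the Arabic-Indic digit two.
const asciiDigit = /\d/.test("\u0662");
const unicodeDigit = /\p{Nd}/u.test("\u0662");
console.log(asciiDigit); // false
console.log(unicodeDigit); // true
```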
Matching prices
The following example matches prices in a string:
function getPrices(str) {
  // Sc stands for "currency symbol"
  return [...str.matchAll(/\p{Sc}\s*[\d.,]+/gu)].map((match) => match[0]);
}
const str = `California rolls $6.99
Crunchy rolls $8.49
Shrimp tempura $10.99`;
console.log(getPrices(str)); // ["$6.99", "$8.49", "$10.99"]
const str2 = `US store $19.99
Europe store €18.99
Japan store ¥2000`;
console.log(getPrices(str2)); // ["$19.99", "€18.99", "¥2000"]
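If the currency symbol and the amount are needed separately, named capturing groups can be added to the same pattern. The sketch below is a variation on the example above; parsePrices and the group names symbol and amount are hypothetical names introduced here.

```js
// Hypothetical helper: returns the symbol and amount of each price as strings.
function parsePrices(str) {
  const pricePattern = /(?<symbol>\p{Sc})\s*(?<amount>[\d.,]+)/gu;
  return [...str.matchAll(pricePattern)].map(({ groups }) => ({
    symbol: groups.symbol,
    amount: groups.amount,
  }));
}

const parsed = parsePrices(
  "US store $19.99\nEurope store €18.99\nJapan store ¥2000",
);
console.log(parsed.map((price) => price.symbol)); // ["$", "€", "¥"]
console.log(parsed.map((price) => price.amount)); // ["19.99", "18.99", "2000"]
```

The amount is kept as a string because decimal and grouping separators differ between locales, so converting it to a number is left to the caller.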
Matching strings
With the v flag, \p{…} can match strings that are potentially longer than one character by using a property of strings:
const flag = "🇺🇳";
console.log(flag.length); // 4 (two code points, each stored as two UTF-16 code units)
console.log(/\p{RGI_Emoji_Flag_Sequence}/v.exec(flag)); // [ '🇺🇳' ]
However, you can't use \P to match "a string that does not have a property", because it's unclear how many characters should be consumed.
/\P{RGI_Emoji_Flag_Sequence}/v; // SyntaxError: Invalid regular expression: Invalid property name
Specifications
Specification |
---|
ECMAScript Language Specification # prod-CharacterClassEscape |
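See also
- Character classes guide
- Regular expressions
- Character class: [...], [^...]
- Character class escape: \d, \D, \w, \W, \s, \S
- Character escape: \n, \u{...}
- Disjunction: |
- Unicode character property on Wikipedia
- ES2018: RegExp Unicode property escapes by Dr. Axel Rauschmayer (2017)
- Unicode regular expressions § Properties
- Unicode Utilities: UnicodeSet
- RegExp v flag with set notation and properties of strings on v8.dev (2022)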