Unicode character class escape: \p{...}, \P{...}

A Unicode character class escape is a kind of character class escape that matches a set of characters specified by a Unicode property. It is only supported in Unicode-aware mode. When the v flag is enabled, it can also be used to match finite-length strings.

Syntax

regex
\p{loneProperty}
\P{loneProperty}

\p{property=value}
\P{property=value}

Parameters

loneProperty

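A lone Unicode property name or value, following the same syntax as value. It specifies either a value of the General_Category property or a binary property name. In v mode, it can also be a binary Unicode property of strings.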

Note: ICU syntax allows omitting the Script property name as well, but JavaScript does not support this, because most of the time Script_Extensions is more useful than Script.

property

A Unicode property name. It must be made of ASCII letters (A–Z, a–z) and underscores (_), and must be one of the non-binary property names.

value

A Unicode property value. It must be made of ASCII letters (A–Z, a–z), underscores (_), and digits (0–9), and must be one of the supported values listed in PropertyValueAliases.txt.
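
The syntax forms above can be sketched as follows. This is a minimal illustration, not an exhaustive list; `forms` and `isValidPattern` are hypothetical names introduced only for this example.

```javascript
// One regular expression per syntax form; each tests a single character.
const forms = {
  loneValue: /^\p{Lu}$/u, // General_Category value, property name omitted
  loneBinary: /^\p{Alphabetic}$/u, // binary property name
  nameAndValue: /^\p{Script=Greek}$/u, // non-binary property with a value
  negated: /^\P{Lu}$/u, // complement of \p{Lu}
};

console.log(forms.loneValue.test("A")); // true
console.log(forms.loneBinary.test("\u00E9")); // true ("é")
console.log(forms.nameAndValue.test("ω")); // true
console.log(forms.negated.test("A")); // false

// Hypothetical helper: reports whether a pattern source compiles.
function isValidPattern(source, flags) {
  try {
    new RegExp(source, flags);
    return true;
  } catch (err) {
    if (err instanceof SyntaxError) return false;
    throw err;
  }
}

// The ICU-style shorthand with the Script name omitted is rejected.
console.log(isValidPattern("\\p{Greek}", "u")); // false
console.log(isValidPattern("\\p{Script=Greek}", "u")); // true
```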

Description

\p and \P are only supported in Unicode-aware mode. In Unicode-unaware mode, they are identity escapes for the p or P character.
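
A small sketch of the difference between the two modes. The variable names are illustrative only.

```javascript
// Without the u (or v) flag, \p is just an escaped "p",
// and the braces that follow are literal text.
const unicodeUnaware = /\p{L}/;
console.log(unicodeUnaware.test("p{L}")); // true
console.log(unicodeUnaware.test("a")); // false

// With the u flag, \p{L} is a Unicode property escape for letters.
const unicodeAware = /\p{L}/u;
console.log(unicodeAware.test("a")); // true
console.log(unicodeAware.test("123")); // false
```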

Every Unicode character has a set of properties that describe it. For example, the character a has the General_Category property with value Lowercase_Letter, and the Script property with value Latn. The \p and \P escape sequences allow you to match a character based on its properties. For example, a can be matched by \p{Lowercase_Letter} (the General_Category property name is optional) as well as by \p{Script=Latn}. \P creates a complement class that consists of the code points without the specified property.
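
The paragraph above can be checked with a short sketch. The names `isLowercaseLetter`, `isLatin`, and `sample` are assumptions made for illustration.

```javascript
const isLowercaseLetter = /^\p{Lowercase_Letter}$/u;
const isLatin = /^\p{Script=Latn}$/u;

console.log(isLowercaseLetter.test("a"), isLatin.test("a")); // true true
console.log(isLowercaseLetter.test("A"), isLatin.test("A")); // false true
console.log(isLowercaseLetter.test("я"), isLatin.test("я")); // true false

// \P{...} matches every code point that lacks the property.
// A negated character class containing \p{...} selects the same set.
const sample = "aBc1";
console.log(sample.match(/\P{Lowercase_Letter}/gu)); // ["B", "1"]
console.log(sample.match(/[^\p{Lowercase_Letter}]/gu)); // ["B", "1"]
```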

To combine multiple properties, use the character set intersection syntax enabled with the v flag, or see pattern subtraction and intersection.

In v mode, \p may match a sequence of code points, defined in Unicode as "properties of strings". This is most useful for emojis, which are often composed of multiple code points. However, \P can only complement character properties.

Note: There are plans to port the properties of strings feature to u mode as well.

Examples

General categories

General categories are used to classify Unicode characters, and subcategories are available for a more precise categorization. Both short and long forms can be used in Unicode property escapes.

They can be used to match letters, numbers, symbols, punctuation, spaces, and so on. For a more exhaustive list of general categories, refer to the Unicode specification.

js
// finding all the letters of a text
const story = "It's the Cheshire Cat: now I shall have somebody to talk to.";

// Most explicit form
story.match(/\p{General_Category=Letter}/gu);

// It is not mandatory to use the property name for General categories
story.match(/\p{Letter}/gu);

// This is equivalent (short alias):
story.match(/\p{L}/gu);

// This is also equivalent (alternation over all the subcategories using short aliases)
story.match(/\p{Lu}|\p{Ll}|\p{Lt}|\p{Lm}|\p{Lo}/gu);
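
Other general categories work the same way. The sketch below counts characters per top-level category; `countMatches` and `sample` are hypothetical names used only here.

```javascript
// Hypothetical helper: number of matches of a global regexp in text.
function countMatches(text, regexp) {
  return [...text.matchAll(regexp)].length;
}

const sample = "Room 42, 3rd floor: \u20AC5!"; // "Room 42, 3rd floor: €5!"

const counts = {
  letters: countMatches(sample, /\p{L}/gu),
  numbers: countMatches(sample, /\p{N}/gu),
  punctuation: countMatches(sample, /\p{P}/gu),
  symbols: countMatches(sample, /\p{S}/gu),
  spaces: countMatches(sample, /\p{Zs}/gu),
};

console.log(counts);
// { letters: 11, numbers: 4, punctuation: 3, symbols: 1, spaces: 4 }
```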

Scripts and script extensions

Some languages use different scripts for their writing system. For instance, English and Spanish are written using the Latin script, while Arabic and Russian are written with other scripts (Arabic and Cyrillic, respectively). The Script and Script_Extensions Unicode properties allow regular expressions to match characters according to the script they are mainly used with (Script) or according to the set of scripts they belong to (Script_Extensions).

For example, A belongs to the Latin script and ε to the Greek script.

js
const mixedCharacters = "aεЛ";

// Using the canonical "long" name of the script
mixedCharacters.match(/\p{Script=Latin}/u); // a

// Using a short alias (ISO 15924 code) for the script
mixedCharacters.match(/\p{Script=Grek}/u); // ε

// Using the short name sc for the Script property
mixedCharacters.match(/\p{sc=Cyrillic}/u); // Л
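
Building on the example above, here is a sketch that pulls out runs of characters by script. `wordsInScript` is a hypothetical helper, and it assumes the `script` argument is a trusted, valid script name, because it is interpolated into the pattern.

```javascript
// "Alice Алиса ΑΘΗΝΑ", written with escapes to avoid look-alike characters.
const text =
  "Alice \u0410\u043B\u0438\u0441\u0430 \u0391\u0398\u0397\u039D\u0391";

// Hypothetical helper: all maximal runs of characters in the given script.
function wordsInScript(input, script) {
  const regexp = new RegExp(`\\p{Script=${script}}+`, "gu");
  return input.match(regexp) ?? [];
}

console.log(wordsInScript(text, "Latin")); // ["Alice"]
console.log(wordsInScript(text, "Cyrillic")); // ["Алиса"]
console.log(wordsInScript(text, "Greek")); // ["ΑΘΗΝΑ"]
console.log(wordsInScript(text, "Hebrew")); // []
```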

For more details, refer to the Unicode specification, the Scripts table in the ECMAScript specification, and the ISO 15924 list of script codes.

If a character is used in a limited set of scripts, the Script property only matches the "predominant" script. To match characters based on a "non-predominant" script, use the Script_Extensions property (scx for short).

js
// ٢ is the digit 2 in Arabic-Indic notation
// while it is predominantly written within the Arabic script
// it can also be written in the Thaana script

"٢".match(/\p{Script=Thaana}/u);
// null as Thaana is not the predominant script

"٢".match(/\p{Script_Extensions=Thaana}/u);
// ["٢", index: 0, input: "٢", groups: undefined]
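
The same comparison can be written with the short aliases. This sketch assumes `sc` and `scx` as the short property names and the ISO 15924 codes `Thaa` and `Arab` as value aliases; the variable name is illustrative.

```javascript
const arabicIndicTwo = "\u0662"; // ٢

// Script only considers the predominant script.
console.log(/\p{sc=Thaa}/u.test(arabicIndicTwo)); // false

// Script_Extensions considers every script the character is used with.
console.log(/\p{scx=Thaa}/u.test(arabicIndicTwo)); // true
console.log(/\p{scx=Arab}/u.test(arabicIndicTwo)); // true
```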

Unicode property escapes vs. character classes

With JavaScript regular expressions, it is also possible to use character classes, especially \w or \d, to match letters or digits. However, these forms only match ASCII characters: \w matches a to z, A to Z, 0 to 9, and the underscore, and \d matches 0 to 9. Working with non-Latin text using these forms can be clumsy.

Unicode property escape categories encompass many more characters, and \p{Letter} or \p{Number} work for any script.

js
// Trying to use ranges to avoid \w limitations:

const nonEnglishText = "Приключения Алисы в Стране чудес";
const regexpBMPWord = /([\u0000-\u0019\u0021-\uFFFF])+/gu;
// BMP goes through U+0000 to U+FFFF but space is U+0020

console.table(nonEnglishText.match(regexpBMPWord));

// Using Unicode property escapes instead
const regexpUPE = /\p{L}+/gu;
console.table(nonEnglishText.match(regexpUPE));
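
A more direct side-by-side comparison, as a small sketch. The sample strings and variable names are chosen only for illustration.

```javascript
const mixedText = "\u0410\u043B\u0438\u0441\u0430 42"; // "Алиса 42"

// \w only covers ASCII letters, digits, and the underscore.
console.log(mixedText.match(/\w+/g)); // ["42"]

// A property-based "word" class covers letters and numbers in any script.
console.log(mixedText.match(/[\p{L}\p{N}_]+/gu)); // ["Алиса", "42"]

// \d only covers 0 to 9; \p{Nd} covers decimal digits in any script.
const arabicIndicDigit = "\u0662"; // ٢
console.log(/\d/.test(arabicIndicDigit)); // false
console.log(/\p{Nd}/u.test(arabicIndicDigit)); // true
```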

Matching prices

The following example matches prices in a string:

js
function getPrices(str) {
  // Sc stands for "currency symbol"
  return [...str.matchAll(/\p{Sc}\s*[\d.,]+/gu)].map((match) => match[0]);
}

const str = `California rolls $6.99
Crunchy rolls $8.49
Shrimp tempura $10.99`;
console.log(getPrices(str)); // ["$6.99", "$8.49", "$10.99"]

const str2 = `US store $19.99
Europe store €18.99
Japan store ¥2000`;
console.log(getPrices(str2)); // ["$19.99", "€18.99", "¥2000"]
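
One possible variation, sketched here with named capture groups, separates the currency symbol from the amount. `parsePrices` and `stores` are hypothetical names, and the amount is kept as a string because its number format is not specified.

```javascript
// Hypothetical variant of getPrices that splits symbol and amount.
function parsePrices(input) {
  const regexp = /(?<symbol>\p{Sc})\s*(?<amount>[\d.,]+)/gu;
  return [...input.matchAll(regexp)].map(({ groups }) => ({
    symbol: groups.symbol,
    amount: groups.amount,
  }));
}

const stores = "US store $19.99\nEurope store \u20AC18.99\nJapan store \u00A52000";
console.log(parsePrices(stores));
// [
//   { symbol: "$", amount: "19.99" },
//   { symbol: "€", amount: "18.99" },
//   { symbol: "¥", amount: "2000" }
// ]
```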

Matching strings

With the v flag, \p{…} can match strings that are potentially longer than one character by using a property of strings:

js
const flag = "🇺🇳";
console.log(flag.length); // 4 (two code points, each stored as a surrogate pair)
console.log(/\p{RGI_Emoji_Flag_Sequence}/v.exec(flag)); // [ '🇺🇳' ]

However, you can't use \P to match "a string that does not have a property", because it's unclear how many characters should be consumed.

js
/\P{RGI_Emoji_Flag_Sequence}/v; // SyntaxError: Invalid regular expression: Invalid property name

Specifications

ECMAScript Language Specification, # prod-CharacterClassEscape

Browser compatibility

See also