String.prototype.normalize()

The normalize() method of String values returns the Unicode Normalization Form of this string.

Syntax

js
normalize()
normalize(form)

Parameters

form Optional

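One of "NFC", "NFD", "NFKC", or "NFKD", specifying the Unicode Normalization Form. If omitted or undefined, "NFC" is used.

These values have the following meanings:

"NFC"

Canonical Decomposition, followed by Canonical Composition.

"NFD"

Canonical Decomposition.

"NFKC"

Compatibility Decomposition, followed by Canonical Composition.

"NFKD"

Compatibility Decomposition.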
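
As a quick check of the default described above, the following sketch compares a call with no argument, a call with undefined, and an explicit "NFC" call. The variable names are illustrative only.

```javascript
// "n" followed by a combining tilde (a decomposed "ñ").
const decomposed = "\u006E\u0303";

// Omitting the argument or passing undefined should both behave like "NFC".
const byDefault = decomposed.normalize();
const byUndefined = decomposed.normalize(undefined);
const byNfc = decomposed.normalize("NFC");

console.log(byDefault === byNfc); // true
console.log(byUndefined === byNfc); // true
console.log(byNfc === "\u00F1"); // true
```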

Return value

A string containing the Unicode Normalization Form of the given string.

Exceptions

RangeError

Thrown if form isn't one of the values specified above.
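
The error can be observed with a small wrapper. The helper name isValidForm below is hypothetical and not part of any API; it is a minimal sketch that relies only on normalize() throwing a RangeError for unrecognized values. Note that the form names are case-sensitive.

```javascript
// Hypothetical helper: reports whether normalize() accepts the given form.
function isValidForm(form) {
  try {
    "".normalize(form);
    return true;
  } catch (err) {
    // An unrecognized form value raises RangeError; rethrow anything else.
    if (err instanceof RangeError) return false;
    throw err;
  }
}

console.log(isValidForm("NFC")); // true
console.log(isValidForm("nfc")); // false
console.log(isValidForm("NFX")); // false
```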

Description

Unicode assigns a unique numerical value, called a code point, to each character. For example, the code point for "A" is U+0041. However, sometimes more than one code point, or sequence of code points, can represent the same abstract character. The character "ñ", for example, can be represented by either of:

  • The single code point U+00F1.
  • The code point for "n" (U+006E) followed by the code point for the combining tilde (U+0303).
js
const string1 = "\u00F1";
const string2 = "\u006E\u0303";

console.log(string1); // ñ
console.log(string2); // ñ

However, since the code points are different, string comparison will not treat them as equal. And since the number of code points in each version is different, they even have different lengths.

js
const string1 = "\u00F1"; // ñ
const string2 = "\u006E\u0303"; // ñ

console.log(string1 === string2); // false
console.log(string1.length); // 1
console.log(string2.length); // 2

The normalize() method helps solve this problem by converting a string into a normalized form that is common to all sequences of code points representing the same characters. There are two main normalization forms, one based on canonical equivalence and the other based on compatibility.
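
One way to apply this is to normalize both operands before comparing them. The helper equalsNormalized below is a hypothetical name used for illustration; it is a sketch that assumes the default form "NFC" is suitable for the comparison.

```javascript
// Hypothetical helper: compares two strings after normalizing both
// to the same form (defaults to "NFC").
function equalsNormalized(a, b, form = "NFC") {
  return a.normalize(form) === b.normalize(form);
}

console.log("\u00F1" === "\u006E\u0303"); // false
console.log(equalsNormalized("\u00F1", "\u006E\u0303")); // true
console.log(equalsNormalized("\u00F1", "n")); // false
```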

Canonical equivalence normalization

In Unicode, two sequences of code points have canonical equivalence if they represent the same abstract characters, and should always have the same visual appearance and behavior (for example, they should always be sorted in the same way).

You can call normalize() with the "NFD" or "NFC" argument to produce a form of the string that will be the same for all canonically equivalent strings. In the example below we normalize two representations of the character "ñ":

js
let string1 = "\u00F1"; // ñ
let string2 = "\u006E\u0303"; // ñ

string1 = string1.normalize("NFD");
string2 = string2.normalize("NFD");

console.log(string1 === string2); // true
console.log(string1.length); // 2
console.log(string2.length); // 2
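
Because canonically equivalent strings become identical after normalization, the same idea can be used to remove duplicates from a list. The following is a minimal sketch; the function uniqueNormalized and the sample names are assumptions made for illustration.

```javascript
// Hypothetical helper: removes entries that are canonically equivalent
// by normalizing every string to NFC before collecting them in a Set.
function uniqueNormalized(strings) {
  return [...new Set(strings.map((s) => s.normalize("NFC")))];
}

// The first two entries are "Peña" in composed and decomposed form.
const names = ["Pe\u00F1a", "Pen\u0303a", "Pena"];
const unique = uniqueNormalized(names);

console.log(new Set(names).size); // 3
console.log(unique.length); // 2
```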

Composed and decomposed forms

Note that the length of the normalized form under "NFD" is 2. That's because "NFD" gives you the decomposed version of the canonical form, in which single code points are split into multiple combining ones. The decomposed canonical form for "ñ" is "\u006E\u0303".

You can specify "NFC" to get the composed canonical form, in which multiple code points are replaced with single code points where possible. The composed canonical form for "ñ" is "\u00F1":

js
let string1 = "\u00F1"; // ñ
let string2 = "\u006E\u0303"; // ñ

string1 = string1.normalize("NFC");
string2 = string2.normalize("NFC");

console.log(string1 === string2); // true
console.log(string1.length); // 1
console.log(string2.length); // 1
console.log(string2.codePointAt(0).toString(16)); // f1
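
To see what each form contains, it can help to list the code points directly. The helper toCodePoints below is a hypothetical name, offered as a sketch rather than a standard utility.

```javascript
// Hypothetical helper: lists the code points of a string in U+XXXX notation.
// Array.from iterates by code point, not by UTF-16 code unit.
function toCodePoints(str) {
  return Array.from(
    str,
    (ch) => "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0"),
  );
}

const composed = "\u00F1".normalize("NFC");
const decomposedForm = "\u00F1".normalize("NFD");

console.log(toCodePoints(composed)); // ["U+00F1"]
console.log(toCodePoints(decomposedForm)); // ["U+006E", "U+0303"]
```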

Compatibility normalization

In Unicode, two sequences of code points are compatible if they represent the same abstract characters, and should be treated alike in some (but not necessarily all) applications.

All canonically equivalent sequences are also compatible, but not vice versa.

For example:

  • The code point U+FB00 represents the ligature "ﬀ". It is compatible with two consecutive U+0066 code points ("ff").
  • The code point U+24B9 represents the symbol "Ⓓ". It is compatible with the U+0044 code point ("D").

In some respects (such as sorting) they should be treated as equivalent, and in others (such as visual appearance) they should not, so they are not canonically equivalent.

You can call normalize() with the "NFKD" or "NFKC" argument to produce a form of the string that will be the same for all compatible strings:

js
let string1 = "\uFB00";
let string2 = "\u0066\u0066";

console.log(string1); // ﬀ
console.log(string2); // ff
console.log(string1 === string2); // false
console.log(string1.length); // 1
console.log(string2.length); // 2

string1 = string1.normalize("NFKD");
string2 = string2.normalize("NFKD");

console.log(string1); // ff <- visual appearance changed
console.log(string2); // ff
console.log(string1 === string2); // true
console.log(string1.length); // 2
console.log(string2.length); // 2
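
The difference between canonical and compatibility forms can be checked on the two characters mentioned earlier. This is a small sketch with illustrative variable names: the canonical forms leave these characters unchanged, while the compatibility forms replace them.

```javascript
const circledD = "\u24B9"; // Ⓓ
const ligature = "\uFB00"; // ﬀ

// Canonical forms leave compatibility characters untouched.
console.log(circledD.normalize("NFC") === circledD); // true
console.log(ligature.normalize("NFD") === ligature); // true

// Compatibility forms replace them with their plain counterparts.
console.log(circledD.normalize("NFKC")); // D
console.log(ligature.normalize("NFKC")); // ff
```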

When applying compatibility normalization, it's important to consider what you intend to do with the strings, since the normalized form may not be appropriate for all applications. In the example above, the normalization is appropriate for search, because it enables a user to find the string by searching for "f". But it may not be appropriate for display, because the visual representation is different.

As with canonical normalization, you can ask for decomposed or composed compatible forms by passing "NFKD" or "NFKC", respectively.
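
For the search case, one possible approach is to normalize only for matching and keep the original string for display. The helper includesCompat and the sample text below are assumptions made for illustration, not part of the method's API.

```javascript
// Hypothetical helper: substring search that ignores compatibility
// differences by normalizing both sides to NFKC before matching.
function includesCompat(haystack, needle) {
  return haystack.normalize("NFKC").includes(needle.normalize("NFKC"));
}

const title = "e\uFB00ort"; // displayed as "eﬀort"

console.log(title.includes("effort")); // false
console.log(includesCompat(title, "effort")); // true
console.log(title.length); // 5 (the original string is left unchanged)
```

Normalizing inside the helper, rather than overwriting the stored string, keeps the ligature available for display.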

Examples

Using normalize()

js
// Initial string

// U+1E9B: LATIN SMALL LETTER LONG S WITH DOT ABOVE
// U+0323: COMBINING DOT BELOW
const str = "\u1E9B\u0323";

// Canonically-composed form (NFC)

// U+1E9B: LATIN SMALL LETTER LONG S WITH DOT ABOVE
// U+0323: COMBINING DOT BELOW
str.normalize("NFC"); // '\u1E9B\u0323'
str.normalize(); // same as above

// Canonically-decomposed form (NFD)

// U+017F: LATIN SMALL LETTER LONG S
// U+0323: COMBINING DOT BELOW
// U+0307: COMBINING DOT ABOVE
str.normalize("NFD"); // '\u017F\u0323\u0307'

// Compatibly-composed (NFKC)

// U+1E69: LATIN SMALL LETTER S WITH DOT BELOW AND DOT ABOVE
str.normalize("NFKC"); // '\u1E69'

// Compatibly-decomposed (NFKD)

// U+0073: LATIN SMALL LETTER S
// U+0323: COMBINING DOT BELOW
// U+0307: COMBINING DOT ABOVE
str.normalize("NFKD"); // '\u0073\u0323\u0307'

Specifications

Specification
ECMAScript Language Specification
# sec-string.prototype.normalize

Browser compatibility

See also