卿少納言


How to Use Regular Expressions to Match All Characters Related to Chinese Characters

This looks like a simple requirement, but the [\u4e00-\u9fa5] pattern that circulates widely on the internet has some problems.

The code chart ranges listed in the History section of the Unicode page on Wikipedia show that [\u4e00-\u9fa5], so popular on the Chinese internet, covers only characters inside the 【CJK Unified Ideographs】 block, and that block holds only about one-fifth of all Chinese characters. The CJK blocks are roughly ordered by usage frequency, so the basic block handles most everyday text, but if you are building professional tools for language and script work, it is best to cover every related character.
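A quick check makes the gap concrete. This is a minimal sketch using only the standard `re` module; the sample characters are picked for illustration and do not come from the original post.

```python
import re

# The widely circulated pattern: covers only U+4E00..U+9FA5.
narrow = re.compile(r"[\u4e00-\u9fa5]")

# Illustrative samples: one common character and the first code
# points of Extension A and Extension B.
samples = {
    "\u98df": "CJK Unified Ideographs",
    "\u3400": "Extension A",
    "\U00020000": "Extension B",
}

for char, label in samples.items():
    hit = narrow.search(char) is not None
    print(f"U+{ord(char):04X} ({label}): {'matched' if hit else 'missed'}")
```

Only the first sample is matched; the Extension A and Extension B characters fall outside the pattern.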

| Range | Name | Number of Characters |
| --- | --- | --- |
| U+4E00 - U+9FFF | CJK Unified Ideographs | 20,992 |
| U+3400 - U+4DBF | CJK Unified Ideographs Extension A | 6,592 |
| U+20000 - U+2A6DF | CJK Unified Ideographs Extension B | 42,720 |
| U+2A700 - U+2B739 | CJK Unified Ideographs Extension C | 4,154 |
| U+2B740 - U+2B81D | CJK Unified Ideographs Extension D | 222 |
| U+2B820 - U+2CEA1 | CJK Unified Ideographs Extension E | 5,762 |
| U+2CEB0 - U+2EBE0 | CJK Unified Ideographs Extension F | 7,473 |
| U+30000 - U+3134A | CJK Unified Ideographs Extension G | 4,939 |
| U+31350 - U+323AF | CJK Unified Ideographs Extension H | 4,192 |
| U+2EBF0 - U+2EE5F | CJK Unified Ideographs Extension I | 622 |
| Total | | 97,668 |
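Ranges like these can be turned into a regular expression programmatically rather than typed by hand. The sketch below is one possible approach; `CJK_RANGES` and `build_char_class` are hypothetical names introduced here, and the Extension C entry ends at U+2B739, the last code point assigned in that block as of Unicode 15.1.

```python
import re

# (first, last) code points based on the table above.
CJK_RANGES = [
    (0x4E00, 0x9FFF),    # CJK Unified Ideographs
    (0x3400, 0x4DBF),    # Extension A
    (0x20000, 0x2A6DF),  # Extension B
    (0x2A700, 0x2B739),  # Extension C
    (0x2B740, 0x2B81D),  # Extension D
    (0x2B820, 0x2CEA1),  # Extension E
    (0x2CEB0, 0x2EBE0),  # Extension F
    (0x30000, 0x3134A),  # Extension G
    (0x31350, 0x323AF),  # Extension H
    (0x2EBF0, 0x2EE5F),  # Extension I
]


def build_char_class(ranges):
    """Return a character class such as [\\U00004E00-\\U00009FFF...]."""
    # The eight-digit \U escape works for every code point, so the
    # same format string covers both BMP and supplementary ranges.
    parts = (f"\\U{first:08X}-\\U{last:08X}" for first, last in ranges)
    return "[" + "".join(parts) + "]"


han = re.compile(build_char_class(CJK_RANGES))

# "\U00020000" is the first Extension B character.
print(han.findall("abc食物\U00020000xyz"))
```

Keeping the ranges in a list makes it easy to add a block when a new Unicode version ships, without touching the pattern itself.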

When you run into a Unicode question, the best approach is to consult the official Unicode documentation, the Unicode 15.1 Character Code Charts. According to the charts, characters related to Chinese characters include not only the 【CJK Unified Ideographs】 blocks above but also related symbols (radicals, strokes, phonetic symbols), the Private Use Areas, and so on.
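To see why these extra blocks matter, note that a radical can look identical to an ideograph while being a different code point. The following sketch uses the standard `unicodedata` module; the two sample characters are chosen for illustration.

```python
import unicodedata

radical = "\u2F00"    # from the Kangxi Radicals block
ideograph = "\u4E00"  # from the CJK Unified Ideographs block

print(unicodedata.name(radical))    # KANGXI RADICAL ONE
print(unicodedata.name(ideograph))  # CJK UNIFIED IDEOGRAPH-4E00

# Visually the same, but distinct code points.
print(radical == ideograph)         # False

# NFKC normalization folds the radical into the unified ideograph.
print(unicodedata.normalize("NFKC", radical) == ideograph)  # True
```

A pattern limited to the unified ideograph blocks would therefore miss the radical unless the text is normalized first or the radical blocks are included.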

Here is an example implemented in Python:

"""Match characters related to Chinese characters"""
import re

# noinspection RegExpDuplicateCharacterInClass
# Code points above U+FFFF need the eight-digit \U escape,
# because \u reads exactly four hex digits.
ideographs_reg = re.compile(
    r"""(?P<cjk_unified_ideographs>[\u4E00-\u9FFF])|
        (?P<extension_a>[\u3400-\u4DBF])|
        (?P<extension_b>[\U00020000-\U0002A6DF])|
        (?P<extension_c>[\U0002A700-\U0002B739])|
        (?P<extension_d>[\U0002B740-\U0002B81D])|
        (?P<extension_e>[\U0002B820-\U0002CEA1])|
        (?P<extension_f>[\U0002CEB0-\U0002EBE0])|
        (?P<extension_g>[\U00030000-\U0003134A])|
        (?P<extension_h>[\U00031350-\U000323AF])|
        (?P<extension_i>[\U0002EBF0-\U0002EE5F])|
        (?P<compatibility_ideographs>[\uF900-\uFAFF])|  # Compatibility area: ([\x{F900}-\x{FAD9}])
        (?P<compatibility_ideographs_supplement>[\U0002F800-\U0002FA1F])|  # Compatibility extension area: ([\x{2F800}-\x{2FA1D}])
        (?P<kangxi_radicals>[\u2F00-\u2FDF])|  # Kangxi radicals: ([\x{2F00}-\x{2FD5}])
        (?P<radicals_supplement>[\u2E80-\u2EFF])|  # Radical extension: ([\x{2E80}-\x{2EF3}])
        (?P<cjk_strokes>[\u31C0-\u31EF])|  # Chinese character strokes: ([\x{31C0}-\x{31E3}])
        (?P<ideographic_description_characters>[\u2FF0-\u2FFF])|  # Chinese character structure: ([\x{2FF0}-\x{2FFB}])
        (?P<bopomofo>[\u3100-\u312F])|  # Chinese phonetic symbols: ([\x{3105}-\x{312F}])
        (?P<bopomofo_extend>[\u31A0-\u31BF])|  # Phonetic extension: ([\x{31A0}-\x{31BA}])
        (?P<private_use_area>[\uE000-\uF8FF])|  # Private area: ([\x{E000}-\x{F8FF}])
        (?P<supplementary_private_use_area_a>[\U000F0000-\U000FFFFF])|  # Private PUA-A: ([\x{F0000}-\x{FFFFF}])
        (?P<supplementary_private_use_area_b>[\U00100000-\U0010FFFD])  # Private PUA-B: ([\x{100000}-\x{10FFFF}])
    """,
    re.VERBOSE,  # permit the whitespace and "#" comments inside the pattern
)


def match_chinese_ideographs(input_text: str) -> bool:
    """Match if it contains characters related to Chinese characters

    Args:
        input_text (str): The string to be matched

    Returns:
        Returns True if it contains characters related to Chinese characters
    """
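    result = False
    for char in input_text:
        # Use the compiled pattern directly and match each character once.
        match = ideographs_reg.match(char)
        if match is not None:
            # lastgroup is the name of the block that matched this character.
            print(char, match.lastgroup)
            result = True
    return result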

match_chinese_ideographs("食物")
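One Python-specific pitfall is worth spelling out when writing ranges outside the Basic Multilingual Plane: in `re` patterns, `\u` consumes exactly four hex digits, so code points above U+FFFF need the eight-digit `\U` form. The short check below illustrates the difference; the sample inputs are chosen for illustration.

```python
import re

# \u reads exactly four hex digits, so this class is really:
# U+2000, the range "0"..U+2A6D, and the letter "F".
wrong = re.compile(r"[\u20000-\u2A6DF]")

# \U reads eight hex digits, so this is U+20000..U+2A6DF (Extension B).
right = re.compile(r"[\U00020000-\U0002A6DF]")

ext_b_char = "\U00020000"  # first Extension B character

print(wrong.search(ext_b_char) is None)      # True: Extension B is missed
print(wrong.search("A") is not None)         # True: ASCII "A" matches by accident
print(right.search(ext_b_char) is not None)  # True
print(right.search("A") is None)             # True
```

The mistaken form compiles without any error, which is why it is easy to overlook.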

Supplement

Thanks to amob for the suggestion:

I just saw your blog post about matching Chinese characters with regular expressions, and I have some experience with this myself (from working on dictionaries full of rare characters). At the moment, \p{han} is still based on Unicode 13.0, so many newer Chinese characters are not matched. I have not tried /\p{Unified_Ideograph}/u, but I expect it misses some special cases as well.

Following the guidance of jcz777, an expert on the site, the ideal code points to match are as follows:

Basic area: ([\x{3007}\x{4e00}-\x{9fff}])
Area A: ([\x{3400}-\x{4DBF}])
Area B: ([\x{20000}-\x{2A6DF}])
Area C: ([\x{2A700}-\x{2B73F}])
Area D: ([\x{2B740}-\x{2B81F}])
Area E: ([\x{2B820}-\x{2CEA1}])
Area F: ([\x{2CEB0}-\x{2EBE0}])
Area G: ([\x{30000}-\x{3134A}])
Area H: ([\x{31350}-\x{323AF}])
Area I: ([\x{2EBF0}-\x{2EE5F}])
Compatibility area: ([\x{F900}-\x{FAD9}])
Compatibility extension area: ([\x{2F800}-\x{2FA1D}])
Kangxi radicals: ([\x{2F00}-\x{2FD5}])
Chinese character strokes: ([\x{31C0}-\x{31E3}])
Chinese character structure: ([\x{2FF0}-\x{2FFB}])
Chinese phonetic symbols: ([\x{3105}-\x{312F}])
Phonetic extension: ([\x{31A0}-\x{31BA}])
Radical extension: ([\x{2E80}-\x{2EF3}])
Private area: ([\x{E000}-\x{F8FF}])
Private PUA-A: ([\x{F0000}-\x{FFFFF}])
Private PUA-B: ([\x{100000}-\x{10FFFF}])
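The list above uses PCRE-style `\x{...}` escapes, which Python's `re` module does not accept. As a rough sketch, the escapes can be rewritten mechanically into the `\UXXXXXXXX` form; `pcre_to_python` is a hypothetical helper written for this example, not part of any library.

```python
import re


def pcre_to_python(pattern: str) -> str:
    """Rewrite PCRE-style \\x{HEX} escapes as \\UXXXXXXXX for Python's re."""
    return re.sub(
        r"\\x\{([0-9A-Fa-f]{1,6})\}",
        # A function replacement is inserted literally, so the
        # backslash in the result is not reinterpreted by re.sub.
        lambda m: f"\\U{int(m.group(1), 16):08X}",
        pattern,
    )


# Two entries copied from the list above.
basic = re.compile(pcre_to_python(r"([\x{3007}\x{4e00}-\x{9fff}])"))
area_b = re.compile(pcre_to_python(r"([\x{20000}-\x{2A6DF}])"))

print(pcre_to_python(r"([\x{3400}-\x{4DBF}])"))
print(basic.fullmatch("\u3007") is not None)       # U+3007, ideographic zero
print(area_b.fullmatch("\U00020000") is not None)  # first Extension B character
```

The same helper can be applied to every line of the list before joining the results into a single pattern.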

It is also worth adding the following characters:
𝍦、𝍳、𝍳、𝍴、𝍵
〡、〢、〣、〤、〥、〦、〧、〨、〩、〸

References

JavaScript Regular Expressions for Matching Chinese Characters: Python's standard re module has no equivalent of the approaches mentioned in this article, such as /\p{Unified_Ideograph}/u and /\p{Script=Han}/u, so I feel there is value in writing a third-party library (shrug).

Python: Building Complex Regular Expressions in Combination: Introduces a method for constructing regular expressions that are easy to modify and debug in practical development scenarios.
