How to use regular expressions to match all characters related to Chinese characters

This looks like a simple requirement, but the [\u4e00-\u9fa5] pattern widely circulated on the internet actually has some problems.

The code chart ranges listed in the History section of the Unicode article on Wikipedia show that [\u4e00-\u9fa5], the pattern widely circulated on the Chinese internet, covers only characters in the CJK Unified Ideographs block, and that block holds only about one-fifth of all Chinese characters. Although the CJK Unified Ideographs blocks are partitioned roughly by usage frequency, if you are developing professional tools for language and character processing, it is best to cover all related characters.
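
A minimal Python sketch of that gap, using the standard re module. The sample characters are chosen purely for illustration:

```python
import re

# The narrow class that is widely copied around.
narrow = re.compile(r"[\u4e00-\u9fa5]")

# Sample characters (illustrative) and the block each one belongs to.
samples = {
    "\u4e2d": "CJK Unified Ideographs, inside the narrow class",
    "\u9fff": "CJK Unified Ideographs, but past U+9FA5",
    "\u3400": "Extension A",
    "\U00020000": "Extension B",
}

for char, block in samples.items():
    matched = narrow.fullmatch(char) is not None
    print(f"U+{ord(char):04X} ({block}): matched={matched}")
```

Only the first sample is matched; the other three are ideographs that the narrow class silently skips.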

Range	Name	Number of Characters
U+4E00 - U+9FFF	CJK Unified Ideographs	20,992
U+3400 - U+4DBF	CJK Unified Ideographs Extension A	6,592
U+20000 - U+2A6DF	CJK Unified Ideographs Extension B	42,720
U+2A700 - U+2B738	CJK Unified Ideographs Extension C	4,154
U+2B740 - U+2B81D	CJK Unified Ideographs Extension D	222
U+2B820 - U+2CEA1	CJK Unified Ideographs Extension E	5,762
U+2CEB0 - U+2EBE0	CJK Unified Ideographs Extension F	7,473
U+30000 - U+3134A	CJK Unified Ideographs Extension G	4,939
U+31350 - U+323AF	CJK Unified Ideographs Extension H	4,192
U+2EBF0 - U+2EE5F	CJK Unified Ideographs Extension I	622
Total		97,668
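
The "about one-fifth" figure can be checked with a little arithmetic. This sketch simply copies the character counts from the table above; the dictionary name is made up for illustration:

```python
# Character counts copied from the table above.
block_sizes = {
    "CJK Unified Ideographs": 20_992,
    "Extension A": 6_592,
    "Extension B": 42_720,
    "Extension C": 4_154,
    "Extension D": 222,
    "Extension E": 5_762,
    "Extension F": 7_473,
    "Extension G": 4_939,
    "Extension H": 4_192,
    "Extension I": 622,
}

total = sum(block_sizes.values())
basic_share = block_sizes["CJK Unified Ideographs"] / total

# Number of code points actually covered by [\u4e00-\u9fa5].
narrow_count = 0x9FA5 - 0x4E00 + 1
narrow_share = narrow_count / total

print(total)                  # 97668
print(f"{basic_share:.1%}")   # 21.5%
print(narrow_count)           # 20902
print(f"{narrow_share:.1%}")  # 21.4%
```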

When you run into Unicode-related questions, the best approach is to consult the official documentation, the Unicode 15.1 Character Code Charts. According to the charts, characters related to Chinese characters include, in addition to the CJK Unified Ideographs blocks above, related symbols (radicals, phonetic symbols, pinyin), private use areas, and so on.
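
To see which kind of character a given code point is, Python's standard unicodedata module can report its official name. This is only a rough inspection aid: the helper name and sample characters are illustrative, and the names available depend on the Unicode version bundled with your Python build.

```python
import unicodedata


def describe(char: str) -> str:
    """Return 'U+XXXX NAME' for one character (hypothetical helper)."""
    # Private use and unassigned code points have no name, so pass a default.
    name = unicodedata.name(char, "<no name: private use or unassigned>")
    return f"U+{ord(char):04X} {name}"


# An ideograph, a Kangxi radical, a Bopomofo letter, a private use character.
for sample in ["\u4e2d", "\u2f00", "\u3105", "\ue000"]:
    print(describe(sample))
```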

Here is an example implemented in Python:

"""Match characters related to Chinese characters"""
import re

# Code points above U+FFFF need the eight-digit \UXXXXXXXX escape, because
# \u consumes exactly four hex digits. re.VERBOSE lets the pattern span
# several lines and carry comments.
ideographs_reg = re.compile(
    r"""(?P<cjk_unified_ideographs>[\u4E00-\u9FFF])|
        (?P<extension_a>[\u3400-\u4DBF])|
        (?P<extension_b>[\U00020000-\U0002A6DF])|
        (?P<extension_c>[\U0002A700-\U0002B738])|
        (?P<extension_d>[\U0002B740-\U0002B81D])|
        (?P<extension_e>[\U0002B820-\U0002CEA1])|
        (?P<extension_f>[\U0002CEB0-\U0002EBE0])|
        (?P<extension_g>[\U00030000-\U0003134A])|
        (?P<extension_h>[\U00031350-\U000323AF])|
        (?P<extension_i>[\U0002EBF0-\U0002EE5F])|
        (?P<compatibility_ideographs>[\uF900-\uFAFF])|  # Compatibility area (assigned: U+F900-U+FAD9)
        (?P<compatibility_ideographs_supplement>[\U0002F800-\U0002FA1F])|  # Compatibility extension area (assigned: U+2F800-U+2FA1D)
        (?P<kangxi_radicals>[\u2F00-\u2FDF])|  # Kangxi radicals (assigned: U+2F00-U+2FD5)
        (?P<radicals_supplement>[\u2E80-\u2EFF])|  # Radical extension (assigned: U+2E80-U+2EF3)
        (?P<cjk_strokes>[\u31C0-\u31EF])|  # Chinese character strokes (assigned: U+31C0-U+31E3)
        (?P<ideographic_description_characters>[\u2FF0-\u2FFF])|  # Chinese character structure (U+2FF0-U+2FFB)
        (?P<bopomofo>[\u3100-\u312F])|  # Chinese phonetic symbols (assigned: U+3105-U+312F)
        (?P<bopomofo_extend>[\u31A0-\u31BF])|  # Phonetic extension (U+31A0-U+31BA)
        (?P<private_use_area>[\uE000-\uF8FF])|  # Private area
        (?P<supplementary_private_use_area_a>[\U000F0000-\U000FFFFF])|  # Private PUA-A
        (?P<supplementary_private_use_area_b>[\U00100000-\U0010FFFD])  # Private PUA-B
    """,
    re.VERBOSE,
)


def match_chinese_ideographs(input_text: str) -> bool:
    """Check whether the text contains characters related to Chinese characters

    Args:
        input_text (str): The string to be matched

    Returns:
        Returns True if it contains characters related to Chinese characters
    """
    result = False
    for char in input_text:
        match = ideographs_reg.match(char)
        if match is not None:
            # lastgroup is the name of the alternative (block) that matched
            print(char, match.lastgroup)
            result = True
    return result


match_chinese_ideographs("食物")
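
One detail that is easy to get wrong when writing such ranges for Python's re module: the \u escape consumes exactly four hex digits, so any code point above U+FFFF has to be written with the eight-digit \U form. A small sketch of the difference, with variable names chosen for illustration:

```python
import re

# "\u" takes exactly four hex digits, so this class is parsed as
# U+2000, the range "0"..U+2A6D, and a literal "F".
broken = re.compile(r"[\u20000-\u2A6DF]")

# "\U" takes eight hex digits and can express supplementary-plane code points.
fixed = re.compile(r"[\U00020000-\U0002A6DF]")

ext_b_char = "\U00020000"  # first character of Extension B

print(broken.fullmatch(ext_b_char) is not None)  # False
print(broken.fullmatch("A") is not None)         # True (unintended)
print(fixed.fullmatch(ext_b_char) is not None)   # True
print(fixed.fullmatch("A") is not None)          # False
```

The broken class compiles without any error, which is what makes this mistake hard to notice.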

Supplement

Thanks to amob for the suggestion:

I just read your blog post about matching Chinese characters with regular expressions, and I have some experience with this myself (I have worked with dictionaries containing many rare characters). Currently, \p{han} is still based on Unicode 13.0, so many newer Chinese characters cannot be matched. I haven't used /\p{Unified_Ideograph}/u, but I expect it cannot match some special cases either.

Following the guidance of jcz777, an expert on the site, the ideal code point ranges to match are as follows:

Basic area: ([\x{3007}\x{4e00}-\x{9fff}])
Area A: ([\x{3400}-\x{4DBF}])
Area B: ([\x{20000}-\x{2A6DF}])
Area C: ([\x{2A700}-\x{2B73F}])
Area D: ([\x{2B740}-\x{2B81F}])
Area E: ([\x{2B820}-\x{2CEA1}])
Area F: ([\x{2CEB0}-\x{2EBE0}])
Area G: ([\x{30000}-\x{3134A}])
Area H: ([\x{31350}-\x{323AF}])
Area I: ([\x{2EBF0}-\x{2EE5F}])
Compatibility area: ([\x{F900}-\x{FAD9}])
Compatibility extension area: ([\x{2F800}-\x{2FA1D}])
Kangxi radicals: ([\x{2F00}-\x{2FD5}])
Chinese character strokes: ([\x{31C0}-\x{31E3}])
Chinese character structure: ([\x{2FF0}-\x{2FFB}])
Chinese phonetic symbols: ([\x{3105}-\x{312F}])
Phonetic extension: ([\x{31A0}-\x{31BA}])
Radical extension: ([\x{2E80}-\x{2EF3}])
Private area: ([\x{E000}-\x{F8FF}])
Private PUA-A: ([\x{F0000}-\x{FFFFF}])
Private PUA-B: ([\x{100000}-\x{10FFFF}])

These few characters should be added as well:
𝍦、𝍳、𝍳、𝍴、𝍵
〡、〢、〣、〤、〥、〦、〧、〨、〩、〸
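
The ranges above are written in \x{...} notation, which Python's re module does not accept. One possible way to reuse them in Python is to keep them as plain integers and generate the character class. This is only a sketch: RANGES and build_char_class are hypothetical names, and the extra characters listed last are left out.

```python
import re

# (start, end) code points copied from the list above.
RANGES = [
    (0x3007, 0x3007),      # ideographic zero, listed with the basic area
    (0x4E00, 0x9FFF),      # basic area
    (0x3400, 0x4DBF),      # area A
    (0x20000, 0x2A6DF),    # area B
    (0x2A700, 0x2B73F),    # area C
    (0x2B740, 0x2B81F),    # area D
    (0x2B820, 0x2CEA1),    # area E
    (0x2CEB0, 0x2EBE0),    # area F
    (0x30000, 0x3134A),    # area G
    (0x31350, 0x323AF),    # area H
    (0x2EBF0, 0x2EE5F),    # area I
    (0xF900, 0xFAD9),      # compatibility area
    (0x2F800, 0x2FA1D),    # compatibility extension area
    (0x2F00, 0x2FD5),      # Kangxi radicals
    (0x31C0, 0x31E3),      # strokes
    (0x2FF0, 0x2FFB),      # character structure
    (0x3105, 0x312F),      # phonetic symbols
    (0x31A0, 0x31BA),      # phonetic extension
    (0x2E80, 0x2EF3),      # radical extension
    (0xE000, 0xF8FF),      # private area
    (0xF0000, 0xFFFFF),    # private PUA-A
    (0x100000, 0x10FFFF),  # private PUA-B
]


def build_char_class(ranges):
    """Turn (start, end) pairs into one regex character class."""
    parts = []
    for start, end in ranges:
        if start == end:
            parts.append(f"\\U{start:08X}")
        else:
            parts.append(f"\\U{start:08X}-\\U{end:08X}")
    return "[" + "".join(parts) + "]"


han_related = re.compile(build_char_class(RANGES))
print(han_related.findall("abc 食物 〇 ㄅ"))  # ['食', '物', '〇', 'ㄅ']
```

Keeping the ranges as data makes it easier to add or drop a block, for example leaving out the private use areas when they are not wanted.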

References

JavaScript Regular Expressions for Matching Chinese Characters: Python's built-in re module does not have the methods mentioned in this article, such as /\p{Unified_Ideograph}/u and /\p{Script=Han}/u, and I feel there is value in writing a third-party library (shrug)

Python: Building Complex Regular Expressions in Combination: Introduces a method for constructing regular expressions that are easy to modify and debug in practical development scenarios.