A seemingly simple requirement, but the `[\u4e00-\u9fa5]` range that circulates widely on the internet actually has some problems.
## How to Match All Characters Related to Chinese Characters Using Regular Expressions
The block ranges listed in the History section of Wikipedia's Unicode page show that the `[\u4e00-\u9fa5]` range circulating on the Chinese internet only covers characters in the CJK Unified Ideographs block, which in fact accounts for only about one-fifth of all Chinese characters. Although the CJK Unified Ideographs blocks are partitioned roughly by usage frequency, if you are building professional tools for language and writing, it is best to cover all related characters comprehensively (a small demonstration follows the table below).
Range | Name | Number of Characters |
---|---|---|
U+4E00 - U+9FFF | CJK Unified Ideographs | 20,992 |
U+3400 - U+4DBF | CJK Unified Ideographs Extension A | 6,592 |
U+20000 - U+2A6DF | CJK Unified Ideographs Extension B | 42,720 |
U+2A700 - U+2B738 | CJK Unified Ideographs Extension C | 4,154 |
U+2B740 - U+2B81D | CJK Unified Ideographs Extension D | 222 |
U+2B820 - U+2CEA1 | CJK Unified Ideographs Extension E | 5,762 |
U+2CEB0 - U+2EBE0 | CJK Unified Ideographs Extension F | 7,473 |
U+30000 - U+3134A | CJK Unified Ideographs Extension G | 4,939 |
U+31350 - U+323AF | CJK Unified Ideographs Extension H | 4,192 |
U+2EBF0 - U+2EE5F | CJK Unified Ideographs Extension I | 622 |
Total | | 97,668 |
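To make the gap concrete, here is a minimal sketch (the sample characters and the `basic_only` name are only my illustration): a character from the basic block matches the popular pattern, while one from Extension B does not.

```python
import re

# The widely circulated pattern that only covers the basic block
basic_only = re.compile(r"[\u4e00-\u9fa5]")

print(bool(basic_only.search("汉")))           # True:  U+6C49 is inside CJK Unified Ideographs
print(bool(basic_only.search("\U00020000")))  # False: U+20000 (Extension B) is not covered
```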
When you run into Unicode-related questions, the best approach is to consult the official documentation provided by Unicode, the Unicode 15.1 Character Code Charts. It shows that, beyond the CJK Unified Ideographs blocks above, Chinese characters also involve related symbols (radicals, phonetic symbols, pinyin), the private use areas, and so on.
Here is an example implemented in Python:
"""Match characters related to Chinese characters"""
import re
# noinspection RegExpDuplicateCharacterInClass
ideographs_reg = re.compile(
r"""(?P<cjk_unified_ideographs>[\u4E00-\u9FFF])|
(?P<extension_a>[\u3400-\u4DBF])|
(?P<extension_b>[\u20000-\u2A6DF])|
(?P<extension_c>[\u2A700-\u2B738])|
(?P<extension_d>[\u2B740-\u2B81D])|
(?P<extension_e>[\u2B820-\u2CEA1])|
(?P<extension_f>[\u2CEB0-\u2EBE0])|
(?P<extension_g>[\u30000-\u23134A])|
(?P<extension_h>[\u31350-\u323AF])|
(?P<extension_i>[\u2EBF0-\u2EE5F])|
(?P<compatibility_ideographs>[\uF900-\uFAFF])| # Compatibility area: ([\x{F900}-\x{FAD9}])
(?P<compatibility_ideographs_supplement>[\u2F800-\u2FA1F])| # Compatibility extension area: ([\x{2F800}-\x{2FA1D}])
(?P<kangxi_radicals>[\u2F00-\u2FDF]) | # Kangxi radicals: ([\x{2F00}-\x{2FD5}])
(?P<radicals_supplement>[\u2E80-\u2EFF]) | # Radical extension: ([\u2E80-\u2EF3])
(?P<cjk_strokes>[\u31c0-\u31ef]) | # Chinese character strokes: ([\u31C0-\u31E3])
(?P<ideographic_description_characters>[\u2FF0-\u2FFF]) # Chinese character structure: ([\u2FF0-\u2FFB])
(?P<bopomofo>[\u3100-\u312F])| # Chinese phonetic symbols: ([\u3105-\u312F])
(?P<bopomofo_extend>[\u31A0-\u31BF])| # Phonetic extension: ([\u31A0-\u31BA])
(?P<private_use_area>[\uE000-\uF8FF])| # Private area: ([\uE000-\uF8FF])
(?P<supplementary_private_use_area_a>[\uF0000-\uFFFFF])| # Private PUA-A: ([\uF0000-\uFFFFF])
(?P<supplementary_private_use_area_b>[\u100000-\u10FFFD])| # Private PUA-B: ([\u100000-\u10FFFF])
"""
)
def match_chinese_ideographs(input_text: str) -> bool:
"""Match if it contains characters related to Chinese characters
Args:
input_text (str): The string to be matched
Returns:
Returns True if it contains characters related to Chinese characters
"""
result = False
for char in input_text:
if re.search(ideographs_reg, char) is not None:
print(re.match(ideographs_reg, char).groupdict())
result = True
return result
match_chinese_ideographs("食物")
## Supplement
Thanks to amob for the suggestion:
> I just saw your blog post about regex matching Chinese characters, and I have some hands-on experience with this (I have dealt with dictionaries containing many rare characters). Currently, `\p{han}` is still based on Unicode 13.0, so many newer Chinese characters cannot be matched. I have not used `\p{Unified_Ideograph}/u`, but I suspect it also misses some special cases.
According to guidance from jcz777, an expert user on that site, the ideal set of code points to match is as follows:
```
Basic area: ([\x{3007}\x{4e00}-\x{9fff}])
Area A: ([\x{3400}-\x{4DBF}])
Area B: ([\x{20000}-\x{2A6DF}])
Area C: ([\x{2A700}-\x{2B73F}])
Area D: ([\x{2B740}-\x{2B81F}])
Area E: ([\x{2B820}-\x{2CEA1}])
Area F: ([\x{2CEB0}-\x{2EBE0}])
Area G: ([\x{30000}-\x{3134A}])
Area H: ([\x{31350}-\x{323AF}])
Area I: ([\x{2EBF0}-\x{2EE5F}])
Compatibility area: ([\x{F900}-\x{FAD9}])
Compatibility extension area: ([\x{2F800}-\x{2FA1D}])
Kangxi radicals: ([\x{2F00}-\x{2FD5}])
Chinese character strokes: ([\x{31C0}-\x{31E3}])
Chinese character structure: ([\x{2FF0}-\x{2FFB}])
Chinese phonetic symbols: ([\x{3105}-\x{312F}])
Phonetic extension: ([\x{31A0}-\x{31BA}])
Radical extension: ([\x{2E80}-\x{2EF3}])
Private area: ([\x{E000}-\x{F8FF}])
Private PUA-A: ([\x{F0000}-\x{FFFFF}])
Private PUA-B: ([\x{100000}-\x{10FFFF}])
```
In addition, include these few individual characters (a Python translation of the full set follows below):

- 𝍦、𝍳、𝍳、𝍴、𝍵
- 〡、〢、〣、〤、〥、〦、〧、〨、〩、〸
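As a rough sketch of how these recommendations could be carried over to Python (this translation of the `\x{...}` ranges into `\u`/`\U` escapes is my own, so verify the boundaries before relying on it), everything can be folded into a single character class, with the extra characters above appended literally:

```python
import re

# The ranges recommended above, rewritten with Python escapes
# (\U… is required for code points beyond U+FFFF).
ideal_han = re.compile(
    "["
    "\u3007\u4e00-\u9fff"            # Basic area, plus U+3007 (〇)
    "\u3400-\u4dbf"                  # Area A
    "\U00020000-\U0002a6df"          # Area B
    "\U0002a700-\U0002b73f"          # Area C
    "\U0002b740-\U0002b81f"          # Area D
    "\U0002b820-\U0002cea1"          # Area E
    "\U0002ceb0-\U0002ebe0"          # Area F
    "\U00030000-\U0003134a"          # Area G
    "\U00031350-\U000323af"          # Area H
    "\U0002ebf0-\U0002ee5f"          # Area I
    "\uf900-\ufad9"                  # Compatibility area
    "\U0002f800-\U0002fa1d"          # Compatibility extension area
    "\u2f00-\u2fd5"                  # Kangxi radicals
    "\u31c0-\u31e3"                  # Chinese character strokes
    "\u2ff0-\u2ffb"                  # Chinese character structure
    "\u3105-\u312f"                  # Chinese phonetic symbols
    "\u31a0-\u31ba"                  # Phonetic extension
    "\u2e80-\u2ef3"                  # Radical extension
    "\ue000-\uf8ff"                  # Private area
    "\U000f0000-\U000fffff"          # Private PUA-A
    "\U00100000-\U0010ffff"          # Private PUA-B
    "〡〢〣〤〥〦〧〨〩〸"              # Extra characters listed above
    "𝍦𝍳𝍴𝍵"                          # Extra characters listed above
    "]"
)

print(bool(ideal_han.search("〇")))  # True: U+3007 is covered here but not by [\u4e00-\u9fa5]
```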
## References
- JavaScript Regular Expressions for Matching Chinese Characters: Python does not have the features mentioned in this article, such as `/\p{Unified_Ideograph}/u` and `/\p{Script=Han}/u`, and I feel there is value in writing a third-party library for this (shrug).
- Python: Building Complex Regular Expressions in Combination: introduces a method for constructing regular expressions that are easy to modify and debug in practical development scenarios (a small sketch of the idea follows below).
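For example, one way to apply that idea to the patterns in this post (a minimal sketch; `HAN_BLOCKS` and `han_reg` are hypothetical names, and only a few blocks are shown) is to keep each block as a named fragment and join them programmatically:

```python
import re

# Each block kept as a separate, named fragment (abbreviated subset for illustration)
HAN_BLOCKS = {
    "cjk_unified_ideographs": "\u4e00-\u9fff",
    "extension_a": "\u3400-\u4dbf",
    "extension_b": "\U00020000-\U0002a6df",
    "bopomofo": "\u3105-\u312f",
}

# Join the fragments into one alternation of named groups, so blocks can be
# added, removed, or tested individually without editing a single huge pattern
pattern = "|".join(f"(?P<{name}>[{chars}])" for name, chars in HAN_BLOCKS.items())
han_reg = re.compile(pattern)

match = han_reg.search("ㄅ")
print(match.lastgroup if match else None)  # bopomofo
```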