Abstract: This article analyzes the morphological rules of GoldenDict using the hunspell_ja_JP project files as an example. (Well, this article is quite technical, and I don't really know how to write an abstract, haha.)
Introduction#
This article uses the morphological files from the MrCorn0-0/hunspell_ja_JP: Hunspell morphology dictionary for Japanese used in GoldenDict. (github.com) project (hereinafter referred to as the "original project") as an example, referencing the man 4 hunspell PDF file provided by the Linux hunspell official website (hereinafter referred to as the "manual") to explain the basic morphological rules.
A preliminary statement: I am not very knowledgeable and have only roughly understood some rules of the original project's morphological files. There may be errors, so I hope everyone reads rationally and communicates kindly. Additionally, the original project is only for Japanese; if you have morphological questions for other languages, I may not be able to answer them, so please refer to the manual.
The morphological functionality of GoldenDict primarily uses Hunspell, which was originally a spell-checking tool for Linux, so there are slight differences from what is described in Introducing Hunspell to German Language Students. GoldenDict's morphology only uses two files, with the extensions .dic and .aff; it does not use files with the extensions .morph, .good, .wrong, or .sug. However, the rules for .dic and .aff files are completely consistent with those in the Hunspell manual.
Regarding the functions of these two files, I quote from Introducing Hunspell to German Language Students:
- The .dic dictionary file contains entries similar to the headwords (lexemes) in a printed dictionary.
- The .aff file contains a set of rules for restoring forms, detailing how to convert a complex form with prefixes, suffixes, or compounded with other words into a headword that exists in the dictionary.
Basic Format of dic Files#
The dic file is relatively simple, structured like the following (the initial number indicates the number of entries in the dic file):
450851
あ
亜
亞
吾
我
高い/XA
Some words have a / followed by what looks like gibberish. This notation can be loosely understood as marking the word's part of speech: it indicates that the word uses a transformation rule named XA, which will be explained in detail later. For now, it is enough to know that the dic file can attach rules to a word.
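To make the format concrete, here is a minimal Python sketch of how such a file could be read (my own illustration; the file name ja_JP.dic and the function name are hypothetical, and this is not GoldenDict's actual parser):

def parse_dic(path):
    # Parse a Hunspell .dic file into a {word: flag_string} mapping.
    entries = {}
    with open(path, encoding="utf-8") as f:
        next(f)  # the first line is the entry count, e.g. 450851
        for line in f:
            line = line.strip()
            if not line:
                continue
            word, _, flags = line.partition("/")
            entries[word] = flags  # "" if the entry has no rules
    return entries

# parse_dic("ja_JP.dic") might give {"高い": "XA", "あ": "", ...}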
Additionally, the words included in GoldenDict's dic file will affect the final lookup results when the morphological function is enabled, so it is best to treat the words in this file with caution and not modify them recklessly.
aff File#
The rules for the aff file are very complex; below I will only explain the rules used in the original project. Those studying other languages or with other needs should refer to the Linux man 4 hunspell manual. (To repeat an important point: my English is only at CET-4 level, and reading the manual was quite challenging; my conclusions come from guessing and verification rather than from truly understanding the manual. Rather than relying on an English-challenged Japanese major, you might as well try it yourself :)
MAP#
Let’s start with an example to introduce the most understandable MAP rule:
# Spelling Variants
MAP 89
MAP あア
MAP かカ
MAP さサ
MAP たタ
MAP 89 indicates that 89 MAP rules follow.
The MAP rule can be simply understood as ignoring specified character differences: for example, MAP あア means that an input ア will also be treated as あ. The original project therefore uses this rule to handle the writing habits of Japanese onomatopoeia. For instance, チョコチョコ is not found in most authoritative dictionaries, but the replaced form ちょこちょこ appears in many of them.
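As I understand it, a MAP rule can be modeled as a table of equivalent characters consulted when a lookup fails. A simplified Python sketch (my own illustration; real Hunspell uses MAP for suggestions and tries the mapped characters position by position, while this sketch swaps them all at once):

MAP_PAIRS = [("あ", "ア"), ("か", "カ"), ("さ", "サ"), ("た", "タ")]

def variants(word):
    # Build a symmetric table: each character maps to its partner.
    table = {}
    for a, b in MAP_PAIRS:
        table[a] = b
        table[b] = a
    # Return the word itself plus the fully swapped form.
    return [word, "".join(table.get(ch, ch) for ch in word)]

# variants("アサ") -> ["アサ", "あさ"], so if the katakana spelling is
# missing from the dictionary, the hiragana form can still be looked up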
However, there is a form of rule at the end of the original project that I have not fully understood:
MAP (ゃ)(ャ)
According to the manual, "Use parenthesized groups for character sequences (eg. for composed Unicode characters)", i.e., parentheses group units that consist of multiple characters, such as composed Unicode characters. These rules should therefore be targeting contracted sounds: the computer represents チョロチョロ with the three characters チ, ョ, and ロ, so perhaps the grouped characters need to be replaced together? I do not fully understand this.
Additionally, rules like the following seem to have no practical significance, because many dictionaries do not include entries like 腕きゞ anyway, so it doesn't matter whether they are replaced or not. If we really want to solve the problems caused by iteration marks (the "dancing characters", 踊り字), we might still need to rely on regular expressions (heh, just showing off~ the "Japanese Non-Dictionary Form Dictionary v3" already handled this perfectly in its last version).
MAP (ゝ)(ヽ)
MAP (ゞ)(ヾ)
(The difference may not be obvious here; it is clearer in images: the symbols in the first column all have a small hook.)
REP#
Next, I will introduce the REP rule, which is quite similar to the MAP rule:
REP 12
REP かけ 掛け
REP かか 掛か
(The original project actually wrote REP かけ 掛け かけ, which may be a mistake...)
Like MAP, REP 12 indicates that 12 REP rules follow. Unlike MAP, however, in REP かけ 掛け the かけ is the input and 掛け is the replacement result (the order is the reverse of MAP). Additionally, a REP rule can replace multiple characters at once (its parameters are separated by tabs, while MAP can only use parentheses, which is rather limited).
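My understanding of REP during lookup, as a rough Python sketch (again my own illustration, not Hunspell's actual algorithm; the function name is hypothetical):

REP_RULES = [("かけ", "掛け"), ("かか", "掛か")]

def rep_candidates(word):
    # Apply each REP rule once and collect the rewritten spellings.
    results = []
    for src, dst in REP_RULES:
        if src in word:
            results.append(word.replace(src, dst))
    return results

# rep_candidates("ふりかける") -> ["ふり掛ける"]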
Basic Format of aff Files#
Next, let's go back to introduce the basic format of aff files. All aff files should start like this:
SET UTF-8
LANG ja
FLAG long
# https://github.com/MrCorn0-0/hunspell_ja_JP/
SET sets the file encoding; LANG specifies the language the rules apply to (refer to the manual for other languages); FLAG long indicates that rule names consist of two extended ASCII characters. For example, a rule that will be used later is named XA:
SFX XA Y 1
SFX XA い く い
If you prefer to number rules directly with digits, you can write FLAG num at the beginning of the file. According to the manual, "The 'long' value sets the double extended ASCII character flag type, the 'num' sets the decimal number flag type." You can then name rules like this:
# Adjective く
SFX 001 Y 1
SFX 001 い く い
To make future modifications easier, it is best to explain number-based names in comments. Note that # must appear at the beginning of a line for that line to be treated as a comment.
However, either naming scheme raises a question: how do we apply multiple rules to a single word? (After all, a word's different inflections are not all produced by the same rule, right?)
If using long, we simply write the flags one after another (each name is exactly two characters long, so the program can parse them):
高い/XAXB
When using numbers, we need to separate them with a half-width (ASCII) comma:
高い/001,002
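So the practical difference between FLAG long and FLAG num is only in how the string after / is split into rule names. A small Python sketch of that splitting (my own illustration):

def split_flags(flag_field, flag_type="long"):
    # Split the part after "/" in a dic entry into individual rule names.
    if flag_type == "long":
        # FLAG long: fixed width of two characters, "XAXB" -> ["XA", "XB"]
        return [flag_field[i:i + 2] for i in range(0, len(flag_field), 2)]
    if flag_type == "num":
        # FLAG num: decimal numbers separated by commas
        return flag_field.split(",")
    # default flag type: one ASCII character per rule name
    return list(flag_field)

# split_flags("XAXB") -> ["XA", "XB"]
# split_flags("001,002", "num") -> ["001", "002"]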
SFX#
The reason for going back to introduce the initial part of the aff file is mainly to emphasize that the dic file can specify multiple rules. These multiple rules do not refer to MAP and REP, but to the SFX rules that support custom naming, which will be introduced next.
Here, I specifically emphasize "support for custom naming" because naming will affect both the aff file and the dic file.
It was mentioned earlier that the dic file may contain such a notation:
高い/XA
To some extent, the SFX rule is the true morphological rule. Through this function we can define affixes and restore word forms, enabling the derivation of Japanese inflections and the restoration of dictionary forms. (The following content mainly refers to the AFFIX FILE OPTIONS FOR AFFIX CREATION section of the manual.)
First, let’s explain with a simple example:
SFX XA Y 1
SFX XA い く い
In the first line, XA is the name of the affix class we defined; Y is a fixed parameter, and the manual says the 1 is the number of affix rules contained in the class named XA. In the second line, the first い is the stripping string, which the manual describes as "stripping characters from beginning (at prefix rules) or end (at suffix rules) of the word". My understanding is that we defined an affix named XA whose actual content is い; this affix is the part the program will process. The く means the rule takes effect when the input word ends with く, and the trailing い is a condition that must be met for the rule to apply: the derived word must end with い. If this condition is not met, the derivation result will not be displayed.
For example, when we input 高く, the program replaces the く with い (this い is the first い in the second line) and then checks whether the dic file contains a word ending in い, namely 高い (the "ending in い" requirement comes from the second い in the second line). If such a word exists, GoldenDict jumps directly to the corresponding entry.
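Putting the three fields together, here is how I picture the lookup direction, as a minimal Python sketch (hypothetical code, not GoldenDict's actual algorithm; "0" stands for an empty field, as in the aff syntax):

import re

def undo_sfx(word, strip, add, condition, dictionary):
    # Try to undo one rule of the form: SFX XX <strip> <add> <condition>
    strip = "" if strip == "0" else strip
    add = "" if add == "0" else add
    if not word.endswith(add):
        return None
    # remove the added part, then put the stripped part back
    candidate = word[:len(word) - len(add)] + strip
    # "." means no condition; otherwise the condition constrains the
    # end of the restored dictionary word
    if condition != "." and not re.search(condition + "$", candidate):
        return None
    return candidate if candidate in dictionary else None

# undo_sfx("高く", "い", "く", "い", {"高い"}) -> "高い"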
This is considered simple because in the original project, this rule is actually as follows:
# Adjective 文く
SFX XA Y 2
SFX XA し く/BaTe し
SFX XA い く/BaTe い
This is because in Japanese, 高く can inflect further, so the user may select text such as 高くば or 高くて. The original author fully considered this characteristic of Japanese and used the / notation to handle nested, continued transformations (sound familiar? In the dic file this symbol marks which rules a word can use, and you can think of it as a part-of-speech tag).
It is important to note that BaTe consists of two independent rules, both custom-defined by the original author. You can find them in the original project's aff file:
SFX Ba Y 1
SFX Ba 0 ば .
(I am not sure whether 高くば is actually grammatical, but when I tested this rule in isolation, it did make the software display the entry for 高い when 高くば was entered.)
SFX Te Y 3
SFX Te 0 て [っいしく]
SFX Te 0 で ん
SFX Te 0 て .
(The key line is SFX Te 0 て .; from the standpoint of Japanese grammar, the other two are unrelated to SFX XA し く/BaTe し and SFX XA い く/BaTe い; the original author may simply have grouped them together out of personal habit.)
Here the . appears as a very special character. As mentioned earlier, its position holds the condition on the end of the replacement result, and the manual states "Zero condition is indicated by dot." So SFX Te 0 て . means: delete any て at the end of the word, with no further condition.
The function of these rules may still be hard to grasp from the explanation alone. Let's return to SFX XA い く/BaTe い and put it together with SFX Te 0 て . for another example: the original author designed this combination to handle the input 高くて. (Removing /BaTe would make word selection more demanding for the user, so the original project really is well designed.)
The previous example involves twofold nesting, which may still be difficult to understand. If you have questions, refer to the manual's sections on Twofold suffix stripping and AFFIX FILE OPTIONS FOR AFFIX CREATION. (Honestly, I haven't fully understood it either.)
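For what it's worth, here is how I picture the twofold case, as a Python sketch under the assumption that at most two suffix rules chain and that the outer rule must be listed in the inner rule's continuation flags (my reading of the manual, not a verified implementation; all names are my own):

import re

# Rule table: name -> list of (strip, add, condition, continuation flags)
RULES = {
    "XA": [("い", "く", "い", "BaTe")],
    "Te": [("0", "て", ".", "")],
}

def undo(word, strip, add, cond):
    strip = "" if strip == "0" else strip
    add = "" if add == "0" else add
    if not word.endswith(add):
        return None
    cand = word[:len(word) - len(add)] + strip
    if cond != "." and not re.search(cond + "$", cand):
        return None
    return cand

def deinflect_twofold(word, dictionary):
    results = set()
    for outer_name, outer_rules in RULES.items():      # e.g. Te
        for s2, a2, c2, _ in outer_rules:
            mid = undo(word, s2, a2, c2)
            if mid is None:
                continue
            for inner_rules in RULES.values():          # e.g. XA
                for s1, a1, c1, cont in inner_rules:
                    # the outer rule must appear in the inner rule's
                    # continuation flags (the /BaTe part, split 2 by 2)
                    allowed = [cont[i:i + 2] for i in range(0, len(cont), 2)]
                    if outer_name not in allowed:
                        continue
                    base = undo(mid, s1, a1, c1)
                    if base is not None and base in dictionary:
                        results.add(base)
    return results

# deinflect_twofold("高くて", {"高い"}) -> {"高い"}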
Here’s another example:
SFX To Y 1
SFX To 0 とも .
To is the rule name; the 0 means nothing needs to be stripped (a zero stripping string); とも is the affix that appears on the actual input word; and the . means there is no condition on the replacement result. The function of the rule named To is therefore to remove the とも at the end of the input.
Additionally, in SFX Te 0 て [っいしく], the [っいしく] means that the character immediately before the て must be one of っ, い, し, or く for the rule to apply.
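In other words, the bracketed condition behaves like a tiny regular expression anchored at the end of the restored word. A quick Python check (illustration only; the function name is my own):

import re

def condition_ok(candidate, condition):
    # Hunspell conditions read like small regexes anchored at the word end;
    # "." means no condition at all.
    return condition == "." or re.search(condition + "$", candidate) is not None

# After removing て from 走って, the remainder ends in っ:
# condition_ok("走っ", "[っいしく]") -> True
# condition_ok("食べ", "[っいしく]") -> False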
Below are some more complex custom rules, which I will briefly introduce:
The [^行] here has the same meaning as in regular expressions; the first rule below only applies to words that end in く but not in 行く:
SFX 5T く い/TeTaTrTm [^行]く
SFX 5T く っ/TeTaTrTm 行く
This rule is slightly longer but actually has no special significance:
SFX KN く け/RUTeTaTrTmf1f2f3f4f5f6m0m1m2m3m4m5m6m7TiTItiSuNgQq1Eba1M1myo く
This rule could actually be split into two, but the original author flexibly used the [] syntax:
SFX xU い かる/bs [しじ]い
Aside#
My interest in morphology stems entirely from running into tricky problems in the "Japanese Non-Dictionary Form Dictionary" project and wanting to see whether other solutions existed. That's why I spent nearly a week reading the obscure manual. Although I only have a rough understanding, I have already felt the power of GoldenDict's Hunspell morphology ~~linguistics is eternal!~~. In the spirit that teaching someone to fish is better than giving them a fish, I am sharing my summary in the hope of helping everyone understand this functionality a bit better, and I look forward to everyone working together to improve GoldenDict's Hunspell morphology. Go submit issues and PRs at hunspell_ja_JP!
However, there is an even more important reason: I just joined FreeMdict, and @epistularum created a GoldenDict morphology demo (on GitHub) using an approach similar to the "Japanese Non-Dictionary Form Dictionary". To communicate with them, I decided to finally prioritize this matter that I had put off for months (my poor English really makes things difficult...).
Additionally, I would like to preview that GoldenDict can easily solve the previously mentioned issues with the kanji writing of Japanese compound verbs:
Interestingly, GoldenDict's Hunspell functionality can return multiple results, while Eudic's similar functionality only supports returning one. The manual does note one feature that may not work on mobile ("BUG: UTF-8 flag type doesn't work on ARM platform."), but surely Eudic wouldn't avoid adopting the technology for that reason alone...
But in any case, it should support multiple spelling results, such as:
雨が降ります。
バスから降ります。
Similar Functionality in Eudic#
I discussed the similar functionality in Eudic with friends on the FreeMdict Forum:
No major optimizations were found, but some minor optimizations do exist:
- Some sentence patterns are missing, for example 言わざる in 言わざるを得ない (though selecting 言わ will also yield results)
- Colloquial expressions such as ん, と, ちゃ, etc.
Hunspell itself seems to handle only twofold nesting, so I estimate that complex patterns like 食べたければ may not be solvable (which means you still need to think carefully before selecting words; you can't just point at whatever you don't understand).
Additionally, I may not have expressed myself clearly: Eudic has an "inflection restoration" function, but its technology does not seem to resemble Hunspell:
- multiple nesting (the original project seems unable to achieve this)
- selecting only the end of a word (the original project can do this)
- ignoring Japanese writing habits
- adjectives do not even support the simplest transformations
From the results, it seems that Eudic has specifically developed a non-open-source inflection derivation tool but does not allow user customization, so it might be worth trying to provide feedback to Eudic's official team to help them improve.
References#
Tutorials#
- Linux hunspell official website: the key document is the man 4 hunspell file; the other documents introduce the technical details of the Linux implementation. Here is my annotated version:
- Introducing Hunspell to German Language Students - Xu Yinuo's article - Zhihu: organizes the materials used in the man 4 hunspell PDF and explains some simple concepts.
Open Source Projects#
- MrCorn0-0/hunspell_ja_JP: Hunspell morphology dictionary for Japanese used in GoldenDict. (github.com): nearly 400 morphological rules written based on Japanese grammar; the author is Chinese.
- epistularum/hunspell-ja-deinflection: Hunspell dictionary to deinflect all Japanese conjugated verbs to the dictionary form and suggest correct spelling. (github.com): rules written around the idea of replacing word endings; not very complete; the author is not Chinese.
- https://github.com/wooorm/dictionaries: Contains morphological rules written in JavaScript, but does not include Japanese.
Related#
Note: This article is backed up on the following platforms: