A Brief Analysis of the Morphological Rules of GoldenDict, Using the hunspell_ja_JP Project Files as an Example

Abstract: This article analyzes the morphological rules of GoldenDict, using the hunspell_ja_JP project files as an example. (Well, this article is quite technical, and I don't really know how to write an abstract, haha.)

Introduction#

This article uses the morphology files from the MrCorn0-0/hunspell_ja_JP project on GitHub ("Hunspell morphology dictionary for Japanese used in GoldenDict"; hereinafter the "original project") as an example, referencing the man 4 hunspell PDF provided on the official Linux Hunspell website (hereinafter the "manual"), to explain the basic morphological rules.

A preliminary statement: I am no expert and have only a rough grasp of some of the rules in the original project's morphology files. There may be errors, so please read critically and comment kindly. Also, the original project covers only Japanese; if you have morphology questions about other languages, I may not be able to answer them, so please consult the manual.

GoldenDict's morphological functionality is primarily powered by Hunspell, which originated as a spell-checking tool on Linux, so there are slight differences from what is described in Introducing Hunspell to German Language Students. GoldenDict's morphology uses only the two files with the extensions .dic and .aff, and does not include the files with the extensions .morph, .good, .wrong, or .sug. The rules for the .dic and .aff files, however, are exactly the same as those in the Hunspell manual.

Regarding the functions of these two files, I quote from Introducing Hunspell to German Language Students:

  • The .dic dictionary file contains entries similar to the headwords (lexemes) in a printed dictionary.
  • The .aff file contains a set of rules for restoring forms, detailing how to convert a complex form with prefixes, suffixes, or compounded with other words into a headword that exists in the dictionary.

Basic Format of dic Files#

The dic file is relatively simple, structured like the following (the number on the first line is the count of entries in the dic file):

450851
...
高い/XA

Some words are followed by a / and what looks like gibberish. This notation can be loosely understood as marking the word's part of speech: it indicates that the word uses a transformation rule named XA, which will be explained in detail later. For now, it is enough to know that the dic file can tag a word's part of speech in this way.

Additionally, the words included in GoldenDict's dic file will affect the final lookup results when the morphological function is enabled, so it is best to treat the words in this file with caution and not modify them recklessly.
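
For illustration, here is a hypothetical three-entry dic file of my own (the flag combinations are made up rather than copied from the original project); an entry without a / simply has no transformation rules attached:

3
高い/XA
行く/5TTe
東京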

aff File#

The rules of the aff file are very complex; below I will explain only the rules used in the original project. Those studying other languages, or with other needs, should consult the man 4 hunspell manual of the Linux project. (To repeat an important point: my English is only around CET-4 level, and reading the manual was quite a struggle, so my conclusions come from guessing and testing rather than from truly understanding the manual. Rather than relying on an English-challenged Japanese major, you might as well try it yourself. :)

MAP#

Let's start with an example of the easiest rule to understand, MAP:

# Spelling Variants
MAP	89
MAP	あア
MAP	かカ
MAP	さサ
MAP	たタ

MAP 89 indicates that there will be 89 MAP rules below.

The MAP rule can be simply understood as ignoring the specified character differences. For example, MAP あア indicates that an input ア will be treated as あ. The original project uses this rule to handle the writing habits of Japanese onomatopoeia: チョコチョコ, for example, is not found in most authoritative dictionaries, but the replaced form ちょこちょこ appears in many of them.

However, there are rules at the end of the original project that I have not fully understood:

MAP	(ゃ)(ャ)

According to the manual, "Use parenthesized groups for character sequences (eg. for composed Unicode characters)", that is, use parentheses to group a sequence of characters, such as Unicode composed characters. So these rules presumably target contracted sounds (拗音): a word like チョロチョロ is typed with the three distinct characters チ, ョ, and ロ, so perhaps the grouped characters must be replaced together as a unit? I do not fully understand this.

Additionally, rules like the following seem to have little practical significance, because most dictionaries do not include entries like 腕きゞ anyway, so whether they are replaced makes no difference. If we really want to solve the problems caused by iteration marks (踊り字), we may still have to fall back on regular expressions (heh, just showing off~ the "Japanese Non-Dictionary Form Dictionary v3" already supported this perfectly in its last release):

MAP	(ゝ)(ヽ)
MAP	(ゞ)(ヾ)

(The difference may not be obvious in text; it is clearer in an enlarged image: the symbols in the first column each have a small hook.)

REP#

Next, I will introduce the REP rule, which is quite similar to the MAP rule:

REP 12
REP かけ  掛け
REP かか  掛か

(The original project wrote REP かけ 掛け かけ, which may be a mistake...)

Like MAP, the 12 in REP 12 indicates that twelve REP rules follow. Unlike MAP, though, the rule is directional: in REP かけ 掛け, かけ is the input and 掛け is the replacement result. The REP rule can also replace multi-character strings, with parameters separated by tabs, whereas MAP can only group characters with (), which is rather limited.
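
As a made-up illustration of the effect (the example word is mine, not from the original project): with the rule REP かか 掛か, a kana spelling can be redirected to the kanji headword:

REP	かか	掛か
# input 引っかかる → replaced candidate 引っ掛かる → headword found in the dic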

Basic Format of aff Files#

Next, let's go back to introduce the basic format of aff files. All aff files should start like this:

SET UTF-8
LANG ja
FLAG long
# https://github.com/MrCorn0-0/hunspell_ja_JP/

SET sets the file encoding;

LANG specifies the language applicable for the rules; please refer to the manual for other languages;

FLAG long indicates that rule names must consist of exactly two ASCII characters. For example, a rule used later is named XA:

SFX	XA	Y	1	
SFX	XA	い	く	い

If you prefer to number rules directly, the manual allows writing FLAG num at the beginning of the file: "The long value sets the double extended ASCII character flag type, the num sets the decimal number flag type." You can then name rules like this:

# Adjective く
SFX	001	Y	1	
SFX	001	い	く	い

To make future modification easier, it is best to document numeric names with comments. Note that # must appear at the beginning of a line for that line to be treated as a comment.

However, naming rules by numbers raises a question: how do we apply multiple rules to a single word? (After all, different words do not inflect the same way, right?)

If using long, we simply write the rule names one after another (each name is exactly two characters long, so the program can tell them apart):

高い/XAXB

When using numbers, we separate them with a half-width (ASCII) comma:

高い/001,002

SFX#

The reason for going back to the beginning of the aff file was mainly to emphasize that the dic file can attach multiple rules to a word. These multiple rules are not MAP and REP, but the SFX rules with custom names, which are introduced next.

Here, I specifically emphasize "support for custom naming" because naming will affect both the aff file and the dic file.

It was mentioned earlier that the dic file may contain such a notation:

高い/XA

To some extent, the SFX rule is the true morphological rule. Through this function, we can construct affixes and achieve word form restoration, enabling the derivation of Japanese inflections and the restoration of dictionary forms. (The following content mainly refers to the AFFIX FILE OPTIONS FOR AFFIX CREATION section of the manual.)

First, let’s explain with a simple example:

SFX	XA	Y	1	
SFX	XA	い	く	い

On the first line, XA is the name of the affix we are defining; Y is a fixed parameter; and the manual says the 1 is the number of rules contained in the affix named XA. On the second line, the first い is the stripping string: the manual describes it as "stripping characters from beginning (at prefix rules) or end (at suffix rules) of the word", and my understanding is that this rule only applies when the restored word in the dic file ends with い. The く is the actual content of the affix we named XA, the part the program will process: the rule takes effect when the input word ends with く. The trailing い is a condition that must be met before the morphological rule fires: the derived word must end with い; if this condition is not met, the derivation result will not be displayed.
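
To line the fields up (my own annotation of the same rule):

SFX	XA	い	く	い
# XA = rule name
# 1st い = stripped from the restored headword
# く = the affix itself, i.e. what the inflected input ends with
# 2nd い = condition: the restored word must end with い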

For example, when we input 高く, the program replaces the く with い (this い is the first い on the second line), then checks whether the dic file contains a word ending in い, namely 高い (the requirement that it end in い comes from the second い, the condition). If such a word exists, GoldenDict jumps straight to the corresponding entry.
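
If you want to verify this yourself, a minimal pair of files along the following lines should suffice (my own sketch; the file names test.aff and test.dic are arbitrary):

# test.aff
SET UTF-8
FLAG long
SFX	XA	Y	1
SFX	XA	い	く	い

# test.dic
1
高い/XA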

This is considered simple because in the original project, this rule is actually as follows:

# Adjective 文く
SFX	XA	Y	2	
SFX	XA	し	く/BaTe	し
SFX	XA	い	く/BaTe	い

This is because, in Japanese, 高く can inflect further, so the user may select strings like 高くば or 高くて. The original author took this characteristic of Japanese fully into account and used the / notation to handle nested, chained transformations (does it look familiar? In the dic file the same symbol marks which rules a word can use; you can think of it as a part-of-speech tag).

It is important to note that BaTe consists of two independent rules, which are custom-defined by the original author. You can find them in the original project's aff file:

SFX	Ba	Y	1	
SFX	Ba	0	ば	.

(I am not sure whether a form like 高くば is actually grammatical, but when I isolated this rule for testing, entering 高くば did make the software display the entry for 高い.)

SFX	Te	Y	3	
SFX	Te	0	て	[っいしく]
SFX	Te	0	で	ん
SFX	Te	0	て	.

(The key is SFX Te 0 て .; the others are rules that are unrelated to SFX XA し く/BaTe し and SFX XA い く/BaTe い from a Japanese grammar perspective; the original project author may have grouped them together out of personal habit.)

Here the . is a very special character. As mentioned above, this position specifies what must appear at the end of the replacement result, and the manual states "Zero condition is indicated by dot.", so SFX Te 0 て . means that a て at the end of any word may be deleted.

This explanation alone may still make the rules hard to grasp. Let us return to SFX XA い く/BaTe い and put it together with SFX Te 0 て . in another example: the original author designed this combination to handle the input 高くて. (Removing /BaTe would make word selection more demanding, so the original project really is well designed.)
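
Spelled out step by step, the lookup direction works like this (my own walk-through):

# user selects 高くて
高くて → SFX Te strips the trailing て → 高く
高く → SFX XA turns the く back into い → 高い (headword found in the dic)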

The previous example involves double nesting, which may still be difficult to follow. If you have questions, consult the manual's sections on Twofold suffix stripping and AFFIX FILE OPTIONS FOR AFFIX CREATION. (Honestly, I have not fully understood it either.)

Here’s another example:

SFX	To	Y	1	
SFX	To	0	とも	.

To is the rule name; the 0 means nothing is stripped from the headword; とも is the affix carried by the actual input word; and the . means there is no condition on the replacement result. The function of the rule named To is therefore to remove a とも from the end of the input.
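
For example (an illustration of my own; whether the remainder can be restored further depends on the other rules):

SFX	To	0	とも	.
# input 少なくとも → strip とも → 少なく (other rules can then restore 少ない)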

Additionally, in SFX Te 0 て [っいしく], the [っいしく] means that the character immediately before the stripped て must be one of っ, い, し, or く.
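
A few examples of my own to make the condition concrete:

SFX	Te	0	て	[っいしく]
# applies: 行って, 書いて, 高くて (っ / い / く immediately before the て)
# does not apply: 食べて (べ before the て; that case falls to SFX Te 0 て .)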

Below are some more complex custom rules, which I will briefly introduce:

[^行] has the same meaning as in regular expressions: the first rule applies only to words that end in く but not in 行く, while the second handles 行く itself:

SFX	5T	く	い/TeTaTrTm	[^行]く
SFX	5T	く	っ/TeTaTrTm	行く
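
In the derivation direction these two rules produce, for example (my own examples):

# 書く → 書い → 書いて, 書いた (via the Te/Ta continuations; [^行]く branch)
# 行く → 行っ → 行って, 行った (via the Te/Ta continuations; 行く branch)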

This rule is slightly longer but actually has no special significance:

SFX	KN	く	け/RUTeTaTrTmf1f2f3f4f5f6m0m1m2m3m4m5m6m7TiTItiSuNgQq1Eba1M1myo	く

This rule can actually be split into two, but the original author flexibly used the [] syntax:

SFX	xU	い	かる/bs	[しじ]い
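
Written out as two separate rules, it would be equivalent to the following (the count on the SFX xU Y line would then become 2):

SFX	xU	い	かる/bs	しい
SFX	xU	い	かる/bs	じい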

Aside#

My interest in morphology stems entirely from running into tricky problems in the "Japanese Non-Dictionary Form Dictionary" project and wanting to see whether other solutions existed; that is why I spent nearly a week reading the obscure manual. Although my understanding is still rough, I have already felt the power of GoldenDict's Hunspell morphology ~~linguistics is eternal!~~. In the spirit that teaching someone to fish is better than giving them a fish, I am sharing this summary in the hope of helping everyone understand the feature a bit better, and I look forward to improving GoldenDict's Hunspell morphology together. Go submit issues and PRs to hunspell_ja_JP!

There is an even more important reason, though: I had just joined FreeMdict, where @epistularum had created a GoldenDict morphology demo (on GitHub) using an approach similar to the "Japanese Non-Dictionary Form Dictionary". To communicate with them, I decided to finally prioritize this matter I had been postponing for months (my poor English really does make it difficult...).

Additionally, a little preview: GoldenDict can easily solve the previously mentioned issue with the kanji spellings of Japanese compound verbs (screenshot omitted).

Interestingly, GoldenDict's Hunspell integration can return multiple results, while Eudic's similar Hunspell-like feature returns only one. The manual notes just one feature that may not work on mobile ("BUG: UTF-8 flag type doesn't work on ARM platform."), and surely Eudic would not have avoided the technology for that reason alone...

But in any case, a lookup really should be able to return multiple candidates; 降ります, for example, can come from either 降る or 降りる:

雨が降ります。
バスから降ります。

Similar Functionality in Eudic#

I discussed the similar functionality in Eudic with friends on the FreeMdict Forum:

No major optimizations were found, but some minor optimizations do exist:

  1. Some sentence patterns are missing, such as 言わざる in 言わざるを得ない (though selecting 言わ will still yield results)
  2. Colloquial contractions like ん, と, ちゃ, etc.

Hunspell seems to handle only twofold nesting, so I suspect complex patterns like 食べたければ cannot be solved (which means you still need to think before selecting a word; you cannot just point at whatever you do not understand).

Additionally, I may not have expressed myself clearly earlier: Eudic does have an inflection-restoration feature, but its technology does not appear to be Hunspell:
(Screenshot: multiple nesting; the original project seems unable to achieve this.)

(Screenshot: selecting only the end of a word; the original project can do this.)

(Screenshot: ignoring Japanese writing habits.)
(Screenshot: adjectives do not even support the simplest transformations.)
Judging from these results, Eudic seems to have developed its own closed-source inflection-derivation tool that does not allow user customization, so it may be worth sending feedback to Eudic's team to help them improve.

References#

Tutorials#

Introducing Hunspell to German Language Students

Open Source Projects#

MrCorn0-0/hunspell_ja_JP: Hunspell morphology dictionary for Japanese used in GoldenDict (github.com)

Note: This article is backed up on the following platforms:

Analyzing the Morphological Rules of GoldenDict Using the hunspell_ja_JP Project Files as an Example - NoHeartPen's article - Zhihu

Analyzing the Morphological Rules of GoldenDict Using the hunspell_ja_JP Project Files as an Example - Software Experience Exchange Outlook - FreeMdict Forum
