Summary: This article is based on a FreeMdict forum discussion about Hunspell involving xiaoyifang, maintainer of a GoldenDict fork.
Morphology dictionaries do not work with EPWING (Hunspell); is there a way to fix this? I have already opened an issue on the original GoldenDict GitHub. Should I open another issue on xiaoyifang's fork?
Sorry, although you have already fixed the issue, I still want to clarify: the results of GoldenDict's Hunspell function are determined by the words contained in the .dic files, and are not affected by which dictionaries the user has loaded.
The mechanism in the Android version of Youdao Dictionary is different: it is influenced by the dictionaries the user has loaded (so your suspicion does hold for that software).
Additionally, I believe that for Japanese clipboard lookup to work as completely as it does for English, some issues may (possibly) not be solvable through Hunspell at all: Japanese morphological changes are much more complex than English ones, and Japanese has its own unique typesetting rules. That is why I specifically created the "Japanese Non-dictionary Morphological Dictionary"; you can refer to these posts and their discussion threads to get an overview:
This discusses the characteristics of Japanese morphological changes
https://forum.freemdict.com/t/topic/11523
This discusses the unique typesetting rules of Japanese
https://forum.freemdict.com/t/topic/14241/17
If you would like to know more, I will take the time to organize an article to systematically and thoroughly explain what special aspects Japanese has (it may take about a month; if I finish organizing it, I will specifically notify you).
If possible, I hope GoldenDict can handle the issues mentioned above natively, instead of relying on the approach described here [https://forum.freemdict.com/t/topic/14241], which uses scripts and tools such as Quicker.
Hepburn Romanization#
I'm not familiar with Japanese; is this related to the topic here?
It is not very relevant for Chinese learners of Japanese. What you pointed to mainly affects the romanization of Japanese (you can think of romanization as pinyin: both are written with Latin letters).
Specifically, what needs to be handled is similar to the difference between the Hanyu Pinyin scheme in common use today and the Wade-Giles romanization of Chinese. The effect after enabling it is like this:
Tsinghua University is processed as Qinghua University,
Tsingtao is processed as Qingdao,
Peking University is processed as Beijing University
(These examples may not be very rigorous, as some spelling methods are based on pronunciations that are not the current Mandarin)
In other words, it mainly resolves the differences in romanization (or Latin letter) spelling schemes. Chinese learners of Japanese rarely look up words using romanization (like taberu); they generally use kana (like たべる) or kanji (like 食べる) for lookup. However, I have seen some dictionary websites designed by foreigners that support romanization for lookups, which is why such a function exists (but Chinese people probably won't use it).
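To make this concrete, here is a minimal sketch (my own illustration, with an abbreviated mapping table; not code from GoldenDict or from any existing romanization feature) of the kind of spelling-scheme normalization involved: rewriting Kunrei-shiki romaji spellings to their Hepburn equivalents so that either spelling reaches the same headword.

```python
# Illustrative only: normalize a romaji query by rewriting Kunrei-shiki
# spellings to Hepburn equivalents (table abbreviated, not exhaustive).
KUNREI_TO_HEPBURN = {
    "sya": "sha", "syu": "shu", "syo": "sho",
    "tya": "cha", "tyu": "chu", "tyo": "cho",
    "zya": "ja",  "zyu": "ju",  "zyo": "jo",
    "si": "shi", "ti": "chi", "tu": "tsu", "hu": "fu", "zi": "ji",
}

def normalize_romaji(query: str) -> str:
    # Replace longer keys first so "sya" is handled before "si".
    for kunrei, hepburn in sorted(KUNREI_TO_HEPBURN.items(),
                                  key=lambda kv: -len(kv[0])):
        query = query.replace(kunrei, hepburn)
    return query

print(normalize_romaji("taberu"))   # unchanged: already Hepburn-compatible
print(normalize_romaji("tukue"))    # "tsukue" (desk)
print(normalize_romaji("syasin"))   # "shashin" (photo)
```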
What I want to address are issues analogous to those caused by English tense, like the following (for ease of comparison and explanation, the example sentences were written by me):
私はご飯を食べている(I am having dinner)
I am having dinner
私はご飯を食べていた(At that time, I was having dinner)
I was having dinner
私はご飯を食べた。(I had dinner)
I had dinner.
私はご飯を食べなかった(I didn't have dinner)
I didn't have dinner.
母親は私にご飯を食べさせる。(Mom lets me have dinner)
Mom lets me have dinner
母親は私にご飯を食べさせない。(Mom won't let me have dinner)
Mom won't let me have dinner.
The bold parts are the verbs in both languages (they are also the parts that a conventional morphology function needs to recognize). You can see that when English expresses these different meanings, the morphological changes do not pile up on a single verb (so an English verb has only a handful of inflected forms), whereas when Japanese expresses the same range of meanings, the morphological changes are nested one after another on the verb itself: every sentence above contains a new inflected form, and there are far more than the ones shown. This makes the morphology files for Japanese very complex, which is why I want to try other solutions.
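To put a number on the "nested" point, here is a tiny illustration (standard textbook inflections of the ichidan verb 食べる, written for this example; not data from any of the dictionaries above) of how voice, polarity, aspect and tense markers stack on one stem:

```python
# Illustrative only: suffixes stack on the stem of 食べる, so a single verb
# yields 4 voice layers x 6 endings = 24 surface forms in this small table alone.
stem = "食べ"
voice = ["", "させ", "られ", "させられ"]            # plain, causative, passive, causative-passive
ending = ["る", "ない", "た", "なかった", "ている", "ていた"]

forms = [stem + v + e for v in voice for e in ending]
print(len(forms), "surface forms from a single verb:")
print("、".join(forms))
```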
My solution is not very academic (this time the bold parts are the parts that my proposed scheme needs to recognize):
私はご飯を食べている(I am having dinner)
I am having dinner
私はご飯を食べていた(At that time, I was having dinner)
I was having dinner
私はご飯を食べた。(I had dinner)
I had dinner.
私はご飯を食べなかった(I didn't have dinner)
I didn't have dinner.
母親は私にご飯を食べさせる。(Mom lets me have dinner)
Mom lets me have dinner
母親は私にご飯を食べさせない。(Mom won't let me have dinner)
Mom won't let me have dinner.
You can see that the changes to the final kana of 食べる (that is, る) repeat within a limited set, so I built an mdx file by exhaustively listing the possible inflections of that last kana (this became "Japanese Non-dictionary Morphological Dictionary v1" and "v2"): entries such as 食べら, 食べり, 食べれ, 食べさ, 食べま and 食べろ all point to 食べる. Then, for "Japanese Non-dictionary Morphological Dictionary v3", I wrote two scripts, one in Python and one in JavaScript, that derive the original form from the same exhaustive rules, and compared their output against v2. (Testing my own rules against my own listing is somewhat circular, but it still has some verification value.)
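As a rough illustration of the exhaustive-listing idea behind v1/v2 (a minimal sketch, not the actual generation script; the ending set is abbreviated, and @@@LINK= is only one common mdx redirect convention):

```python
# Minimal sketch: every stem+kana headword redirects to the dictionary form
# of an ichidan verb such as 食べる (ending list abbreviated).
ENDINGS = list("らりれろさまてたなず")

def redirect_entries(dictionary_form: str):
    """Yield (headword, target) pairs for an mdx-style redirect list."""
    stem = dictionary_form[:-1]                 # 食べる -> 食べ
    for kana in ENDINGS:
        yield stem + kana, dictionary_form      # e.g. 食べて -> 食べる

for head, target in redirect_entries("食べる"):
    print(f"{head}\t@@@LINK={target}")
```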
In summary, after putting this idea into practice and testing it for six months (with feedback collected on the forum), I have not found any serious issues. So next I plan to submit a PR to Saladict first (communication there will be a bit easier, and I don't understand the C++ and C used by GoldenDict) and observe how it works in practice.
Possibly useless reference: Hepburn Romanization - Wikipedia, the free encyclopedia (wikipedia.org)
MeCab#
Japanese morphological analysis, such as extracting the base form, already has established libraries in industry. The most popular open-source ones all analyze on the basis of lexicons (IPAdic/UniDic); some use custom rules instead, but their results are still machine-trained on those lexicons. Will hand-written rule analysis run into problems? It is best to test against a larger sample: https://clrd.ninjal.ac.jp/unidic/
(Deleting it is useless; I have archived it via email)
The tool you recommended is meant for analyzing running text; looking up individual segmented words is probably not what it was designed for, and the two uses differ in real ways (for example, a word looked up on its own has basically lost its contextual meaning; also, the text being looked up has not been cleaned and needs special preprocessing).
However, we can refer to its processing details and adapt them with some modifications (we don't need to worry about segmentation itself; we only need to focus on the derivation of the base form after segmentation).
Below are my half-finished notes, to give everyone some ideas (I'm not a computer-science major and only know Python, so please don't let me mislead you):
The developer provides source code for other languages here (you can only scroll down slowly; for some reason the page cannot be searched...).
But after downloading it, I found the file was surprisingly small.
Can three Python files really implement Japanese NLP? lol. It presumably still has to call the packaged executable (but I want to study the processing details, and I can't exactly read binary code...). Also, it uses Python 2 syntax...
So I didn't dig any further.
Not giving up, I found another one:
SamuraiT/mecab-python3 (mecab-python; the original version is at https://taku910.github.io/mecab/)
It is an unofficial binding: although it exposes a Python interface, the actual processing is (presumably) not done in Python.
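For reference, a minimal usage sketch of mecab-python3 (assuming `pip install mecab-python3 unidic-lite`; which feature field holds the base form depends on the dictionary in use, e.g. 基本形 in IPAdic and 書字形基本形 in UniDic):

```python
# Minimal mecab-python3 sketch: print each token's surface and feature fields.
import MeCab

tagger = MeCab.Tagger()   # uses whichever dictionary is installed (e.g. unidic-lite)

def tokens(sentence: str):
    # parse() returns one line per token: surface<TAB>comma-separated features,
    # followed by an "EOS" line; the base form is one of those feature fields.
    for line in tagger.parse(sentence).splitlines():
        if not line or line == "EOS":
            continue
        surface, features = line.split("\t", 1)
        yield surface, features.split(",")

for surface, features in tokens("私はご飯を食べていた"):
    print(surface, features)
```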
About Large Sample Verification#
I didn't want to write code, nor ask others to write it, so I deleted my post. Modern search engines basically all use this family of lexicons and morphological analyzers, but they are not well suited to client-side use. It is great that you are able to summarize and improve on this, but it is best to test against a large sample so that client-side developers will have the confidence to adopt it.
Yes, I agree with your point; we need to verify against a large sample. Manual collection alone is too slow (in fact, I had the idea of consciously collecting inflections two years ago, but only after actually starting the work six months ago did I realize how much I was still missing).
The MeCab output you recommended contains two columns that can be used for this comparison, but I don't have a segmented corpus on hand, so previously I only mentioned it briefly (a sketch of the planned comparison follows the request below):
[ ] Anyone with a MeCab-segmented corpus is welcome to send it to [email protected]; I only need the two columns 書字形 (written surface form) and 書字形基本形 (written base form). I would be very grateful :)
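For when such a corpus does arrive, a minimal sketch of the planned comparison could look like this (the file name, column order, and the placeholder rule function are all assumptions for illustration; they are not the real v3 scripts):

```python
# Hypothetical large-sample check: compare a rule-derived base form against the
# 書字形基本形 column of a MeCab/UniDic-segmented corpus (assumed two-column TSV).
import csv

def rule_based_base_form(surface: str) -> str:
    """Placeholder for the v3 rule-based derivation; not the real implementation."""
    return surface if surface.endswith("る") else surface[:-1] + "る"

total = matched = 0
with open("corpus.tsv", encoding="utf-8") as f:
    for row in csv.reader(f, delimiter="\t"):
        if len(row) < 2:
            continue
        surface, base = row[0], row[1]
        total += 1
        if rule_based_base_form(surface) == base:
            matched += 1

if total:
    print(f"agreement: {matched}/{total} = {matched / total:.2%}")
```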
Two months have passed and, sure enough, I haven't received a single file (maybe I should open a dedicated post for this, lol).
However, I have found an old computer, so I will take the time to build a segmented corpus myself; I expect to start the large-sample verification around National Day (early October).