卿少納言


Mecab Installation Guide

A brief introduction to the installation method of the tool [[Mecab]] used in Japanese natural language processing.


Introduction#

If you only need to analyze a small amount of data, there are ready-made tools online such as Web ちゃまめ that can parse text online.

There are many tutorials on installing MeCab online, but on closer inspection most are written rather carelessly. The one worth reading in detail is Japanese Word Segmenter Mecab Documentation on I Love Natural Language Processing, which translates the official documentation.

Additionally, the discussions in Simple Use of Mecab Japanese Word Segmentation Tool - FreeMdict Forum are also worth reading.

If there are other good tutorials, feel free to add them.

Before getting into the main topic, let's briefly mention the two key factors that affect "morphological analysis": morphological analyzers developed based on different algorithms and morphological analysis dictionaries.

There are many analyzers available, such as awesome-japanese-nlp-resources, which lists a large number of analyzers developed in various programming languages and optimized for different use cases.

The morphological analysis dictionaries, by contrast, are far less varied: the field is currently dominated by ChaSen, JUMAN, and the UniDic dictionary. ChaSen's latest version, 2.4.5, dates from June 25, 2012; JUMAN's latest, 1.02, from January 12, 2017; only the UniDic dictionary has kept up a roughly annual update cadence over the last five years.

Currently, the most widely used open-source morphological analyzer is MeCab, and below is an explanation of how to install and use it on Windows. Just remember "morphological analysis = morphological analyzer + morphological analysis dictionary," and installing other morphological analyzers shouldn't be too much of a problem.

Here is a backup installation file:
https://www.123pan.com/s/iGz0Vv-svEVh.html

Installation via Installation Package#

First, let's introduce the safest method, suitable for those who only need the default parsing format.

Go to the official homepage, MeCab: Yet Another Part-of-Speech and Morphological Analyzer, and download the installer provided there: https://drive.google.com/uc?export=download&id=0B4y35FiV1wh7WElGUGt6ejlpVXc (corresponding path in the backup: morphological analysis > morphological analyzer > mecab).

(Various third-party libraries found online bundle the MeCab installer, so a separate installation is sometimes unnecessary; however, the bundled version may not be the latest, 0.996, so it is safer to install from scratch as described here.)

During installation, be sure to select the UTF-8 encoding option; the remaining options can be left at their defaults by clicking Next.


Note: You can change the program path, but if you do, you may need to add it to the environment variables manually. The MeCab main program takes up little space anyway, so I personally see no need to change the path (mainly because if something goes wrong, you may end up reinstalling).

At this point the installation is essentially complete, and you can call MeCab directly from the command line. For the specific commands, see Japanese Word Segmenter Mecab Documentation on I Love Natural Language Processing.
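Whether the `mecab` executable is actually reachable from the command line can also be checked from Python. A minimal stdlib-only sketch (`shutil.which` returns `None` when the program is not on `PATH`, which is the symptom of a missing environment variable entry):

```python
import shutil
import subprocess

# Locate the mecab executable on PATH; returns None if it is not found.
mecab_path = shutil.which("mecab")

if mecab_path is None:
    print("mecab not found on PATH; check the installation or the environment variables")
else:
    # --version prints the installed version string (e.g. for release 0.996)
    result = subprocess.run([mecab_path, "--version"],
                            capture_output=True, text=True)
    print(result.stdout.strip())
```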

Installation via mecab-python3#

The above method is not flexible enough for scenarios that require processing large amounts of data in custom formats. Moreover, the official site provides no installer for macOS, so here is an alternative route through Python.

First, install the third-party library mecab-python3:

pip install mecab-python3

Then use the following commands to install the unidic-lite dictionary and reinstall mecab-python3 from source so that it switches to it.

pip install unidic-lite
pip install --no-binary :all: mecab-python3

Then run the following code for testing; if there are no errors, the installation is complete.

import MeCab
tagger = MeCab.Tagger("-Owakati")
print(tagger.parse("天気が良いから、散歩しましょう。").split())

tagger = MeCab.Tagger()
print(tagger.parse("天気が良いから、散歩しましょう。"))
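In the default format, `Tagger().parse()` returns one token per line, with the surface form and a comma-separated feature string separated by a tab, terminated by an `EOS` line. A stdlib-only sketch of turning that text into structured data (the sample string below is hand-written for illustration; the actual feature columns depend on the dictionary in use):

```python
# Hand-written sample in MeCab's default output shape (surface<TAB>features);
# real output and feature columns vary by dictionary (ipadic, UniDic, ...).
sample = (
    "天気\t名詞,一般,*,*,*,*,天気,テンキ,テンキ\n"
    "が\t助詞,格助詞,一般,*,*,*,が,ガ,ガ\n"
    "EOS\n"
)

def parse_mecab_output(raw):
    """Split MeCab's default output into (surface, [features]) pairs."""
    tokens = []
    for line in raw.splitlines():
        if line == "EOS" or not line:
            continue  # skip the terminator and empty lines
        surface, _, features = line.partition("\t")
        tokens.append((surface, features.split(",")))
    return tokens

for surface, features in parse_mecab_output(sample):
    print(surface, features[0])  # surface form and top-level part of speech
```

The same post-processing works on real `tagger.parse()` output, which is what makes the Python route more flexible than the fixed command-line formats.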


Possible Issues#

MeCab is developed in C++, and many wrappers for calling it can be found online. However, unexpected issues often arise at the environment-configuration step; some feedback is recorded here:

The following feedback is from amob:

First, when pip-installing some C++-based Python libraries, you need to run the command from the 'Native (or Cross) Tools Command Prompt' that ships with Visual Studio, not from the system's default cmd.
I also forget which misleading MeCab tutorial I once followed: because MeCab's command-line output would not display correctly under the default encoding, I added an AutoRun entry to the registry to set the default code page to UTF-8, which then also broke the normal operation of the Visual Studio environment...
After that it still reported 'Microsoft Visual C++ 14.0 is required'; it turned out that running pip install --upgrade setuptools solved the issue.
Reference pages:
visual studio: x64 Native Tools Command Prompt for VS 2019 initialization failed, script "vsdevcmd\ext\active" could not be found - CSDN Blog
python pip on Windows - command 'cl.exe' failed - Stack Overflow
'Microsoft Visual C++ 14.0 is required' in Windows 10 - Microsoft Community

Custom Dictionary#

The dictionary bundled with MeCab when installed via the installer is ipadic, which was last updated in May 2003.

The unidic-lite installed via mecab-python3, according to its README documentation, is version 2.1.2 from 2013:

At the moment it uses Unidic 2.1.2, from 2013, which is the most recent release of UniDic that's small enough to be distributed via PyPI.

If you have requirements for parsing accuracy, it is more recommended to install the Unidic Dictionary maintained by the National Institute for Japanese Language and Linguistics.

If there are no special requirements, simply download the latest version of UniDic for contemporary written Japanese: https://clrd.ninjal.ac.jp/unidic_archive/2302/unidic-cwj-202302.zip (note: released March 24, 2023, and confirmed still the latest as of February 23, 2024) (backup path: morphological analysis > morphological analysis dictionary > UniDic).

When extracting the archive, pay attention to the folder name; this article assumes it is named unidic-cwj-3.1.1 (if you use a different name, adjust dic_path in the code below accordingly).
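The extract-and-rename step can also be scripted. The sketch below fabricates a tiny placeholder zip so it can run anywhere (the real unidic-cwj archive is much larger and its internal layout may differ; with it, only `archive` and the extracted folder name would change):

```python
import tempfile
import zipfile
from pathlib import Path

work = Path(tempfile.mkdtemp())

# Fabricate a tiny placeholder archive so this sketch is self-contained;
# in practice `archive` would be the downloaded unidic-cwj-202302.zip.
archive = work / "unidic-cwj-202302.zip"
with zipfile.ZipFile(archive, "w") as zf:
    zf.writestr("unidic-cwj-202302/dicrc", "; placeholder")

# Extract the archive into the working directory.
with zipfile.ZipFile(archive) as zf:
    zf.extractall(work)

# Rename the top-level folder to the name the later code expects.
target = work / "unidic-cwj-3.1.1"
(work / "unidic-cwj-202302").rename(target)
print(sorted(p.name for p in target.iterdir()))  # ['dicrc']
```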


Then you can test with the following code.

import MeCab

# Make sure this matches the actual extraction path; to stay consistent with
# the screenshot, the folder was renamed. Use a raw string so that backslash
# sequences such as "\0" are not interpreted as escape characters.
dic_path = r"D:\00temp\unidic-cwj-3.1.1"

# -r nul   : use the Windows null device instead of a mecabrc resource file
# -d DIR   : dictionary directory
# -Ochasen : ChaSen-style output format
tagger = MeCab.Tagger(
    '-r nul -d {} -Ochasen'.format(dic_path).replace('\\', '/'))

text = "天気が良いから、散歩しましょう。"
print(type(tagger.parse(text)))
print(tagger.parse(text).split("\n"))
print(tagger.parse(text))
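The option string passed to `MeCab.Tagger` is easy to get wrong on Windows because of backslashes. A small stdlib-only helper that normalizes the path with `pathlib` instead of string replacement (the function name is made up for illustration):

```python
from pathlib import PureWindowsPath

def build_tagger_args(dic_dir, output_format="chasen"):
    """Build a MeCab.Tagger option string with forward slashes.

    -r nul : use the Windows null device instead of a mecabrc file
    -d DIR : dictionary directory
    -O...  : output format defined by the dictionary's dicrc
    """
    dic = PureWindowsPath(dic_dir).as_posix()  # D:\x\y -> D:/x/y
    return "-r nul -d {} -O{}".format(dic, output_format)

args = build_tagger_args(r"D:\00temp\unidic-cwj-3.1.1")
print(args)  # -r nul -d D:/00temp/unidic-cwj-3.1.1 -Ochasen
```

The resulting string can then be passed directly to `MeCab.Tagger(args)`.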

mecab-ipadic-NEologd#

mecab-ipadic-NEologd: Neologism dictionary for MeCab
Project address: https://github.com/neologd/mecab-ipadic-neologd/blob/master/README.ja.md
License: Apache License, Version 2.0

The "Neolog" in the project name comes from "neologism," meaning "new words," so this morphological analysis dictionary parses newly coined words well. However, it must be compiled from source, which I have not attempted; I hope someone will contribute a tutorial.

References#

Simple Use of Mecab Japanese Word Segmentation Tool - FreeMdict Forum: Provides very detailed explanations and example code.

Other Morphological Analyzers#

As mentioned earlier, many morphological analyzers can be found via awesome-japanese-nlp-resources. Beyond picking one for a specific use case, the evaluations of the other analyzers in The Development Background of MeCab are also worth reading:

Commercial morphological analyzers distributed before JUMAN had essentially fixed dictionaries and part-of-speech systems that users could not define freely. JUMAN externalized all of these definitions, making them freely definable.
Dictionaries were relatively easy to obtain, but connection costs and word-occurrence costs had to be defined by hand. Every time a parsing error was found, the connection costs had to be corrected within a range that caused no side effects, so development costs were high.
In addition, since JUMAN was developed for Japanese morphological analysis, its unknown-word handling was specialized for Japanese and users could not define their own. The part-of-speech system was also fixed at two levels, which imposed certain limitations.

One of the contributions of ChaSen is that it began estimating connection costs and word occurrence costs through statistical processing (HMM). Thanks to this processing, it became possible to automatically estimate cost values simply by accumulating parsing errors. Furthermore, the part-of-speech hierarchy became unlimited, allowing for (truly) free definitions, including the part-of-speech system.
However, the more complex the part-of-speech system, the more the problem of data sparsity arises. When using HMM, it is necessary to fix the internal state (Hidden Class) of the HMM to one, requiring a "conversion" from each part of speech to the internal state. It is simple to assign each part of speech to one internal state, but if all parts of speech are expanded to include conjugation, the number can reach 500, making it impossible to obtain reliable estimates for low-frequency parts of speech. Conversely, for high-frequency parts of speech such as "particles," high accuracy cannot be achieved unless they are included in the internal state. The more complex the part-of-speech system, the more difficult it becomes to define the internal state. In other words, the current (complex) part-of-speech system is insufficient for HMM, and the manual costs to supplement it are increasing.
Additionally, ChaSen does not come with a cost value estimation module. It seems to be usable internally at NAIST, but due to the reasons mentioned above, there are many parameters that need to be set, making it difficult to master.
Furthermore, ChaSen's unknown word processing is also hard-coded and cannot be freely defined.

The concerns raised in the above evaluations are generally consistent with the development trends of morphological analyzers:

  1. Morphological analyzers are abandoning grammar rule-based approaches and completely shifting to purely mathematical algorithms based on statistics;
  2. They are abandoning custom morphological analysis dictionaries and using resources like UniDic that are constructed and maintained by authoritative institutions;
  3. They are beginning to attempt to support the parsing of multiple languages simultaneously.

Here are a few morphological analyzers that I have personally researched a bit:

GiNZA - Japanese NLP Library:
Development Language: Python
License: MIT license
Last Updated: 2023-09-25

Note: This morphological analysis tool was open-sourced in 2019 by Megagon Labs, an AI research institution under the Japanese company Recruit (リクルート).

Kuromoji:
Development Language: Java
License: Apache-2.0 license
Last Updated: 5 years ago

Note: Judging from results extracted from the MOJi Android APK, this appears to be the parser that app uses. Additionally, [[Elasticsearch]] also uses it by default.
