The online seminar on Japanese corpus linguistics, "Language Resource Workshop 2024," is here!
Language Resource Workshop 2024#
An online seminar on corpora and computational linguistics hosted by the National Institute for Japanese Language and Linguistics. Attendance is free, but please fill out the registration form on the official website before attending: https://clrd.ninjal.ac.jp/lrw2024.html
Below I list some of the presentations I am interested in; the full programme is on the official website: https://clrd.ninjal.ac.jp/lrw2024-programme.html.
Additionally, the schedule for the academic conference "68th Annual Meeting of the Society for Quantitative Linguistics," hosted by the National Institute for Japanese Language and Linguistics, has been released. It will be held in person, so if you are interested, see the official website for more information.
https://sites.google.com/view/mathling2024/%E3%83%9B%E3%83%BC%E3%83%A0
Day 1: August 28 (Wednesday)#
09:30〜10:45#
o01: [[The occurrence of "sentence inclusion" in conversation data]]
What is "sentence inclusion"? Expressions such as "hurry up aura," "I'm trying my best appeal," and "let's start the Pokémon card game campaign" are a peculiar linguistic phenomenon in which an element equivalent to a "sentence" occurs inside a word, deviating from the general word formation rule that units larger than a word cannot fit inside a word (this presentation calls it "sentence inclusion").
I have collected many example sentences from anime subtitles while researching [[non-dictionary]]; many of them do not conform to standard Japanese grammar and closely resemble the "sentence inclusion" discussed in this presentation. I want to see how academia views these non-standard examples.
10:55〜12:10#
o04s: [[Verification of the effectiveness of large-scale language models for the semantic classification of katakana words]]
This paper reports on the methods and results of semantic classification of katakana words in context using an LLM.
https://clrd.ninjal.ac.jp/lrw/lrw2024/o04s-paper.pdf
Semantic classification? I'm curious about how it's done. I designed a prompt in this direction:
# Role: Dictionary Query Assistant
## Profile
- Author: NoHeartPen
- Version: 0.1
- Description: The dictionary query assistant searches for the closest meaning to the context from the complete explanations provided by authoritative dictionaries.
## Rules
1. Respect the original text; do not translate the complete explanations provided by the dictionary, and do not modify the complete explanations provided by the dictionary.
2. When a usage not included in the dictionary appears in the context, return "The dictionary has not included this usage." At other times, no additional explanation is needed; just return the dictionary explanation.
## Workflow
1. Ask the user to provide context in the format "Context: [], Word to query: [], Complete dictionary explanation: []".
2. Analyze the closest explanation in the complete dictionary explanation provided by the user to the context, based on the word to query.
3. Only return the relevant explanation closest to the context; do not return other explanations unrelated to the context.
4. No need to translate the dictionary explanation, no need for any additional explanation.
## Initialization
As the role <Role>, strictly adhere to <Rules>, and warmly welcome the user. Then introduce yourself and inform the user of <Workflow>.
## Example
Context: [全部さらけ出して], Word to query: [さらけ出して], Complete dictionary explanation: [さらけ‐だ・す【×曝け出す】
[動サ五(四)]
① 隠すところなく、すべてを現す。ありのままを見せる。「内情を―・す」「弱点を―・す」
② 追い出す。
「おらあ女房を―・してしまって」〈滑・膝栗毛・発端〉]
Your answer: ① 隠すところなく、すべてを現す。ありのままを見せる。「内情を―・す」「弱点を―・す」
(Note: This prompt performs poorly on GPT-3.5 and many domestic (Chinese) AI models, but works well on GPT-4o mini, making it quick to find the closest sense in authoritative dictionaries like "Daijisen." With a slightly modified example, it also works well with domestic AI models for looking up English words in the "Oxford Advanced Learner's English-Chinese Dictionary.")
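For reference, here is a minimal sketch of how the user message for this prompt could be assembled in the "Context: [], Word to query: [], Complete dictionary explanation: []" format from the Workflow section. The function name `build_query` is my own placeholder, and sending the message to a model is left out on purpose, since that depends on whichever API you use.

```python
def build_query(context: str, word: str, dictionary_entry: str) -> str:
    """Fill the template that the prompt's Workflow step 1 asks the user for."""
    return (
        f"Context: [{context}], "
        f"Word to query: [{word}], "
        f"Complete dictionary explanation: [{dictionary_entry}]"
    )


# The dictionary entry from the Example section above, kept verbatim.
entry = (
    "さらけ‐だ・す【×曝け出す】\n"
    "[動サ五(四)]\n"
    "① 隠すところなく、すべてを現す。ありのままを見せる。「内情を―・す」「弱点を―・す」\n"
    "② 追い出す。\n"
    "「おらあ女房を―・してしまって」〈滑・膝栗毛・発端〉"
)

message = build_query("全部さらけ出して", "さらけ出して", entry)
print(message)
```

The system prompt stays fixed, and only this message changes per lookup, which keeps the dictionary text untouched as Rule 1 requires.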
o06s: Analysis of structural patterns of noun phrases containing Sino-Japanese verbal nouns - Based on BCCWJ data -
When Sino-Japanese verbal nouns are used in noun phrases, there are at least three structural patterns: verb type ("verbal noun + suru/shita"), noun type ("verbal noun + no"), and adjective type ("verbal noun + teki/teki na/na"). The results confirmed that (1) the verb-type structural pattern is the most typical, (2) there are constraints on the noun-type structural pattern, and (3) the adjective-type structural pattern is exceptional. It was also revealed that factors such as the part of speech of the verbal noun, usage environment, semantic category, and era influence the selection of each pattern.
Among the papers my supervisor recommended while I was writing my thesis was an article by this author, and I didn't expect to come across their work again here. The direction and conclusions are quite interesting.
14:10〜15:50#
o07s: Construction of the "Chinese Video Audio Corpus" - Aiming for accurate transcription through multiple modalities
I originally planned to write something similar to [[Conan Bilingual Corpus]], but I really didn't have time to work on it before finishing [[Easy to Check]]. I want to see what technology stack they used and what their needs are.
Chinese videos uploaded to video sharing sites generally have subtitles embedded as image data within the video frames. To enable broader text collection when creating a Chinese corpus, it is necessary to use text recognition or speech recognition methods on the video. In this study, we will implement an application that can simultaneously display and search text obtained from multiple resources, such as OCR for embedded subtitles, speech recognition for audio, and subtitles prepared by video creators. We will also attempt to collect several genres and conduct language analysis.
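To make "display and search text from multiple resources" concrete for myself, here is a rough sketch of one possible data model. It is only my guess at the idea, not the presenters' implementation: the `Segment` class, the source labels, and the sample lines are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    start: float   # seconds from the beginning of the video
    end: float
    source: str    # "ocr", "asr", or "creator" (hypothetical labels)
    text: str


def search(segments: list[Segment], keyword: str) -> list[tuple[Segment, list[Segment]]]:
    """Return each segment containing the keyword, paired with the
    time-overlapping segments from the other transcription sources."""
    results = []
    for hit in segments:
        if keyword not in hit.text:
            continue
        overlapping = [
            other for other in segments
            if other is not hit and other.start < hit.end and hit.start < other.end
        ]
        results.append((hit, overlapping))
    return results


# Made-up sample data: the same utterance seen by OCR and by speech recognition.
segments = [
    Segment(0.0, 2.0, "ocr", "你好，世界"),
    Segment(0.2, 1.9, "asr", "你好世界"),
    Segment(2.5, 4.0, "asr", "谢谢大家"),
]

for hit, others in search(segments, "世界"):
    print(hit.source, hit.text, [(o.source, o.text) for o in others])
```

Showing the overlapping segments side by side is what would let a human pick the more accurate transcription when OCR and speech recognition disagree.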
16:15〜17:15#
i1_A3s (Room A): An attempt at readable accent notation for a Japanese-Slovenian dictionary for Japanese learners
I didn't expect to find scholars sharing their experience of building a Japanese-Slovenian dictionary, and since the talk involves processing UniDic, it's a must-see! (Also, I hadn't noticed that UniDic contains pitch accent information.)
i1_B3s: An attempt to extract candidate words for onomatopoeia using pattern matching - Using an onomatopoeia morphological transformation program -
It has been revealed that there are 61 types of morphological patterns for onomatopoeia appearing in modern Japanese written and spoken language, with about 2200 concrete forms.
https://clrd.ninjal.ac.jp/lrw/lrw2024/i1_B3s-paper.pdf
Research related to input methods...? My [[non-dictionary]] work and text input are actually very similar processes. I had only vaguely noticed that Japanese speakers are very flexible in how they use hiragana, and I didn't expect that onomatopoeia could be divided into 61 patterns.
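As a note to self, pattern matching for onomatopoeia candidates could look roughly like the sketch below. The three patterns are common shapes I picked for illustration (reduplication such as きらきら, the っ…り shape such as ぴったり, and a final っ such as どきっ); they are not the 61 patterns from the paper, and the names in `PATTERNS` are my own.

```python
import re

# Hiragana, katakana, and the long-vowel mark.
KANA = "ぁ-ゖァ-ヺー"

# Illustrative patterns only, not the paper's inventory.
PATTERNS = {
    "ABAB": re.compile(rf"([{KANA}]{{2}})\1"),
    "AっBり": re.compile(rf"[{KANA}][っッ][{KANA}][りリ]"),
    "ABっ": re.compile(rf"[{KANA}]{{2}}[っッ]"),
}


def match_patterns(token: str) -> list[str]:
    """Return the names of every pattern the whole token matches."""
    return [name for name, pattern in PATTERNS.items() if pattern.fullmatch(token)]


for token in ["きらきら", "ドキドキ", "ぴったり", "どきっ", "辞書"]:
    print(token, match_patterns(token))
```

Matches are only candidates: a word like いろいろ also fits the reduplication shape without being onomatopoeia, so a filtering step is still needed afterwards.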
i1_C2: Characteristics of English vocabulary not adopted as loanwords in Japanese
This presentation focuses on English words that have not been adopted into Japanese as loanwords and clarifies some of their characteristics. It is well known that many loanwords from English exist in modern Japanese. However, not all English words have become loanwords in Japanese; for example, frequently used articles like "a," adverbs like "as," and pronouns like "he" are not loanwords in Japanese (they are not included in the entries of Japanese-language dictionaries). In the results for the top 100 words, 49 were included in the entries of "Digital Daijisen," while 51 were not, an almost even split. When viewed by part of speech, all 8 nouns were included in the entries, while 5 out of 6 auxiliary verbs and 9 out of 12 pronouns were not included.
I previously answered a question on Zhihu [[Zhihu Answer: What are the Japanese words derived from English?]] https://www.zhihu.com/question/544356324/answer/2609385955. I originally planned to take the easy route in my thesis by analyzing the overlap between Japanese loanwords and the vocabulary of exams like the Chinese CET-4, IELTS, and TOEFL, but in the end I couldn't resist choosing the [[non-dictionary]] morphological analysis direction (it's a pity that I only ended up writing half of it, lol).
Day 2: August 29 (Thursday)#
09:20〜10:40#
i2_A1: Interim report on the construction of the "Japanese Game Corpus (JGC)" - Quantitative characteristics observed in early action games -
A game corpus?! A must-see! The selected games are all console games from Japanese manufacturers, both new and old (sadly, no Genshin Impact).
i2_A2: (Tentative) An attempt at Japanese research using "National Diet Library Digital Materials Full Text Data"
I'm curious about how academia searches for what they want using already publicly available databases.
i2_A3: Examination of "Classification Vocabulary Table" numbers as sense codes for polysemous words - Using the most important verbs from the "Basic Dictionary of Japanese for Computers IPAL" -
Several presentations at this workshop have used this "Classification Vocabulary Table," and I'm curious about what issues were considered during numbering.
i2_B3: Design, implementation, and operation of a Japanese morphological analysis system for pop-up dictionaries
It is said that hovering the mouse over a word to display the dictionary can enhance reading efficiency. To achieve this function, however, the string under the mouse pointer must be converted into dictionary form. Using a morphological analysis system like MeCab is one solution, but such systems often place specific performance demands on the user's computer, so they are usually run on servers. Morphological analysis in this setting differs from that for language research, machine translation, or full-text search, since its main purpose is to convert the input string into dictionary form. It is therefore possible to reduce the size of the morphological analysis system and implement it more efficiently. This paper discusses the design, implementation, and operation of a morphological analysis system specialized for dictionary retrieval, NonJishoKei.
It has been shown that automatically displaying dictionary explanations when hovering over words can effectively improve reading efficiency. To achieve this function, one problem must be solved: converting the text near the mouse pointer into dictionary form. Using a morphological analyzer like MeCab is one solution, but such systems often place high demands on the user's device, so they are typically run on servers. Unlike language research, machine translation, or full-text search, however, this scenario only requires converting the text near the mouse pointer into dictionary form. In other words, a streamlined morphological analyzer can be designed specifically for this usage scenario. The Japanese Non-Dictionary Form Dictionary (NonJishoKei) is a morphological analyzer designed for pop-up dictionary retrieval based on this idea, and this paper will discuss its algorithm principles and engineering implementation.
This is my own presentation (the secret is out, lol). The second paragraph above is a rewrite I did after submitting the original abstract, so the two differ quite a bit (囧).
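To illustrate the general idea of "only convert the string to dictionary form," here is a toy suffix-rule sketch. It is deliberately simplified and is not the actual NonJishoKei algorithm: the rule table, the headword set, and the function name are all made up for this example.

```python
# Hypothetical headword list; a real system would load these from a dictionary.
HEADWORDS = {"さらけ出す", "食べる", "書く", "読む"}

# (inflected ending, dictionary-form ending) pairs: a tiny illustrative subset.
SUFFIX_RULES = [
    ("して", "す"),   # godan verbs ending in す
    ("した", "す"),
    ("いて", "く"),   # godan verbs ending in く
    ("いた", "く"),
    ("んで", "む"),   # godan verbs ending in む
    ("んだ", "む"),
    ("て", "る"),     # ichidan verbs
    ("た", "る"),
    ("ます", "る"),
    ("ない", "る"),
]


def to_dictionary_forms(surface: str) -> list[str]:
    """Return every known headword reachable from the surface string by one rule."""
    candidates = []
    if surface in HEADWORDS:
        candidates.append(surface)
    for inflected, base in SUFFIX_RULES:
        if surface.endswith(inflected):
            candidate = surface[: -len(inflected)] + base
            # Keep only candidates that really exist as dictionary entries.
            if candidate in HEADWORDS and candidate not in candidates:
                candidates.append(candidate)
    return candidates


for word in ["さらけ出して", "食べた", "書いた", "読んで", "走った"]:
    print(word, to_dictionary_forms(word))
```

Checking each candidate against the headword list is what keeps the rule table small: wrong guesses such as さらけ出しる are simply discarded, and no part-of-speech tagging or segmentation is needed.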
i2_C2: TEachOtherS, a writing education support system as a learner corpus construction mechanism
(a) Provides learners with a web-based writing, commenting, and reflection environment, (b) Allows teachers to manage accounts for the entire class and control activity phases such as writing, commenting, and reflection, which can be applied to the entire class at once. In addition, it is assumed that learners will revise their writing based on comments received from others, and it has a version control function for writing. The results of writing education activities can be output in HTML format.
I am very interested in the implementation details of this system.
i2_C4: (Tentative) Trends in writing errors in handwritten kanji by high school students
About 70% of first-year students' essays contained kanji writing errors; the share fell in higher grades, reaching about 50% in the third year. Among the kanji used in more than 20 essays, the kanji with the highest error rate was "達," and about 40% of the essays containing "達" showed errors in its form.
Both the topic and the conclusions are intriguing.
10:50〜12:05#
o12: (Tentative) Characteristics of anime and game vocabulary from the perspective of misanalysis - Towards the creation of a vocabulary list -
Anime and games are one of the resources for Japanese learners, but the vocabulary used differs from that learned in the classroom. However, there is no vocabulary list, easy for both learners and teachers to use, that shows vocabulary by genre along with its frequency. Therefore, we decided to create a vocabulary list that can be utilized in Japanese education as a language resource. Scripts from anime and games tend to be misanalyzed when run through morphological analysis as is. To provide accurate data, we first conducted morphological analysis on four anime works and one game to confirm where and to what extent misanalysis occurs. As a result, misanalysis was found to occur at a rate of about 10%, most of it reflecting the characteristics of vocabulary in anime and games, including work-specific nouns, interjections, colloquial speech, and hesitations. This presentation will organize the morphological analysis procedures carried out towards the creation of a vocabulary list and consider methods of analysis that retain the characteristics of anime and games as much as possible.
https://clrd.ninjal.ac.jp/lrw/lrw2024/o12-paper.pdf
I am personally very interested in the direction and the issue of "misanalysis" pointed out, and also, the anime studied includes "Oshi no Ko" and "The Quintessential Quintuplets" (laughs).
o13: Overview of the "Children's Daily Conversation Corpus" monitor public version
A children's dialogue corpus? Looking forward to it!
13:00〜14:00#
Deepening linguistics through dialogue with generative AI
Presenter: Daiki Sano (Google LLC)
Wow, Google is impressive!
14:25〜15:25#
i3_A1: The relationship between rising-falling intonation and conversational forms - Using the "Japanese Daily Conversation Corpus" -
Presenter: Li Haiqi (Zhejiang University Japanese Department)
Opinions differ on the situations in which rising-falling intonation, a type of sentence-final intonation, is used. According to a summary based on introspection and data, rising-falling intonation tends to be used in somewhat formal situations. However, based on impression evaluation and usage rate statistics from monologue data, rising-falling intonation is often used in casual speech.
https://clrd.ninjal.ac.jp/lrw/lrw2024/i3_A1-paper.pdf
The conclusions are very interesting.
i3_A2: (Tentative) Differences in speech speed by daily conversation situations
This presentation reports on the results of investigating how speech speed can vary depending on the conversation situation and conversation partner.
https://clrd.ninjal.ac.jp/lrw/lrw2024/i3_A2-paper.pdf
The title alone piqued my interest.
i3_A3: Pronunciation of the /ei/ vowel sequence in Japanese
Presenter: Katarina Hitomi Gerl (University of Ljubljana, Faculty of Arts, Japanese Studies)
According to various dictionaries, the /ei/ vowel sequence in Japanese is pronounced as a long "e" when it does not span a break in meaning.
The topic is very intriguing.
i3_B3: Construction of a Slovenian-Japanese learning dictionary based on dictionary inversion and open data
Presenters: Kristina Hmeljak Sangawa (University of Ljubljana), Laura Barovič Božjak, Nadja Bostič, Katarina Hitomi Gerl, Jan Hrastnik, Nina Kališnik, Sara Kleč, Eva Kovač, Nina Sangawa Hmeljak, Jure Tomše, and Tomaž Erjavec
Japanese language learning is popular in Slovenia, but there are still few reference books. Therefore, we attempted to invert the data of the previously edited Japanese-Slovenian dictionary and utilize open data to construct a Slovenian-Japanese learning dictionary. First, we extracted equivalent words for each meaning from the Japanese-Slovenian dictionary, rearranged them with Slovenian as the headword, then manually removed duplicates and inappropriate headwords, and automatically assigned part of speech and CEFR-compliant difficulty levels, along with example sentences for some headwords. Using the collaborative editing software Lexonomy, we manually assigned meaning hints and positional labels for polysemous headwords, and some headwords were also accompanied by example sentences from parallel corpora. The approximately 8,500-word dictionary data constructed in this way was made publicly available as TEI Lex-0 compliant XML data. Learners who participated in the project reported that they gained knowledge about the structure of dictionaries, and we plan to continue editing in the same manner in the future.
The introduction is very appealing to me, and I look forward to the upcoming presentation.
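The inversion step described above can be sketched in a few lines. This is just my own toy version under simple assumptions: the four entries are made up for illustration, the nested-list layout (headword, then senses, then equivalents) is hypothetical, and the real project also involved manual cleanup that no script replaces.

```python
from collections import defaultdict

# Hypothetical Japanese-Slovenian entries: headword -> list of senses,
# each sense being a list of Slovenian equivalents.
JA_SL = {
    "水": [["voda"]],
    "本": [["knjiga"]],
    "家": [["hiša", "dom"]],
    "うち": [["dom", "hiša"]],
}


def invert(ja_sl: dict[str, list[list[str]]]) -> dict[str, list[str]]:
    """Rearrange the entries so that each Slovenian equivalent becomes a headword."""
    sl_ja = defaultdict(list)
    for japanese, senses in ja_sl.items():
        for equivalents in senses:
            for slovenian in equivalents:
                # Skip duplicates when the same pair appears in several senses.
                if japanese not in sl_ja[slovenian]:
                    sl_ja[slovenian].append(japanese)
    # Plain code-point sort; proper Slovenian collation would need a locale-aware key.
    return dict(sorted(sl_ja.items()))


for headword, translations in invert(JA_SL).items():
    print(headword, translations)
```

The inverted entries come out flat, which is presumably why the project then had to add meaning hints by hand for polysemous headwords.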
i3_C2: Personal emergencies: Analysis of "wait" on X (Twitter)
This analysis focuses on the usage and characteristics of the imperative "wait" written as the sender's own words without accompanying other elements representing the subject or object in the same sentence on X (Twitter). Observations of examples posted in the last 60 minutes revealed that such "wait" is used more frequently than similar expressions like "look" and "listen," and is often used in "tweets" (posts) that do not have a specific recipient. Furthermore, it is believed that such "wait" often co-occurs with the sender's emotions or evaluations, indicating that "there is some event that shakes emotions or evaluations, and it is an emergency situation that literally requires the sender to wait." Additionally, comparisons were made with examples from Yahoo! Blogs and LINE chats, suggesting that such "wait" is particularly likely to be used on X (Twitter).
The analytical subject is very interesting.
15:35〜16:50#
o15: A corpus-based cognitive semantic analysis of the polysemy of the Japanese temperature adjective tsumetai
Presenters: Wang Haitao (Kyoto University), Huang Haihong (Kyoto University), Zhong Yong (Nanjing University of Aeronautics and Astronautics)
Chinese researchers submitting an English paper on Japanese...? I'm curious which language the presentation itself will be in (lol).
o16: The use of sentence-final forms in distinguishing character dialogue in novels
This paper attempts to collect, organize, and analyze sentence-final forms from the dialogues of 24 characters appearing in 10 entertainment novels and light novels.
I thought the title was about analyzing some classic Japanese literature, but the introduction turned out to be "analyzing the language styles of different characters in 10 light novels," which instantly caught my attention. Upon opening the paper, I found that one of the analyzed works is "Rascal Does Not Dream of Bunny Girl Senpai"! Moreover, there are also newer works like "Frieren: Beyond Journey's End"… So can I expect someone to analyze "MyGO" at next year's seminar? (just kidding)