Conference Agenda_Language Resources Workshop 2024

The online seminar on Japanese corpora/computational linguistics "Language Resource Workshop 2024" is here!

Language Resource Workshop 2024#

An online seminar on corpora and computational linguistics hosted by the National Institute for Japanese Language and Linguistics. Attendance is free, but please fill out the registration form on the official website beforehand: https://clrd.ninjal.ac.jp/lrw2024.html

Below I list some of the presentations I am interested in; the complete programme can be viewed on the official website: https://clrd.ninjal.ac.jp/lrw2024-programme.html

The schedule for the academic conference "68th Annual Meeting of the Society for Quantitative Linguistics," hosted by the National Institute for Japanese Language and Linguistics, has also been released. It is held in person, so visit the official website for more information if you are interested.

https://sites.google.com/view/mathling2024/%E3%83%9B%E3%83%BC%E3%83%A0

Day 1: August 28 (Wednesday)#

09:30〜10:45#

o01: [[The occurrence of "sentence inclusion" in conversation data]]

https://clrd.ninjal.ac.jp/lrw/lrw2024/o01-paper.pdf

What is 【sentence inclusion】: Expressions such as "hurry up aura," "I'm trying hard appeal," and "let's start the Pokémon card game campaign" are distinctive linguistic phenomena that deviate from general word-formation rules: an element equivalent to a "sentence," a unit larger than a word that should not normally fit inside one, occurs within a word (referred to as "sentence inclusion" in this presentation).

I have collected a large number of example sentences from anime subtitles while researching [[non-dictionary]], many of which do not conform to standard Japanese grammar and are similar to the 【sentence inclusion】 this presentation wants to discuss. I want to see how academia views these less standard examples.

10:55〜12:10#

o04s: [[Verification of the effectiveness of large-scale language models for meaning classification of katakana words]]

This paper reports on the methods and results of classifying the meanings of katakana words in context using LLMs.
https://clrd.ninjal.ac.jp/lrw/lrw2024/o04s-paper.pdf

Meaning classification? I'm curious how it's done; I once designed a prompt along these lines myself:

# Role: Dictionary Query Assistant

## Profile

- Author: NoHeartPen
- Version: 0.1
- Description: The Dictionary Query Assistant searches for the meaning closest to the context from the complete explanations provided by authoritative dictionaries.

## Rules
1. Respect the original text; do not translate the complete explanations provided by the dictionary, and do not modify the complete explanations provided by the dictionary.
2. When the context contains usages not included in the dictionary, return "The dictionary has not recorded this usage." At other times, no additional explanation is needed; just return the dictionary explanation.

## Workflow
1. Ask the user to provide context in the format "Context: [], Word to query: [], Complete dictionary explanation: []".
2. Analyze the closest explanation in the complete dictionary explanation provided by the user to the context and the word to be queried.
3. Only return the relevant explanation closest to the context; do not return other explanations unrelated to the context.
4. No need to translate the dictionary explanation or provide any additional explanation.

## Initialization
As the role <Role>, strictly adhere to <Rules>, and warmly welcome the user. Then introduce yourself and inform the user about <Workflow>.

## Example
Context: [全部さらけ出して], Word to query: [さらけ出して], Complete dictionary explanation: [さらけ‐だ・す【×曝け出す】  
［動サ五（四）］  
① 隠すところなく、すべてを現す。ありのままを見せる。「内情を―・す」「弱点を―・す」  
② 追い出す。  
「おらあ女房を―・してしまって」〈滑・膝栗毛・発端〉]
Your answer: ① 隠すところなく、すべてを現す。ありのままを見せる。「内情を―・す」「弱点を―・す」

(Note: This prompt performs poorly on GPT-3.5 and many domestic (Chinese) AI models, but works well on GPT-4o mini, which can quickly pick out the closest sense from the many senses listed in an authoritative dictionary such as "Daijisen." With a slightly modified example, it also gives a good experience when using a domestic AI to look up English words in the "Oxford Advanced Learner's English-Chinese Dictionary.")
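
A minimal sketch of how the user message for the prompt above could be assembled. The function name `build_query` is hypothetical, the dictionary entry is abbreviated, and the step of actually sending the text to a model is left out on purpose, since that depends on whichever service is used.

```python
def build_query(context: str, word: str, dictionary_entry: str) -> str:
    """Fill the 'Context / Word to query / Complete dictionary explanation' template."""
    return (
        f"Context: [{context}], "
        f"Word to query: [{word}], "
        f"Complete dictionary explanation: [{dictionary_entry}]"
    )


# Abbreviated version of the entry used in the prompt's own example.
entry = "さらけ‐だ・す【×曝け出す】\n① 隠すところなく、すべてを現す。\n② 追い出す。"
query = build_query("全部さらけ出して", "さらけ出して", entry)
print(query)
```

Keeping the template in one function makes it easy to swap in a different dictionary entry or context without touching the system prompt.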

o06s: Analysis of structural patterns of noun clauses containing Sino-Japanese (kango) verbal nouns - Based on BCCWJ data -

When Sino-Japanese (kango) verbal nouns are used within noun clauses, there are at least three structural patterns: verb type ("kango + suru/shita"), noun type ("kango + no"), and adjective type ("kango + teki/teki na/na"). The results confirmed that (1) the verb-type pattern is clearly the typical one, (2) the noun-type pattern is subject to constraints, and (3) the adjective-type pattern is exceptional. It was also revealed that factors such as the part of speech of the verbal noun, the usage environment, the meaning category, and the era influence which pattern is selected.

https://clrd.ninjal.ac.jp/lrw/lrw2024/o06s-paper.pdf

Among the papers my advisor recommended while I was writing my graduation thesis was an article by this author, so I didn't expect to run into their work again here; both the direction and the conclusions are quite interesting.

14:10〜15:50#

o07s: Construction of the "Chinese Video Audio Corpus" - Aiming for accurate transcription through multiple modalities

https://clrd.ninjal.ac.jp/lrw/lrw2024/o07s-paper.pdf

I originally planned to write something similar to [[Conan Bilingual Corpus]], but I really didn't have time to do it before finishing [[Easy Query]], so I want to see what technology stack they used and what their needs are.

Chinese videos uploaded to video-sharing sites typically have subtitles embedded as image data within the video frames. To enable the collection of a wider range of texts when creating a Chinese corpus, it is necessary to use text recognition or speech recognition methods on the videos. In this study, we will implement an application that allows simultaneous display and search of text obtained from multiple resources, such as OCR for embedded subtitles, speech recognition for audio, and subtitles prepared by video creators. We will also attempt to collect several genres and conduct language analysis.
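
To get a feel for what "simultaneous display and search" across OCR, speech recognition, and creator subtitles might involve, here is a small sketch. It is only an assumption about the general idea, not the application described in the paper: the `Segment` type, the source labels, and the sample sentences are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    start: float  # start time in seconds
    end: float    # end time in seconds
    source: str   # "ocr", "asr", or "creator" (hypothetical labels)
    text: str


def overlaps(a: Segment, b: Segment) -> bool:
    """True if the two segments share any stretch of time."""
    return a.start < b.end and b.start < a.end


def search(segments: list[Segment], keyword: str):
    """Return each hit together with the time-overlapping segments from other sources."""
    results = []
    for hit in segments:
        if keyword in hit.text:
            parallel = [s for s in segments
                        if s.source != hit.source and overlaps(s, hit)]
            results.append((hit, parallel))
    return results


segments = [
    Segment(0.0, 2.0, "ocr", "今天天气很好"),
    Segment(0.1, 2.2, "asr", "今天天气很好啊"),
    Segment(0.0, 2.0, "creator", "今天天气真好"),
    Segment(5.0, 7.0, "ocr", "我们出去玩吧"),
]

for hit, parallel in search(segments, "很好"):
    print(hit.source, hit.text, "|", [(s.source, s.text) for s in parallel])
```

Showing the overlapping segments next to each hit makes disagreements between the sources visible, which is presumably where manual correction of the transcription would start.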

16:15 〜 17:15#

i1_A3s (Room A): An attempt at readable accent notation for a Japanese-Slovenian dictionary for Japanese learners

https://clrd.ninjal.ac.jp/lrw/lrw2024/i1_A3s-paper.pdf

I didn't expect to find scholars sharing their experience of building a Japanese-Slovenian dictionary, and since that experience involves processing UniDic, this one is a must-see! (I also hadn't noticed that UniDic contains accent information.)

i1_B3s: An attempt to extract candidate words for onomatopoeia using pattern matching - Using an onomatopoeia morphological transformation program -

It has been revealed that there are 61 types of morphological patterns of onomatopoeia appearing in modern Japanese written and spoken language, with about 2200 actual forms.
https://clrd.ninjal.ac.jp/lrw/lrw2024/i1_B3s-paper.pdf
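
As a rough idea of what extracting candidates by pattern matching can look like, the sketch below checks a token against two well-known onomatopoeia shapes. These two regular expressions and the labels "ABAB" and "AっBり" are my own illustrative choices; they are not taken from the 61 patterns in the paper.

```python
import re

# Hiragana, katakana, and the long-vowel mark.
KANA = "[ぁ-ゖァ-ヺー]"

# Two illustrative shapes only; the label names are hypothetical.
PATTERNS = {
    "ABAB": re.compile(rf"({KANA}{{2}})\1"),     # e.g. きらきら, ドキドキ
    "AっBり": re.compile(rf"{KANA}っ{KANA}り"),  # e.g. びっくり, ゆっくり
}


def classify(token: str):
    """Return the name of the first pattern the whole token matches, else None."""
    for name, pattern in PATTERNS.items():
        if pattern.fullmatch(token):
            return name
    return None


for token in ["きらきら", "ドキドキ", "びっくり", "たべる"]:
    print(token, classify(token))
```

Shape-based matching only produces candidates: a word like いろいろ also fits the reduplicated shape without being onomatopoeia, so some filtering step is still needed afterwards.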

Research related to input methods...? Building my own [[non-dictionary]] and processing typed text are actually very similar tasks. I had only vaguely noticed that Japanese speakers are very flexible in how they use hiragana, and I didn't expect that onomatopoeia could be divided into as many as 61 types.

i1_C2: Characteristics of English vocabulary not adopted as loanwords in Japanese

This presentation focuses on English words that have not been adopted as loanwords in Japanese and clarifies some of their characteristics. It is well known that modern Japanese contains many loanwords from English. However, not every English word has become a loanword; for example, frequently used words such as the article "a," the adverb "as," and the pronoun "he" have not (they do not appear as entries in Japanese-language dictionaries). Of the top 100 words examined, 49 had entries in the "Digital Daijisen" and 51 did not, an almost even split. By part of speech, all 8 nouns had entries, whereas 5 of the 6 auxiliary verbs and 9 of the 12 pronouns did not.
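
The figures quoted above are easier to compare as inclusion rates. The short calculation below uses only the counts given in the abstract; the dictionary layout and variable names are just for illustration.

```python
# (words with a dictionary entry, words examined), taken from the abstract above.
counts = {
    "top 100 words": (49, 100),
    "nouns": (8, 8),
    "auxiliary verbs": (1, 6),   # 5 of 6 had no entry
    "pronouns": (3, 12),         # 9 of 12 had no entry
}

rates = {label: included / total for label, (included, total) in counts.items()}

for label, (included, total) in counts.items():
    print(f"{label}: {included}/{total} have entries ({rates[label]:.0%})")
```

Seen this way, the split is sharp: content words such as nouns are fully covered, while function words are mostly absent.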

I previously answered a question on Zhihu [[What are the Japanese words derived from English]] https://www.zhihu.com/question/544356324/answer/2609385955, and I originally planned to slack off in my graduation thesis: analyzing the intersection of Japanese loanwords and vocabulary from exams like the Chinese CET-4, IELTS, TOEFL, etc., but ultimately couldn't resist choosing the direction of [[non-dictionary]] morphological analysis (it's a pity I only wrote a half-finished piece 2333).

Day 2: August 29 (Thursday)#

9:20 〜 10:40#

i2_A1: Interim report on the construction of the "Japanese Game Corpus (JGC)" - Quantitative characteristics observed in early action games -

https://clrd.ninjal.ac.jp/lrw/lrw2024/i2_A1-paper.pdf

A game corpus?! A must-see! Additionally, the selected games are all console games from Japanese manufacturers, both new and old (unfortunately, there is no Genshin Impact, what a pity).

i2_A2: (Tentative) An attempt at Japanese research using "National Diet Library Digital Materials Full Text Data"

I'm curious how academia searches for what they want using publicly available databases.

i2_A3: Examination of polysemous codes for "Classification Vocabulary Table" - Using the most important verbs from the "Basic Dictionary of Japanese for Computers IPAL" -

Several presentations at this seminar have used the "Classification Vocabulary Table," and I'm curious about the issues considered during numbering.

i2_B3: Design, implementation, and operation of a Japanese morphological analysis system for popup dictionaries

It is said that displaying a dictionary entry when the mouse hovers over a word can improve reading efficiency. However, to achieve this function, it is necessary to solve the problem of converting the string under the mouse pointer into dictionary form. Using a morphological analysis system such as MeCab is one solution, but such systems often demand a certain level of performance from the user's computer, so they are usually run on servers. However, morphological analysis for this purpose differs from that for language research, machine translation, or full-text search, because the main goal is only to convert the input string into dictionary form. It is therefore possible to shrink the morphological analysis system and implement it more efficiently. This paper discusses the design, implementation, and operation of NonJishoKei, a morphological analysis system specialized for dictionary lookup.
It has been proven that automatically displaying dictionary explanations when the mouse hovers over a word can effectively improve reading efficiency. However, to realize this function, one problem must be solved: converting the text near the mouse pointer into dictionary form. Using a morphological analyzer such as MeCab is one solution, but these systems often place high demands on the user's device, so they are usually run on servers. Unlike language research, machine translation, or full-text search, however, this scenario only requires converting the text near the mouse pointer into dictionary form. In other words, a streamlined morphological analyzer can be designed specifically for this kind of use. The Japanese Non-Dictionary Morphological Dictionary (NonJishoKei) is based on this idea, and this paper discusses its algorithmic principles and engineering implementation.
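
To illustrate the "convert to dictionary form" step in general terms, here is a small sketch that generates candidate dictionary forms with suffix-replacement rules and keeps those found in a headword list. This is not NonJishoKei's actual algorithm: the rule table, the function names, and the tiny headword set are hypothetical and cover only a handful of inflections.

```python
# Hypothetical suffix rules: (inflected ending, dictionary-form ending).
RULES = [
    ("して", "す"),    # さらけ出して -> さらけ出す
    ("って", "う"),    # 買って -> 買う
    ("って", "つ"),    # 待って -> 待つ
    ("って", "る"),    # 取って -> 取る
    ("いて", "く"),    # 書いて -> 書く
    ("んで", "む"),    # 読んで -> 読む
    ("て", "る"),      # 食べて -> 食べる
    ("ました", "る"),  # 食べました -> 食べる
    ("ない", "る"),    # 食べない -> 食べる
]


def candidates(surface: str):
    """Yield the surface string itself plus every rule-based rewrite of its ending."""
    yield surface
    for ending, replacement in RULES:
        if surface.endswith(ending):
            yield surface[: -len(ending)] + replacement


def lookup(surface: str, headwords: set[str]) -> list[str]:
    """Keep only the candidates that exist as dictionary headwords."""
    unique = dict.fromkeys(candidates(surface))  # de-duplicate, keep order
    return [c for c in unique if c in headwords]


HEADWORDS = {"さらけ出す", "待つ", "食べる", "書く"}

for word in ["さらけ出して", "待って", "食べました", "書いて"]:
    print(word, "->", lookup(word, HEADWORDS))
```

Over-generating candidates and letting the headword list filter them keeps the rule table small, which fits the goal of running on the user's device rather than on a server.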

My own presentation (the truth is revealed 2333), the translation is a rewrite after I submitted the original text, so it differs quite a bit (囧).

i2_C2: TEachOtherS, a writing education support system as a learner corpus construction mechanism

(a) Provides learners with a web-based environment for writing, commenting, and reflection, (b) Allows teachers to manage accounts for the entire class and control activity phases such as writing, commenting, and reflection, applying them collectively to the whole class. In addition, it is assumed that revisions will be made based on comments received from others, and it has a version management function for writing. The results of writing education activities can also be output in HTML format.

I'm very interested in the implementation details of this system.

i2_C4: (Tentative) Trends in writing errors in handwritten kanji by high school students

In the first year, about 70% of students' compositions contained kanji writing errors, but the proportion fell as the grade level rose, dropping to about 50% in the third year. Among the kanji used in more than 20 compositions, the one with the highest error rate was "達": about 40% of the compositions that used it contained errors in its character form.

Both the question examined and the conclusions are very interesting.

10:50〜12:05#

o12: (Tentative) Characteristics of anime and game vocabulary from the perspective of misanalysis - Towards the creation of a vocabulary list -

Anime and games are one of the resources for Japanese learners, but the vocabulary they use differs from what is learned in the classroom. However, there is no vocabulary list showing genre-specific vocabulary and its frequency that is easy for both learners and teachers to use. Therefore, we decided to create a vocabulary list that can be used in Japanese education. Scripts from anime and games tend to be misanalyzed when subjected to morphological analysis. Aiming to provide accurate data, we first ran morphological analysis on four anime works and one game to confirm where and to what extent misanalysis occurs. As a result, we found a misanalysis rate of about 10%, most of which reflects the characteristics of vocabulary in anime and games, including proper nouns, interjections, colloquial language, and hesitations. This presentation will organize the morphological analysis procedures used in creating the vocabulary list and consider methods of analysis that retain the characteristics of anime and games as much as possible.
https://clrd.ninjal.ac.jp/lrw/lrw2024/o12-paper.pdf

I am personally very interested in the direction and the issue of 【misanalysis】 pointed out, and also, the anime studied includes 【Oshi no Ko】 and 【The Quintessential Quintuplets】 (laughs).

o13: Overview of the "Children's Daily Conversation Corpus" monitor public version

https://clrd.ninjal.ac.jp/lrw/lrw2024/o13-paper.pdf

A children's dialogue corpus? Looking forward to it!

13:00〜14:00#

Deepening linguistics through dialogue with generative AI
Presenter: Taiki Sano (Google LLC)

Wow, Google is impressive!

14:25〜15:25#

i3_A1: The relationship between rising and falling intonation and conversation format - Using the "Japanese Daily Conversation Corpus" -

Presenter: Li Haiqi (Zhejiang University Japanese Department)
Opinions differ on the situations in which rising and falling intonation, a type of sentence-final intonation, is used. Summaries based on introspection and written materials say that it tends to be used in slightly formal situations. However, impression assessments and usage-rate statistics based on monologue data indicate that it is used frequently in casual speech.
https://clrd.ninjal.ac.jp/lrw/lrw2024/i3_A1-paper.pdf

The conclusions are very interesting.

i3_A2: (Tentative) Differences in speech speed by conversation situation

This presentation reports on the results of investigating how speech speed can change depending on the conversation situation and conversation partner.
https://clrd.ninjal.ac.jp/lrw/lrw2024/i3_A2-paper.pdf

The title itself piqued my interest.

i3_A3: Pronunciation of the /ei/ vowel sequence in Japanese

Presenter: Katarina Hitomi Gerl (University of Ljubljana, Faculty of Arts, Japanese Studies)
According to various dictionaries, the /ei/ vowel sequence in Japanese is pronounced as a long "e" when it does not straddle a boundary between units of meaning.

The question it addresses is very interesting.

i3_B3: Construction of a Slovenian-Japanese learning dictionary based on dictionary inversion and open data
Presenters: Kristina Hmeljak Sangawa (University of Ljubljana), Laura Barovič Božjak, Nadja Bostič, Katarina Hitomi Gerl, Jan Hrastnik, Nina Kališnik, Sara Kleč, Eva Kovač, Nina Sangawa Hmeljak, Jure Tomše, and Tomaž Erjavec
Japanese language learning is popular in Slovenia, but there are still few reference books. Therefore, we attempted to invert the data of a previously edited Japanese-Slovenian dictionary and to use open data to construct a Slovenian-Japanese learning dictionary. First, we extracted the equivalents for each meaning from the Japanese-Slovenian dictionary, rearranged them with Slovenian as the headword, manually removed duplicates and inappropriate headwords, and automatically assigned parts of speech and CEFR-compliant difficulty levels to the headwords, with some including example sentences. Using the collaborative editing software Lexonomy, we manually added meaning hints and positional labels to polysemous headwords, and some headwords were also accompanied by example sentences from parallel corpora. The approximately 8,500 entries constructed in this way were made publicly available as TEI Lex0 compliant XML data. Learners participating in the project reported that they gained knowledge about the structure of the dictionary, and we plan to continue editing in the same manner in the future.
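
The core "dictionary inversion" step can be sketched in a few lines. This is a toy illustration under my own assumptions, not the project's pipeline: the four sample entries and the function name `invert` are made up, and the real work (removing inappropriate headwords, adding parts of speech, CEFR levels, and examples) is not shown.

```python
from collections import defaultdict

# Toy Japanese -> Slovenian entries, invented for this example.
ja_sl = {
    "犬": ["pes"],
    "猫": ["mačka"],
    "家": ["hiša", "dom"],
    "うち": ["dom"],
}


def invert(entries: dict[str, list[str]]) -> dict[str, list[str]]:
    """Regroup the entries so that each Slovenian equivalent becomes a headword."""
    inverted = defaultdict(list)
    for japanese, equivalents in entries.items():
        for slovenian in equivalents:
            if japanese not in inverted[slovenian]:  # skip duplicates
                inverted[slovenian].append(japanese)
    return dict(sorted(inverted.items()))


sl_ja = invert(ja_sl)
for headword, translations in sl_ja.items():
    print(headword, "->", translations)
```

Even in this toy case the inverted headword "dom" collects two Japanese words, which hints at why the inverted entries then need manual meaning hints to tell the translations apart.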

The introduction is very appealing to me, and I look forward to the upcoming presentation.

i3_C2: Personal emergencies: Analysis of "wait" on X (Twitter)

This study focused on the usage and characteristics of the imperative "wait," written as the sender's own words without accompanying other elements representing the subject or object in the same sentence. Based on observations of examples posted in the last 60 minutes, it was found that such "wait" is used more frequently than similar expressions like "look" and "listen," and is often used in "tweets" (posts) without specific recipients. Furthermore, it is believed that such "wait" often co-occurs with the sender's (writer's) emotions or evaluations, indicating that "there is some event that shakes emotions or evaluations, and it is an emergency situation that literally requires the sender (writer) to wait." Additionally, comparisons were made with examples from Yahoo! Blogs and LINE chats, suggesting that such "wait" is particularly likely to be used on X (Twitter).

https://clrd.ninjal.ac.jp/lrw/lrw2024/i3_C2-paper.pdf

The analytical subject is very interesting.

15:35〜16:50#

o15: A corpus-based cognitive semantic analysis of the polysemy of the Japanese temperature adjective tsumetai

Presenters: Wang Haitao (Kyoto University), Huang Haihong (Kyoto University), Zhong Yong (Nanjing University of Aeronautics and Astronautics)

https://clrd.ninjal.ac.jp/lrw/lrw2024/o15-paper.pdf

A Chinese person submitted a paper in English about Japanese...? I'm curious what language will be used for the presentation at that time 2333.

o16: The use of sentence-final forms in distinguishing characters' dialogue in novels

This paper attempts to collect, organize, and analyze the sentence-final forms from the dialogues of 24 characters appearing in 10 entertainment novels and light novels.

https://clrd.ninjal.ac.jp/lrw/lrw2024/o16-paper.pdf

I thought the title was about analyzing some classic Japanese literature, but the introduction turned out to be "analyzing the language styles of different characters in 10 light novels," which instantly energized me. Upon opening the paper, I found that one of the analyzed works is "Bunny Girl Senpai"! Moreover, there are also recent works like "Frieren: Beyond Journey's End"… so can I expect someone to analyze "MyGo" at next year's seminar? (just kidding)