Conference Agenda_Language Resources Workshop 2024

The online seminar on Japanese corpus linguistics "Language Resources Workshop 2024" is here!

Language Resources Workshop 2024#

This is an online seminar on corpora and computational linguistics hosted by the National Institute for Japanese Language and Linguistics (NINJAL). Attendance is free, but please fill out the registration form on the official website beforehand: https://clrd.ninjal.ac.jp/lrw2024.html

Below are some of the presentations I am interested in; the complete program is on the official website: https://clrd.ninjal.ac.jp/lrw2024-programme.html

Additionally, the schedule for the academic conference "68th Annual Meeting of the Society for Quantitative Linguistics," hosted by the National Institute for Japanese Language and Linguistics, has also been released. It will be held in person, so check the official website if you are interested:

https://sites.google.com/view/mathling2024/%E3%83%9B%E3%83%BC%E3%83%A0

Day 1: August 28 (Wednesday)#

09:30〜10:45#

o01: [[The Occurrence of "Inclusion of Sentences" in Conversation Data]]

https://clrd.ninjal.ac.jp/lrw/lrw2024/o01-paper.pdf

What 【Inclusion of Sentences】 means: expressions such as "Hurry up aura," "I'm trying hard appeal," and "Let's start the Pokémon card game campaign" contain sentence-like elements inside a word. This is an unusual phenomenon because it deviates from the general word-formation rule that units larger than a word cannot occur inside a word (the presentation calls it "Inclusion of Sentences").

I have collected a large number of example sentences from anime subtitles while researching [[Non-dictionary]], many of which do not conform to standard Japanese grammar and are similar to the "Inclusion of Sentences" discussed in this presentation. I would like to see how academia views these less standard example sentences.

10:55〜12:10#

o04s: [[Verification of the Effectiveness of Large-scale Language Models for Meaning Classification of Katakana Words]]

This paper reports on the methods and results of meaning classification of Katakana words in context using LLM.
https://clrd.ninjal.ac.jp/lrw/lrw2024/o04s-paper.pdf

Meaning classification? I'm curious about how it's done; I once designed a prompt in this direction:

# Role: Dictionary Query Assistant

## Profile

- Author: NoHeartPen
- Version: 0.1
- Description: The Dictionary Query Assistant searches for the closest meaning in context from the complete explanations provided by authoritative dictionaries.

## Rules
1. Respect the original text; do not translate the complete explanations provided by the dictionary, and do not modify them.
2. When a usage not yet included in the dictionary appears in the context, return "The dictionary has not included this usage." At other times, no additional explanation is needed; just return the dictionary explanation.

## Workflow
1. Ask the user to provide context in the format "Context: [], Word to query: [], Complete dictionary explanation: []".
2. Analyze the closest explanation in the context from the complete dictionary explanation provided by the user.
3. Only return the relevant explanation closest to the context; do not return unrelated explanations.
4. No need to translate the dictionary explanation or provide any additional notes.

## Initialization
As the role <Role>, strictly adhere to <Rules>, and warmly welcome the user. Then introduce yourself and explain <Workflow>.

## Example
Context: [全部さらけ出して], Word to query: [さらけ出して], Complete dictionary explanation: [さらけ‐だ・す【×曝け出す】  
［動サ五（四）］  
① 隠すところなく、すべてを現す。ありのままを見せる。「内情を―・す」「弱点を―・す」  
② 追い出す。  
「おらあ女房を―・してしまって」〈滑・膝栗毛・発端〉]
Your answer: ① 隠すところなく、すべてを現す。ありのままを見せる。「内情を―・す」「弱点を―・す」

(Note: This prompt performs poorly on GPT-3.5 and many Chinese-made AI models, but works well on GPT-4o mini, allowing quick lookup of the closest meaning in authoritative dictionaries like "Daijisen." With a slight modification of the example, it also works well with Chinese-made AI models for looking up English words in the "Oxford Advanced Learner's English-Chinese Dictionary.")
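As a rough illustration of how the prompt above might be used from a script, the sketch below only formats one query in the layout that Workflow step 1 asks for and checks that a returned answer was copied from the dictionary entry unchanged (Rule 1). The helper names `build_query` and `is_acceptable_answer` are hypothetical, and no model API is called here.

```python
# Fallback text that Rule 2 of the prompt asks the model to return.
NOT_FOUND = "The dictionary has not included this usage."


def build_query(context: str, word: str, dictionary_entry: str) -> str:
    """Format one lookup in the layout requested by Workflow step 1."""
    return (
        f"Context: [{context}], "
        f"Word to query: [{word}], "
        f"Complete dictionary explanation: [{dictionary_entry}]"
    )


def is_acceptable_answer(answer: str, dictionary_entry: str) -> bool:
    """True if the answer is the fallback text or a verbatim part of the entry."""
    answer = answer.strip()
    if answer == NOT_FOUND:
        return True
    return answer != "" and answer in dictionary_entry


entry = (
    "さらけ‐だ・す【×曝け出す】\n"
    "［動サ五（四）］\n"
    "① 隠すところなく、すべてを現す。ありのままを見せる。\n"
    "② 追い出す。"
)
message = build_query("全部さらけ出して", "さらけ出して", entry)
print(message)
```

Checking that the answer is a substring of the entry is a cheap way to catch a model that paraphrases or translates the explanation instead of quoting it.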

o06s: Analysis of Structural Patterns of Noun Clauses Containing Sino-Japanese Verbal Nouns - Based on BCCWJ Data

When Sino-Japanese verbal nouns are used within noun clauses, there are at least three structural patterns: verb type ("Sino-Japanese word + suru/shita"), noun type ("Sino-Japanese word + no"), and adjective type ("Sino-Japanese word + teki/teki na/na"). The results confirm that (1) the verb-type pattern is clearly the typical one, (2) the noun-type pattern is subject to constraints, and (3) the adjective-type pattern is exceptional. Additionally, factors such as the part of speech of the Sino-Japanese verbal noun, usage environment, semantic category, and era also influence the selection of each pattern.

https://clrd.ninjal.ac.jp/lrw/lrw2024/o06s-paper.pdf

One of the papers my supervisor recommended while I was writing my graduation thesis was by this author, and I didn't expect to come across their work again here; the direction and conclusions are quite interesting.

14:10〜15:50#

o07s: Construction of the "Chinese Video Audio Corpus" - Aiming for Accurate Transcription through Multiple Modalities

https://clrd.ninjal.ac.jp/lrw/lrw2024/o07s-paper.pdf

I originally planned to write something similar to [[Conan Bilingual Corpus]], but I really didn't have time to work on it before finishing [[Easy to Check]], so I want to see what technology stack they used and what their needs are.

For Chinese videos uploaded to video-sharing sites, subtitles are generally embedded as image data within the video frames. To enable the collection of a broader range of texts when creating a Chinese corpus, it is necessary to use text recognition or speech recognition methods on the videos. In this study, we will implement an application that allows simultaneous display and search of text obtained from multiple resources, such as OCR for embedded subtitles, speech recognition for audio, and subtitles prepared by video creators. We will also attempt to collect several genres and conduct linguistic analysis.
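To make the idea of "simultaneous display and search of text obtained from multiple resources" more concrete, here is a minimal sketch that stores time-stamped segments from OCR, speech recognition, and creator subtitles in one list, searches them by keyword, and pulls up the overlapping segments from the other sources. The `Segment` structure, the source labels, and the sample sentences are my own assumptions, not the design described in the paper.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    source: str   # hypothetical labels: "ocr", "asr", or "creator"
    start: float  # start time in seconds
    end: float    # end time in seconds
    text: str


def search(segments, keyword):
    """Return every segment whose text contains the keyword, in time order."""
    hits = [s for s in segments if keyword in s.text]
    return sorted(hits, key=lambda s: (s.start, s.source))


def overlapping(segments, hit):
    """Return segments from other sources that overlap the hit in time."""
    return [
        s for s in segments
        if s.source != hit.source and s.start < hit.end and hit.start < s.end
    ]


# Toy data for illustration only.
segments = [
    Segment("ocr", 1.0, 3.0, "今天天气很好"),
    Segment("asr", 1.2, 3.1, "今天天气很好"),
    Segment("creator", 1.0, 3.0, "今天天气真好"),
    Segment("ocr", 3.5, 5.0, "我们去公园吧"),
    Segment("asr", 3.4, 5.2, "我们去公园吧"),
]

hits = search(segments, "很好")
for hit in hits:
    print(hit.source, hit.text, [s.text for s in overlapping(segments, hit)])
```

Showing the overlapping segments next to each hit is what lets a transcriber compare the sources and spot where OCR, speech recognition, and the creator's subtitles disagree.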

16:15〜17:15#

i1_A3s (Room A): An Attempt at Readable Accent Notation for a Japanese-Slovene Dictionary for Japanese Learners

https://clrd.ninjal.ac.jp/lrw/lrw2024/i1_A3s-paper.pdf

I didn't expect to find scholars sharing their experience of building a Japanese-Slovene dictionary, and the talk covers how they processed UniDic, so it's a must-see! (Also, I hadn't noticed that UniDic contains pitch accent information.)

i1_B3s: An Attempt to Extract Onomatopoeia Candidate Words through Pattern Matching - Using an Onomatopoeia Morphological Transformation Program

It has been revealed that there are 61 types of morphological patterns of onomatopoeia appearing in modern Japanese written and spoken language, with approximately 2200 concrete forms.
https://clrd.ninjal.ac.jp/lrw/lrw2024/i1_B3s-paper.pdf

Research on input methods...? The process behind my own [[Non-dictionary]] is actually very similar to text input, but I had only vaguely noticed that Japanese speakers are very flexible in how they use hiragana, and I didn't expect that onomatopoeia could be divided into 61 types.
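To get a feel for what extracting candidates by pattern matching could look like, here is a small sketch using regular expressions. It covers only two well-known shapes, ABAB reduplication (きらきら) and AっBり (はっきり), which I picked for illustration; these are not the paper's 61 patterns, and the function name is hypothetical.

```python
import re

# Hiragana, katakana, and the long-vowel mark.
KANA = "ぁ-んァ-ヶー"

# Two illustrative patterns only; the real inventory is much larger.
PATTERNS = {
    "ABAB": re.compile(rf"([{KANA}]{{2}})\1"),   # e.g. きらきら, ドキドキ
    "AっBり": re.compile(r"[ぁ-ん]っ[ぁ-ん]り"),  # e.g. はっきり, ぴったり
}


def extract_candidates(text):
    """Return (pattern name, matched string) pairs found in the text."""
    found = []
    for name, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            found.append((name, match.group(0)))
    return found


sample = "星がきらきら光って、胸がドキドキした。はっきり見えた。"
candidates = extract_candidates(sample)
print(candidates)
```

The matches are only candidates: a purely formal pattern also catches ordinary words that happen to have the same shape, so a filtering step would still be needed.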

i1_C2: Characteristics of English Vocabulary Not Adopted as Loanwords in Japanese

This presentation focuses on English words that have not been adopted as loanwords in Japanese and clarifies some of their characteristics. It is well known that modern Japanese contains many loanwords from English. However, not all English words have become loanwords; for example, high-frequency words such as the article "a," the adverb "as," and the pronoun "he" are not loanwords in Japanese (they do not appear as entries in Japanese-language dictionaries). ... Among the top 100 words, 49 are included in the "Digital Daijisen" entries and 51 are not, a nearly even split. By part of speech, all 8 nouns were included, while 5 of 6 auxiliary verbs and 9 of 12 pronouns were not.

I previously answered a question on Zhihu [[What are the Japanese words derived from English]] https://www.zhihu.com/question/544356324/answer/2609385955, and I initially planned to coast through my graduation thesis: analyzing the intersection of Japanese loanwords and vocabulary from exams like the Chinese CET-4, IELTS, TOEFL, etc., but ultimately couldn't resist choosing the direction of [[Non-dictionary]] morphological analysis (it's a pity that I only ended up writing half of it 2333).

Day 2: August 29 (Thursday)#

09:20〜10:40#

i2_A1: Interim Report on the Construction of the "Japanese Game Corpus (JGC)" - Quantitative Features Observed in Early Action Games

https://clrd.ninjal.ac.jp/lrw/lrw2024/i2_A1-paper.pdf

A game corpus?! A must-see! Also, the selected games are all console games from Japanese manufacturers, both new and old (unfortunately, no Genshin Impact, what a shame).

i2_A2: (Tentative) An Attempt at Japanese Research Using "National Diet Library Digitalized Materials Full Text Data"

I'm curious about how academia searches for what they want using already publicly available databases.

i2_A3: Examination of the "Classification Vocabulary List" as a Polysemous Code - Using the Most Important Verbs from the "Basic Dictionary of Japanese for Computers IPAL"

Several presentations in this workshop have used this "Classification Vocabulary List," and I'm curious about what issues were considered during numbering.

i2_B3: Design, Implementation, and Operation of a Japanese Morphological Analysis System for Popup Dictionaries

It is said that hovering the mouse over a word to display its dictionary entry can improve reading efficiency. However, to achieve this function, it is necessary to solve the problem of converting the string under the mouse pointer into dictionary form. Using a morphological analysis system like MeCab is one solution, but such systems often demand a certain level of performance from the user's computer, so they are usually run on servers. However, the morphological analysis in this process differs from that for language research, machine translation, or full-text search; the main purpose is to convert the input string into dictionary form. Therefore, it is possible to reduce the size of the morphological analysis system and implement it more efficiently. This paper discusses the design, implementation, and operation of NonJishoKei, a morphological analysis system specialized for dictionary retrieval.
It has been shown that automatically displaying dictionary explanations when the mouse hovers over a word can effectively improve reading efficiency. However, to achieve this function, a problem must be solved: converting the text near the mouse pointer into a form included in the dictionary. Using a morphological analyzer like MeCab is one solution, but such systems often place high demands on the user's device, so they are usually run on servers. However, unlike language research, machine translation, or full-text search, this scenario only requires converting the text near the mouse pointer into a form included in the dictionary. In other words, a streamlined morphological analyzer can be designed specifically for this scenario. The Japanese Non-dictionary Morphological Analyzer (NonJishoKei) is based on this idea, and this paper discusses its algorithmic principles and engineering implementation.
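To illustrate the general idea of converting the string under the pointer into dictionary form, here is a toy sketch: it tries progressively shorter substrings starting at the pointer, rewrites a few inflectional endings, and keeps whatever matches a headword list. The rule table, the function names, and the tiny headword set are assumptions made up for this example; this is not the algorithm NonJishoKei actually uses.

```python
# A few te-form endings and the dictionary-form endings they may come from.
# Illustrative only; a real rule table would be far larger.
RULES = [
    ("して", "す"),
    ("って", "う"), ("って", "つ"), ("って", "る"),
    ("いて", "く"),
    ("んで", "む"), ("んで", "ぶ"), ("んで", "ぬ"),
    ("て", "る"),
]


def dictionary_forms(surface, headwords):
    """Return candidate dictionary forms of surface that are real headwords."""
    candidates = [surface]
    for suffix, replacement in RULES:
        if surface.endswith(suffix):
            candidates.append(surface[: -len(suffix)] + replacement)
    return [c for c in candidates if c in headwords]


def lookup_at(text, start, headwords, max_len=8):
    """Try the longest substring at the pointer first, then shorter ones."""
    for length in range(max_len, 0, -1):
        surface = text[start:start + length]
        forms = dictionary_forms(surface, headwords)
        if forms:
            return surface, forms
    return None


# Hypothetical headword list standing in for a real dictionary index.
headwords = {"さらけ出す", "食べる", "読む", "買う"}
result = lookup_at("全部さらけ出してしまった", 2, headwords)
print(result)
```

Because every candidate is checked against the headword list, wrong guesses such as さらけ出しる are simply discarded, which is why this kind of lookup can get by without a full morphological analyzer.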

My own presentation (the truth is revealed, 2333). The translation above is a rewrite I made after submitting the original text, so it differs quite a bit (囧).

i2_C2: TEachOtherS: A Composition Education Support System as a Learner Corpus Construction Mechanism

(a) Provides learners with a web-based environment for composition, comments, and reflection, (b) Allows teachers to manage accounts for all students in the class and control activity phases such as composition, comments, and reflection, which can be applied to all students at once. In addition, it is assumed that compositions will be revised based on comments received from others, and it has a version management function for compositions. The results of composition education activities can also be output in HTML format.

I am very interested in the implementation details of this system.

i2_C4: (Tentative) Trends in Writing Errors in Handwritten Kanji by High School Students

In the first year, about 70% of students' compositions showed kanji writing errors, but as the grade increased, the errors decreased, and in the third year, it decreased to about 50%. Among the kanji used in more than 20 compositions, the kanji with the highest error rate was "達," with about 40% of compositions containing errors in the character form of "達."

The findings on this topic are very interesting.

10:50〜12:05#

o12: (Tentative) Characteristics of Anime and Game Vocabulary from the Perspective of Misanalysis - Towards the Creation of a Vocabulary List

Anime and games are resources for Japanese language learners, but the vocabulary used differs from that taught in classrooms. However, there is no vocabulary list that is easy for both learners and teachers to utilize, showing genre-specific vocabulary and its frequency. Therefore, I decided to create a vocabulary list that can be used in Japanese education. Scripts from anime and games tend to produce misanalysis when subjected to morphological analysis. Aiming to provide accurate data, I first conducted morphological analysis on four anime works and one game to confirm where and to what extent misanalysis occurs. As a result, it was found that about 10% of misanalysis occurs, most of which reflects the characteristics of anime and game vocabulary, including unique nouns, interjections, colloquial speech, and hesitations. This presentation will organize the procedures of morphological analysis conducted for vocabulary list creation and consider methods to analyze while retaining the characteristics of anime and games as much as possible.
https://clrd.ninjal.ac.jp/lrw/lrw2024/o12-paper.pdf

I am very interested in the direction and the issue of 【misanalysis】 pointed out, and also, the anime studied includes 【Oshi no Ko】 and 【Gotoubun no Hanayome】 (laughs).

o13: Overview of the Monitor Public Version of the "Children's Daily Conversation Corpus"

https://clrd.ninjal.ac.jp/lrw/lrw2024/o13-paper.pdf

A children's conversation corpus? Looking forward to it!

13:00〜14:00#

Linguistics Deepening Dialogue with Generative AI
Presenter: Taiki Sano (Google LLC)

Wow, Google is impressive!

14:25〜15:25#

i3_A1: The Relationship between Rising and Falling Intonation and Conversational Form - Using the "Japanese Daily Conversation Corpus"

Presenter: Li Haiqi (Zhejiang University Japanese Department)
There are differing views on the usage situations of rising and falling intonation, which is a sentence-final intonation. According to a summary based on introspection and data, rising and falling intonation tends to be used in somewhat formal situations. However, based on impression evaluations and usage rate statistics from data of monologues, rising and falling intonation is frequently used in casual speech.
https://clrd.ninjal.ac.jp/lrw/lrw2024/i3_A1-paper.pdf

The conclusions are very interesting.

i3_A2: (Tentative) Differences in Speech Rate by Daily Conversation Situations

This presentation reports on the results of investigating how speech rates can vary depending on conversation situations and conversation partners.
https://clrd.ninjal.ac.jp/lrw/lrw2024/i3_A2-paper.pdf

The title itself piqued my interest.

i3_A3: Pronunciation of /ei/ Vowel Sequences in Japanese

Presenter: Katarina Hitomi Gerl (University of Ljubljana, Faculty of Arts, Japanese Studies)
According to various dictionaries, the /ei/ vowel sequence in Japanese is pronounced as a long "e" when it does not occur between semantic breaks.

The question it addresses is very interesting.

i3_B3: Construction of a Slovene-Japanese Learning Dictionary Based on Dictionary Reversal and Open Data
Presenter: Kristina Hmeljak Sangawa (University of Ljubljana), Laura Barovič Božjak, Nadja Bostič, Katarina Hitomi Gerl, Jan Hrastnik, Nina Kališnik, Sara Kleč, Eva Kovač, Nina Sangawa Hmeljak, Jure Tomše, and Tomaž Erjavec
Japanese language learning is popular in Slovenia, but reference books are still scarce. Therefore, we attempted to reverse the data of the previously edited Japanese-Slovene dictionary and utilize open data to construct a Slovene-Japanese learning dictionary. First, we extracted equivalent words from the Japanese-Slovene dictionary for each meaning, rearranged them with Slovene as the headword, then manually removed duplicates and inappropriate headwords, and automatically assigned parts of speech and CEFR-compliant difficulty levels to the headwords, with some example sentences attached. Using collaborative editing software Lexonomy, we manually added meaning hints and positional labels to polysemous headwords, and some headwords also included example sentences from parallel corpora. The approximately 8500 vocabulary entries constructed in this way were made publicly available as TEI Lex0 compliant XML data. Participants in the project reported that they gained knowledge about the structure of the dictionary, and we plan to continue editing in the same manner in the future.
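The reversal step described above (extract the equivalents for each meaning, regroup them under Slovene headwords, remove duplicates) can be sketched as follows. The input layout, the function name, and the three toy entries are my own assumptions for illustration; the project's actual data is TEI Lex0 XML and also involves manual cleanup that is not modeled here.

```python
from collections import defaultdict


def reverse_dictionary(ja_sl):
    """Invert a Japanese-to-Slovene mapping into Slovene headwords.

    ja_sl maps each Japanese headword to a list of senses, and each sense
    is a list of Slovene equivalents.
    """
    reversed_entries = defaultdict(list)
    for japanese, senses in ja_sl.items():
        for equivalents in senses:
            for slovene in equivalents:
                # Skip duplicates when a headword lists an equivalent twice.
                if japanese not in reversed_entries[slovene]:
                    reversed_entries[slovene].append(japanese)
    # Sort by Slovene headword (plain code-point order in this sketch).
    return dict(sorted(reversed_entries.items()))


# Toy entries for illustration only, not taken from the actual dictionary.
ja_sl = {
    "家": [["hiša"], ["dom"]],
    "うち": [["dom"]],
    "水": [["voda"]],
}

sl_ja = reverse_dictionary(ja_sl)
print(sl_ja)
```

Even in this toy case "dom" ends up with two Japanese equivalents, which hints at why the reversed headwords needed manually added meaning hints.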

The introduction is very appealing to me, and I look forward to the presentation later.

i3_C2: Personal Emergencies: Analysis of "Wait" on X (Twitter)

This analysis focuses on the usage and characteristics of the imperative "wait," written as the sender's own words without accompanying other elements representing the subject or object in the same sentence. Based on observations of examples posted in the last 60 minutes, it was found that such "wait" is used more frequently than similar expressions like "see" and "hear," and is often used in tweets (posts) that do not have a specific addressee. Furthermore, it is considered that such "wait" expresses "there is some event that shakes emotions and evaluations, and it is an emergency situation that literally requires the sender to wait." Additionally, comparisons were made with examples from Yahoo! Blogs and LINE chats, suggesting that such "wait" is particularly likely to be used on X (Twitter).

https://clrd.ninjal.ac.jp/lrw/lrw2024/i3_C2-paper.pdf

The object of analysis is very interesting.

15:35〜16:50#

o15: A corpus-based cognitive semantic analysis of the polysemy of the Japanese temperature adjective tsumetai

Presenters: Wang Haitao (Kyoto University), Huang Haihong (Kyoto University), Zhong Yong (Nanjing University of Aeronautics and Astronautics)

https://clrd.ninjal.ac.jp/lrw/lrw2024/o15-paper.pdf

A Chinese person presenting an English paper on Japanese...? I'm curious what language will be used for the presentation 2333.

o16: The Use of Sentence-final Forms in Distinguishing Characters' Dialogue in Novels

This paper attempts to collect, organize, and analyze sentence-final forms from the dialogues of 24 characters appearing in 10 entertainment novels and light novels.

https://clrd.ninjal.ac.jp/lrw/lrw2024/o16-paper.pdf

From the title, I thought this would analyze some famous work of Japanese literature, but the introduction turned out to be "analyzing the language styles of different characters in 10 light novels," which instantly caught my attention. Upon opening the paper, I found that the analyzed works include "Bunny Girl Senpai"! Moreover, there are also new works like "Furiko no Furi" ... Can I expect someone to analyze "MyGo" at next year's seminar? (just kidding)