An ASR system trained on field recordings of Ikema achieves around 15% character error rate and materially speeds up human transcription, cutting labor time and reducing cognitive load; this suggests practical AI-assisted pathways for scaling documentation of severely endangered languages.

Automatic Speech Recognition for Documenting Endangered Languages: Case Study of Ikema Miyakoan

Chihiro Taguchi, Yukinori Takubo, David Chiang · March 27, 2026

Source: arXiv · Paper type: quasi_experimental · Evidence strength: medium · Relevance: 7/10

Training an ASR on a field-recorded Ikema corpus yields ~15% character error rate and, when integrated into the transcription workflow, substantially reduces human transcription time and cognitive load.

Language endangerment poses a major challenge to linguistic diversity worldwide, and technological advances have opened new avenues for documentation and revitalization. Among these, automatic speech recognition (ASR) has shown increasing potential to assist in the transcription of endangered language data. This study focuses on Ikema, a severely endangered Ryukyuan language spoken in Okinawa, Japan, with approximately 1,300 remaining speakers, most of whom are over 60 years old. We present an ongoing effort to develop an ASR system for Ikema based on field recordings. Specifically, we (1) construct a 6.33-hour speech corpus from field recordings, (2) train an ASR model that achieves a character error rate as low as 15%, and (3) evaluate the impact of ASR assistance on the efficiency of speech transcription. Our results demonstrate that ASR integration can substantially reduce transcription time and cognitive load, offering a practical pathway toward scalable, technology-supported documentation of endangered languages.

Summary

Main Finding

Fine-tuning a small multilingual Wav2Vec2 model on a 6.33-hour Ikema Miyakoan corpus yields a usable ASR (best CER 14.8% on romaji), can be deployed locally (≈1.2 GB), and—when integrated into annotation software—speeds up human transcription by ≈19–23%, materially reducing the transcription bottleneck in endangered-language documentation.

Key Points

Dataset: first Ikema speech corpus (6.33 hours) assembled from field recordings, spoken dictionary entries, and audiobooks; transcriptions in kana with deterministic romaji mapping; rich tagging (code-switch, disfluency, names, unsure).
Model & performance:
- Fine-tuned Wav2Vec2 CTC models (xls‑r‑300m, xls‑r‑1b, mms‑1b).
- Best result: xls-r-300m, romaji CER = 14.80% (kana CER = 20.94%); WER remains high because word boundaries are ambiguous.
- Smaller 300M model outperformed larger 1B variants in this setup.
Training & deployment:
- Training on a single A10 GPU: ~10 hours for xls-r-300m vs >50 hours for 1B models.
- Model footprint: ~1.2 GB (300M) vs ~3.8 GB (1B); smaller model is more practical for local use and integration into ELAN.
Human evaluation:
- ASR integrated into ELAN; two annotators transcribed the same segments with/without ASR drafts.
- Measured annotation speedups: Annotator A (experienced) +19.4%, Annotator B (novice) +23.3%.
- Converted into labor multipliers: baseline human-hours per audio hour ≈ 6.32× (Annotator A) and 16.80× (Annotator B); ASR reduced these to ≈5.29× and 13.63×, saving ≈1.03 and 3.17 human-hours per audio hour respectively.
Ethics/community: close collaboration, informed consent, PII tagging, sensitivity about imposing kana orthography; dataset/model/code publicly released with community consultation.
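
The labor-multiplier figures above can be reproduced with a few lines of arithmetic. The sketch below assumes that the reported "+19.4%" and "+23.3%" are throughput gains (more audio transcribed per unit time), so the assisted multiplier is the baseline divided by (1 + speedup). This reading is an assumption, but it matches the reported ≈5.29× and ≈13.63×. The function and variable names are illustrative only.

```python
def assisted_multiplier(baseline: float, speedup: float) -> float:
    """Human-hours per audio hour after a throughput gain of `speedup` (0.194 = +19.4%)."""
    return baseline / (1.0 + speedup)


# (baseline human-hours per audio hour, reported speedup), taken from the summary above
annotators = {"A": (6.32, 0.194), "B": (16.80, 0.233)}

results = {}
for name, (baseline, speedup) in annotators.items():
    assisted = assisted_multiplier(baseline, speedup)
    saved = baseline - assisted                # human-hours saved per audio hour
    time_cut_pct = saved / baseline * 100      # equals speedup / (1 + speedup)
    results[name] = (round(assisted, 2), round(saved, 2), round(time_cut_pct, 1))
    print(f"Annotator {name}: {assisted:.2f}x assisted, "
          f"{saved:.2f} h saved per audio hour, {time_cut_pct:.1f}% less time")
```

Under this reading, a 19–23% speedup corresponds to roughly 16–19% less transcription time, since a speedup of s shortens the time by s/(1+s).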

Data & Methods

Data composition:
- Field recordings (mostly semi-spontaneous monologues by a primary male speaker), dictionary pronunciations (many <1s), and audiobooks (female reader).
- Total segments ≈17k, total duration ≈6.33 hours.
- Transcription style: phonetically faithful kana (includes fillers/disfluencies); romaji obtained deterministically.
Modeling:
- Pretrained Wav2Vec2 variants fine-tuned with a CTC decoder to output characters/graphemes (kana morae or romaji phonemes).
- Vocabulary: kana tokens based on morae; romaji tokens based on phonemes.
- Training split: 80% train / 10% dev / 10% test. LR=3e-4, batch sizes 16 (300M) / 4 (1B), 50 epochs.
- Metrics: character error rate (CER) prioritized over WER due to inconsistent word boundaries and orthographic variation.
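
To make the modeling bullets concrete, the sketch below shows how frame-level CTC outputs are typically turned into a character string (greedy decoding: collapse repeats, then drop blanks) and how CER is computed as edit distance divided by reference length. It is a minimal illustration, not the authors' pipeline: the toy vocabulary, the blank id of 0, and the example strings are assumptions, and whether whitespace counts toward CER is a convention the summary does not specify.

```python
from itertools import groupby


def ctc_greedy_decode(frame_ids, id_to_char, blank_id=0):
    """Greedy CTC decoding: merge consecutive repeated labels, then remove blanks."""
    collapsed = [label for label, _ in groupby(frame_ids)]
    return "".join(id_to_char[i] for i in collapsed if i != blank_id)


def edit_distance(ref, hyp):
    """Levenshtein distance (insertions, deletions, substitutions), row by row."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character error rate: character-level edit distance over reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


# Toy romaji-style vocabulary (hypothetical; not the Ikema token set)
id_to_char = {1: "a", 2: "k", 3: "i"}
frames = [2, 2, 0, 1, 1, 0, 0, 2, 3, 3]   # per-frame argmax ids, 0 = blank
hypothesis = ctc_greedy_decode(frames, id_to_char)
print(hypothesis, cer("kagi", hypothesis))
```

Because CER operates on characters rather than words, it does not depend on where word boundaries are drawn, which is the reason given above for preferring it over WER.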
Integration & evaluation:
- Deployed model as an ELAN extension to generate ASR drafts for annotators.
- Controlled within-annotator comparison on the same audio segments, measuring wall-clock annotation time and qualitative cognitive load.

Implications for AI Economics

Labor productivity and cost savings
- ASR drafts raise annotation speed by ~19–23% in this case study, which corresponds to a ~16–19% cut in transcription time (a speedup of s reduces time by s/(1+s)). In per-audio-hour terms, this equates to roughly 1.0–3.2 fewer human-hours per audio hour depending on annotator experience, which is substantial when documentation requires many hours of annotation.
- For a small dataset (6.33 hours), extrapolated human-hours saved are material (e.g., ≈6.5 hours saved for an experienced annotator; ≈20 hours for a novice). For larger projects, savings would scale roughly linearly with annotated hours, provided the per-hour speedup holds.
Development vs. recurring costs
- Upfront fine-tuning cost here was low: ~10 GPU-hours on a single A10 for an effective small model. Larger models required far more compute (>50 GPU-hours) but did not deliver better results in this setting—implying diminishing returns to scale for model size on low-resource language fine-tuning.
- Small models with small footprints (≈1.2 GB) enable low-cost local deployment (no heavy cloud inference fees) and easier integration into community tools, lowering marginal cost of each additional hour of ASR-assisted transcription.
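
As a rough sanity check on the footprint figures, model size can be estimated as parameter count times bytes per parameter. The sketch below assumes 32-bit (4-byte) weights and the nominal parameter counts implied by the model names. Neither is stated in the summary, so treat both as assumptions.

```python
def footprint_gb(num_params: float, bytes_per_param: int = 4) -> float:
    """Approximate on-disk size in GB, assuming uncompressed fixed-width weights."""
    return num_params * bytes_per_param / 1e9


small = round(footprint_gb(300e6), 2)   # nominal 300M parameters at fp32
large = round(footprint_gb(1e9), 2)     # nominal 1B parameters at fp32
print(f"300M model: ~{small} GB, 1B model: ~{large} GB")
```

The 300M estimate matches the reported ≈1.2 GB. The nominal 1B estimate (4.0 GB) is slightly above the reported ≈3.8 GB, which would be consistent with actual parameter counts differing somewhat from the round numbers in the model names.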
Public goods and infrastructure
- Releasing dataset, model, and code creates a public good that reduces duplication of effort across other endangered-language projects. Shared pretrained multilingual representations plus small-scale fine-tuning appear to be a high-leverage investment for many low-resource languages.
Market and policy implications
- There is a viable market and social-return case for funding lightweight ASR toolchains targeted at documentation: relatively low compute cost, sizable labor savings, and high societal value (language preservation).
- Funders and institutions should prioritize: (1) multilingual pretraining and public checkpoints that enable cheap fine-tuning, (2) tooling that integrates ASR into annotator workflows (ELAN plugins, offline inference), and (3) community-centered data governance to manage ethical/externality risks.
Distributional effects and risks
- Productivity gains can change labor demand for skilled transcribers—shifting work from raw transcription to post-editing, validation, and linguistic analysis. Training and compensation models should reflect this change.
- Potential negative externalities: data ownership, misuse, and imposition of a writing system. Mitigation requires community consent, PII controls, licensing, and participatory governance.
Recommendations for stakeholders
- Research funders: invest in multilingual self-supervised models and small-scale transfer pipelines optimized for low-resource languages.
- Project managers: prioritize small-model fine-tuning and local deployment to minimize costs and ensure community control.
- Policymakers/ethics boards: require community consultation, explicit data-licensing decisions, and support capacity building so communities gain control over their digital language assets.

Overall, the paper demonstrates that a modest investment in fine-tuning a publicly available pretrained model and integrating it into annotation tools yields meaningful labor savings (roughly 1–3 human-hours per audio hour in this study) for endangered-language documentation, while highlighting the need for responsible governance and community-centered deployment.

Assessment

Paper Type: quasi_experimental
Evidence Strength: medium. The paper provides direct empirical measurements (ASR performance on held-out data and a user study measuring transcription time and cognitive load), which support claims about productivity gains; however, the scale appears limited to a single endangered language, sample sizes and randomization details are not reported in the summary, and task-specific factors (annotator skill, recording quality, within-subject learning/order effects) could confound estimates.
Methods Rigor: medium. Strengths include creation of a field-recorded speech corpus and standard ASR evaluation (character error rate) plus a human-subject evaluation of transcription efficiency; limitations include likely small and convenience samples (both speakers and annotators), potential lack of a pre-registered protocol or full randomization/counterbalancing in the user study, and limited information on test/train splits, speaker independence, and robustness checks.
Sample: A speech corpus constructed from field recordings of Ikema (a severely endangered Ryukyuan language spoken in Okinawa; population ≈1,300, speakers predominantly over age 60). The abstract gives the corpus size only as an unresolved macro ({\totaldatasethours}-hour); the summary above reports 6.33 hours. ASR training and evaluation used held-out test segments from the same field recordings; the transcription-efficiency evaluation involved human annotators transcribing recordings with and without ASR assistance (participant count and composition not provided in the abstract).
Themes: productivity, human_ai_collab
Identification: Comparison of transcription tasks performed with and without ASR assistance (likely a controlled within-subject or between-subject user study) together with held-out ASR test-set evaluation (CER). Causal claims about time/cognitive-load reductions rest on the experimental comparison of assisted vs. unassisted conditions rather than on random assignment to population-level interventions.
Generalizability:
- Language-specific: results may not generalize beyond Ikema or closely related Ryukyuan dialects.
- Recording conditions: field-recording quality and speaker demographics (mostly elderly) limit transferability to other domains or age groups.
- ASR tuned to same-corpus data: performance may degrade on out-of-domain speech or different recording setups.
- Annotator sample: productivity gains depend on transcriber expertise (linguists vs. community members) and may not generalize across annotator populations.
- Scale: effects observed on a small, low-resource corpus may not scale linearly to larger documentation projects or commercial transcription settings.

Claims (7)

Claim	Category	Direction	Confidence	Outcome	Details
We construct a 6.33-hour speech corpus from field recordings.	Other	positive	high	size of speech corpus (hours)	0.48
We train an ASR model that achieves a character error rate as low as 15%.	Error Rate	positive	high	character error rate	15% 0.48
Ikema is a severely endangered Ryukyuan language spoken in Okinawa, Japan, with approximately 1,300 remaining speakers, most of whom are over 60 years old.	Other	negative	high	number and age distribution of speakers	n=1300 0.24
ASR integration can substantially reduce transcription time.	Task Completion Time	positive	medium	transcription time	0.29
ASR integration can substantially reduce cognitive load for transcribers.	Worker Satisfaction	positive	medium	cognitive load of transcribers	0.29
ASR-assisted transcription offers a practical pathway toward scalable, technology-supported documentation of endangered languages.	Adoption Rate	positive	medium	scalability of language documentation (feasibility/adoption implications)	0.05
Automatic speech recognition (ASR) has shown increasing potential to assist in the transcription of endangered language data.	Other	positive	high	utility/potential of ASR for endangered-language transcription	0.24