Vietnamese Language Implementation Guide | ICE Encounter

Overview

Vietnamese language integration poses unique NLP challenges primarily due to its complex diacritical mark system and tonal nature. A single base character can adopt entirely different meanings depending on its accompanying diacritic.

Diacritical Complexity

The Tone System

Vietnamese is a tonal language with six distinct tones, each marked by diacritics:

Tone	Mark	Example	Meaning
Ngang (level)	none	ma	ghost
Sắc (rising)	á	má	mother
Huyền (falling)	à	mà	but
Hỏi (dipping)	ả	mả	tomb
Ngã (broken)	ã	mã	horse
Nặng (heavy)	ạ	mạ	rice seedling

Mobile Input Challenges

Problem	Frequency	Impact
Diacritics omitted for speed	Very common	Ambiguity in meaning
Autocorrect failures	Common	Incorrect word substitution
Keyboard limitations	Some devices	Missing diacritic options

Example: "ma" without diacritics could mean ghost, mother, but, tomb, horse, or rice seedling depending on context.

Context Inference Requirements

Approach	Implementation
N-gram context	Analyze surrounding words for disambiguation
Sentence-level parsing	Use full sentence to infer meaning
Clarification prompts	Ask user when ambiguity is high-stakes

Normalization Tools

underthesea Toolkit

The underthesea library is critical for Vietnamese NLP preprocessing.

Function	Purpose
Word segmentation	Identify word boundaries
POS tagging	Part-of-speech identification
Named entity recognition	Extract proper nouns, organizations
Text normalization	Standardize diacritics

Preprocessing Pipeline

from underthesea import word_tokenize, text_normalize

# Normalize text (fix diacritics, standardize)
raw_input = "toi can gap luat su ve di tru"
normalized = text_normalize(raw_input)
# Output: "tôi cần gặp luật sư về di trú"

# Word segmentation
tokens = word_tokenize(normalized)
# Output: ["tôi", "cần", "gặp", "luật sư", "về", "di trú"]

Diacritic Restoration

Technique	Accuracy	Use Case
Dictionary lookup	High for common words	Real-time restoration
Neural models	Highest overall	Batch processing
Context windows	Good with sufficient text	Conversation context

Recommended Models

Model Options

Model	Vietnamese Performance	Notes
Qwen3-235B	Strong	MoE architecture, large context
Llama 3.1 8B	Good	Accessible size, fine-tunable
PhoBERT	Excellent (classification)	Vietnamese-specific BERT
ViT5	Good (generation)	Vietnamese T5 variant

Model Selection Criteria

Factor	Consideration
Diacritic handling	Must preserve all tone marks
Romanized input	Handle input without diacritics
Legal terminology	Sino-Vietnamese roots common
Conversational tone	Balance formal and accessible

Linguistic Considerations

Sentence Structure

Vietnamese follows Subject-Verb-Object (SVO) order but with important differences:

Feature	Vietnamese Pattern	Example
Pro-drop	Pronouns often omitted	"(Tôi) cần luật sư" = "(I) need lawyer"
Kinship pronouns	Used instead of I/you	"Con" (child), "Anh" (older brother)
Classifier system	Required for nouns	"Một người luật sư" (one person lawyer)

Formality Registers

Register	Markers	Use Context
Formal	"Quý khách", "Ngài"	Legal documents, court
Polite	"Anh/Chị", "Ông/Bà"	General interactions
Familiar	"Em", "Con"	Close relationships

Chatbot Default: Use polite register with appropriate kinship terms based on context.

Legal Terminology

English	Vietnamese	Notes
Immigration	Di trú	Sino-Vietnamese root
Deportation	Trục xuất	Formal legal term
Asylum	Tị nạn	Historical resonance
Green Card	Thẻ xanh	Loan translation
Lawyer	Luật sư	Sino-Vietnamese
Court	Tòa án	Sino-Vietnamese

Trauma-Informed Design

Historical Context

Factor	Impact on Design
Refugee history	Deep distrust of government data collection
Communist regime experience	Suspicion of surveillance
Family separation trauma	Sensitivity around detention topics
Generational trauma	1975 fall of Saigon still resonant

Design Principles

Principle	Implementation
Explicit privacy assurance	Clear statements about data non-sharing
Independence emphasis	Highlight separation from government
No data persistence	Ephemeral conversations when possible
Calm visual design	Avoid alarming colors, urgent language
Empathetic tone	Acknowledge difficulty of situations

Trigger Avoidance

Topic	Sensitive Handling
Detention	Frame as "temporary holding," provide resources
Deportation	Emphasize rights, legal options
Government contact	Clarify difference from enforcement
Family separation	Acknowledge emotional difficulty

Sample Trauma-Informed Prompts

System Prompt:
Bạn là trợ lý thông tin pháp lý về di trú. Hãy trả lời bằng giọng điệu bình tĩnh,
đồng cảm. Luôn nhấn mạnh rằng thông tin này hoàn toàn riêng tư và không chia sẻ
với bất kỳ cơ quan chính phủ nào. Khi thảo luận về các chủ đề nhạy cảm như giam
giữ hoặc trục xuất, hãy nhấn mạnh quyền của người dùng và các nguồn hỗ trợ có sẵn.

Translation:
You are an immigration legal information assistant. Respond in a calm, empathetic tone.
Always emphasize that this information is completely private and not shared with any
government agency. When discussing sensitive topics like detention or deportation,
emphasize the user's rights and available support resources.

Regional Dialects

Dialectal Variation

Dialect	Region	Key Differences
Northern	Hanoi area	Standard pronunciation, formal
Central	Hue, Da Nang	Distinct vocabulary, intonation
Southern	Ho Chi Minh City	Merged tones, different vocabulary

US Vietnamese Population

Background	Concentration	Notes
Southern (pre-1975)	Orange County, San Jose	Older refugees
Post-1975 migrants	Mixed regions	Diverse backgrounds
Recent arrivals	Scattered	Different political context

Best Practice: Default to Northern standard for written text while accepting all dialectal input.

Community Outreach

Platform Preferences

Platform	User Base	Engagement
Zalo	Primary for older Vietnamese	Messenger, news, services
Facebook	Broad demographic	Groups, pages
YouTube	Information seeking	Vietnamese-language content
WhatsApp	Younger, tech-savvy	International connections

Zalo Integration

Consideration	Assessment
Reach	Dominant among Vietnamese diaspora
Official Accounts	Business/organization profiles available
Chatbot capability	API available for automation
Privacy	Vietnam-based company, consider implications

Trusted Intermediaries

Organization Type	Role	Examples
Buddhist temples	Spiritual community, trust	Local Chùa (temples)
Catholic parishes	Large Catholic population	Vietnamese Catholic communities
Community centers	Service provision	Vietnamese Community of [City]
Professional associations	Credibility	Vietnamese American Bar Association

Geographic Concentrations

Location	Population	Key Organizations
Orange County, CA	200,000+	Little Saigon, BPSOS
San Jose, CA	130,000+	Vietnamese American Foundation
Houston, TX	90,000+	Boat People SOS Texas
DFW, TX	70,000+	Various community orgs
Seattle, WA	60,000+	Vietnamese community centers

RAG Configuration

Chunking Strategy

Consideration	Approach
Syllable-based words	Vietnamese words often multi-syllable with spaces
Pre-chunking normalization	Restore diacritics before embedding
Preserve compound terms	Legal terminology as units

Morphological Processing

Vietnamese Text → underthesea normalize → Word segment → Chunk → Embed

Stage	Tool	Purpose
Normalize	underthesea.text_normalize	Fix diacritics
Segment	underthesea.word_tokenize	Identify words
Chunk	Custom splitter	Respect word boundaries
Embed	Multilingual model	Vector representation

Quality Assurance

Testing Requirements

Test Type	Participants	Focus
Diacritic handling	Automated	Input without marks
Tone disambiguation	Human review	Context interpretation
Trauma sensitivity	Community reviewers	Language around sensitive topics
Generational testing	Mixed ages	Accessibility across generations

Common Failure Modes

Issue	Detection	Resolution
Diacritic errors	Automated spell check	Improve normalization
Overly formal tone	Community feedback	Adjust prompt persona
Insensitive language	Community review	Content revision
Northern dialect bias	Southern speaker testing	Expand training data

Implementation Checklist

Phase 1: Foundation

[ ] Set up underthesea preprocessing
[ ] Configure diacritic restoration
[ ] Select base model (Qwen3 or Llama 3.1)
[ ] Create legal terminology dictionary
[ ] Design trauma-informed prompts

Phase 2: Integration

[ ] Build RAG with Vietnamese embeddings
[ ] Test diacritic-free input handling
[ ] Configure appropriate fonts (Noto Sans)
[ ] Implement calm, empathetic UI

Phase 3: Community Validation

[ ] Partner with community organizations
[ ] Test with diverse age groups
[ ] Validate trauma-informed approach
[ ] Buddhist/Catholic community review

Phase 4: Deployment

[ ] Pilot in Little Saigon (Orange County)
[ ] Zalo integration if appropriate
[ ] Monitor feedback channels
[ ] Iterate based on community input

Next Steps

Set up translation workflow for Vietnamese content
Design multilingual UX with diacritic support
Review community context for cultural considerations
Plan full implementation across all languages