Unit 3: Morphophonology

Learning Objectives

Explain what morphophonology studies and why surface forms differ from underlying morpheme sequences
Identify examples of phonologically conditioned and morphologically conditioned allomorphy
Describe suppletive forms and distinguish them from regular allomorphy
Analyse how writing systems reflect (or fail to reflect) phonological processes

Reading

Read Chapter 3 (Morphophonology) of Bender, E. M. (2013). Linguistic Fundamentals for Natural Language Processing: 100 Essentials from Morphology and Syntax. Morgan & Claypool. Use the course materials below to activate and consolidate the concepts from that chapter.

Core Input

Read through each tab. Take notes on the key ideas before moving to the activities below.

Morphophonology is the subfield of linguistics that studies the relationship between morphological structure and phonological form. It sits at the interface between morphology (the study of word structure) and phonology (the study of sound systems), and it describes how the surface realisation of a morpheme varies depending on its phonological or morphological environment.

A fundamental distinction in morphophonology is between:

Underlying forms — abstract representations of morphemes, stored in the mental lexicon. The underlying form is the "canonical" version of the morpheme, before phonological rules apply.
Surface forms — the actual phonological sequences that are pronounced. These are the result of applying phonological rules to the underlying form in a particular context.

The English plural illustrates this neatly. There is a single underlying plural morpheme (often represented as /z/). Phonological rules map this underlying form to three distinct surface realisations:

cats → plural pronounced /s/ (after voiceless consonant /t/)
dogs → plural pronounced /z/ (after voiced consonant /g/)
horses → plural pronounced /ɪz/ (after a sibilant such as /s/)

One morpheme; three surface forms. A linguist recognises the unity of the plural morpheme at the underlying level, even though its surface realisations vary.

NLP implication: NLP systems that process text see only written surface forms. They see cats, dogs, and horses, and must infer that the written endings -s and -es (pronounced /s/, /z/, or /ɪz/) all represent the same underlying morpheme. Morphological analysers that build on phonological knowledge can generalise across these surface variants; purely data-driven systems must learn the generalisation implicitly from examples.

Allomorphy is the phenomenon in which a single underlying morpheme has multiple phonological realisations — its allomorphs — that depend on context. Allomorphy is the empirical consequence of the underlying/surface distinction.

Two main types of conditioning are distinguished:

Phonologically conditioned allomorphy: the allomorph chosen is determined by the phonological properties of the surrounding sounds. The English plural allomorphs (/s/, /z/, /ɪz/) are conditioned by the final sound of the stem: whether it is voiced and whether it is a sibilant. The English indefinite article a/an is another example: a cat but an orange. The choice is conditioned by whether the following word begins with a consonant sound or a vowel sound.
Morphologically conditioned allomorphy: the allomorph chosen is determined not by phonology but by the particular morphological context (typically, which lexical item the affix attaches to). In the English past tense, regular -ed applies by default, but certain verbs have idiosyncratic forms: hold → held, ring → rang, and at the extreme go → went and be → was/were. The choice is conditioned by the specific verb root, not by any phonological property of that root.

Suppletion is the extreme end of morphologically conditioned allomorphy: the allomorph is so phonologically distant from the base that no rule could derive one from the other. Suppletive pairs must be stored in the lexicon rather than computed.

NLP implication: phonologically conditioned allomorphy can, in principle, be modelled with phonological rules (if the system has access to phonological representations). Morphologically conditioned allomorphy and suppletion require lexical storage. A morphological analyser must integrate both rule-based and lexical components to handle all cases correctly.

Writing systems vary in how faithfully they reflect the phonological structure of their language. The key dimension is orthographic depth:

Shallow orthographies have a close, consistent mapping between graphemes (written symbols) and phonemes (sounds). Finnish and Spanish are often cited as examples: spelling reliably predicts pronunciation and vice versa.
Deep orthographies have an imperfect, inconsistent mapping. English is notorious for this: through, tough, though, cough all share the -ough spelling but are pronounced differently. The spelling reflects historical pronunciation, etymology, and borrowing history — not the current sound system.

Importantly for morphophonology, writing systems tend to preserve morphological identity rather than track phonological variation. English writes the plural as -s regardless of whether it is pronounced /s/, /z/, or /ɪz/. The spelling gains morphological consistency at the expense of phonological accuracy.

Other writing systems obscure phonological information in different ways: standard Arabic script omits short vowels; French spelling writes liaison consonants whether or not they are pronounced, so the orthography does not show where liaison (a consonant pronounced at a word boundary before a vowel) actually occurs.

NLP implication: because NLP systems typically process text (not speech), they see spellings, not sounds. The phonological alternations that motivate allomorphic variation are therefore invisible to a text-based system. This creates a systematic gap between what the system can observe and what the underlying linguistic structure actually is. For speech-based NLP tasks — speech recognition, text-to-speech — phonological modelling must be done explicitly.

Key Concepts A: Morphophonology and Allomorphy (#23–#25)

Expand each concept. Consider the NLP implication before reading the explanation.

Morphophonology provides the systematic account of why a single underlying morpheme is realised differently in different phonological environments. It is the set of rules (or constraints) that map underlying representations to surface forms.

The canonical case: English plural. The underlying plural morpheme is abstractly a single entity. The surface forms /s/, /z/, and /ɪz/ are all derived from it by phonological rules:

Insert /ɪ/ before the plural /z/ when the preceding consonant is a sibilant (to avoid two sibilants in sequence): horse + z → horse + ɪz → horses
Devoice /z/ → /s/ after a voiceless consonant: cat + z → cat + s → cats
The underlying /z/ surfaces unchanged elsewhere: dog + z → dogs

The morphophonological generalisation is this: the variation is regular and predictable. Given the phonological context, you can always derive the correct surface form from the underlying one by rule.
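The three ordered rules above can be sketched in code. This is a minimal illustration, assuming the stem is already available as a list of phoneme symbols; the phoneme sets and the function name `plural_surface` are illustrative choices, not part of the reading.

```python
# Toy inventories (assumptions for illustration; not a full English phoneme set).
SIBILANTS = {"s", "z", "ʃ", "ʒ", "tʃ", "dʒ"}
VOICELESS = {"p", "t", "k", "f", "θ", "s", "ʃ", "tʃ", "h"}

def plural_surface(stem):
    """Return the surface phonemes of stem + underlying plural /z/."""
    final = stem[-1]
    if final in SIBILANTS:
        # Rule 1: insert /ɪ/ between a sibilant and the plural /z/.
        return stem + ["ɪ", "z"]
    if final in VOICELESS:
        # Rule 2: devoice /z/ to /s/ after a voiceless consonant.
        return stem + ["s"]
    # Elsewhere: the underlying /z/ surfaces unchanged.
    return stem + ["z"]

print(plural_surface(["k", "æ", "t"]))       # cats
print(plural_surface(["d", "ɒ", "g"]))       # dogs
print(plural_surface(["h", "ɔː", "s"]))      # horses
print(plural_surface(["b", "l", "ɪ", "k"]))  # novel word "blick"
```

The order of the checks matters: /s/ is both a sibilant and voiceless, so the sibilant rule has to be tried first, otherwise horses would wrongly receive /s/.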

NLP implication: a morphological analyser that encodes morphophonological rules can generalise to new words it has never seen before — it can predict that the plural of a novel word blick must be pronounced blicks (/s/), not blickz. A purely data-driven model must learn these patterns statistically from observed forms; it will generalise well for frequent patterns but may fail for rare phonological contexts.

Phonologically conditioned allomorphy occurs when the surface form of a morpheme is determined by the phonological properties of the neighbouring sounds: their voicing, place of articulation, syllable structure, or vowel quality.

English indefinite article:

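a cat, a dog, a university: /ə/ before words that begin with a consonant sound (note: a university because the word begins with /j/, despite the vowel letter)
an apple, an hour: /ən/ before words that begin with a vowel sound (note: an hour because the h is silent)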

The choice of allomorph (a vs an) is entirely determined by the phonological context — the initial sound of the following word.
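A small sketch makes the point that the conditioning is phonological, not orthographic. It assumes a hypothetical mini-lexicon `FIRST_SOUND` that records the first sound of each word; a real system would take this from a pronunciation dictionary. Both function names are invented for illustration.

```python
# Hypothetical mini-lexicon: the first sound of each word (toy data).
FIRST_SOUND = {
    "cat": "k", "dog": "d", "university": "j",
    "apple": "æ", "hour": "aʊ",
}
VOWEL_SOUNDS = {"æ", "aʊ"}  # only the vowel sounds needed for this toy lexicon

def article_by_sound(word):
    """Choose a/an from the first sound of the following word."""
    return "an" if FIRST_SOUND[word] in VOWEL_SOUNDS else "a"

def article_by_letter(word):
    """Naive baseline: choose a/an from the first letter."""
    return "an" if word[0] in "aeiou" else "a"

for w in FIRST_SOUND:
    print(w, article_by_sound(w), article_by_letter(w))
```

The two functions disagree on university and hour, which are exactly the cases where spelling and sound diverge.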

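Turkish vowel harmony is a pervasive phonological process in which suffix vowels harmonise with the last vowel of the stem in frontness and, for suffixes with a high vowel, also in rounding. The plural suffix has a low vowel, so it alternates only for frontness: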

Root	Gloss	Root vowel	Plural suffix	Plural form
ev	house	e (front)	-ler	evler
araba	car	a (back)	-lar	arabalar
göl	lake	ö (front, rounded)	-ler	göller
kol	arm	o (back, rounded)	-lar	kollar

The suffix vowel is not a fixed form; it is determined by the phonological properties of the root. This is phonologically conditioned allomorphy at scale: almost every suffix in Turkish is subject to vowel harmony, with only a handful of invariant exceptions.

NLP implication: a morphological analyser for Turkish must either encode vowel harmony rules or learn them from data. A simple lookup-based approach will fail on novel roots, because the correct suffix form depends on the root's vowel quality — information that the analyser must be sensitive to.
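As a rough sketch of the rule-based option, the plural allomorph can be chosen from the last vowel of the root. The function name `turkish_plural` is invented for illustration, the input is assumed to be lowercase, and exceptional loanwords whose plural does not follow from their spelling are ignored.

```python
FRONT_VOWELS = set("eiöü")
BACK_VOWELS = set("aıou")

def turkish_plural(root):
    """Attach -ler or -lar according to the last vowel of the root."""
    for ch in reversed(root):
        if ch in FRONT_VOWELS:
            return root + "ler"
        if ch in BACK_VOWELS:
            return root + "lar"
    raise ValueError(f"no vowel found in {root!r}")

for root in ["ev", "araba", "göl", "kol"]:
    print(root, "->", turkish_plural(root))
```

Because the rule inspects the root rather than looking it up, it also applies to roots the system has never seen, which is what a lookup table cannot do.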

Morphologically conditioned allomorphy occurs when the choice of allomorph is determined not by phonological properties, but by the particular morphological environment — specifically, which lexical item (root or word class) the affix is attached to.

English past tense: the default past tense suffix is -ed (regular verbs: walk → walked, jump → jumped). But certain verbs have allomorphs conditioned by the specific lexical item:

Verb	Past tense	Type
walk	walked	Regular: -ed
go	went	Suppletive (see #26)
hold	held	Morphologically conditioned vowel change
ring	rang	Morphologically conditioned ablaut
be	was/were	Morphologically conditioned (with person/number split)

The allomorphs held and rang cannot be predicted from phonological properties of the roots: there is no phonological rule that turns hold into held or ring into rang. The allomorph is specific to the lexical item.

German plural shows morphological conditioning across noun classes: Hund → Hunde (dog), Kind → Kinder (child), Auto → Autos (car), Mutter → Mütter (mother). Each noun belongs to a class that determines which plural allomorph it takes; the class is a morphological (not phonological) property.

NLP implication: morphologically conditioned allomorphy cannot be computed by phonological rule; it requires lexical information. NLP systems must store the class membership or irregular forms of individual lexical items, in addition to whatever regular rules they encode.
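One way to picture this storage requirement is a lexicon that records a plural class for each noun. The sketch below covers only the four German nouns mentioned above; the class labels and function names are hypothetical, and the umlaut helper handles single vowels only (not, for instance, au → äu).

```python
# Hypothetical class labels: a suffix string, or "umlaut" for stem-vowel change.
PLURAL_CLASS = {"Hund": "e", "Kind": "er", "Auto": "s", "Mutter": "umlaut"}
UMLAUT = {"a": "ä", "o": "ö", "u": "ü"}

def umlaut_last_back_vowel(stem):
    """Front the rightmost a/o/u in the stem (simplified)."""
    for i in range(len(stem) - 1, -1, -1):
        if stem[i] in UMLAUT:
            return stem[:i] + UMLAUT[stem[i]] + stem[i + 1:]
    return stem

def german_plural(noun):
    """Form the plural from the stored class; None if the noun is unknown."""
    cls = PLURAL_CLASS.get(noun)
    if cls is None:
        return None  # no phonological rule can supply the class
    if cls == "umlaut":
        return umlaut_last_back_vowel(noun)
    return noun + cls

for noun in PLURAL_CLASS:
    print(noun, "->", german_plural(noun))
```

Returning `None` for an unknown noun is deliberate: without the stored class, the sketch has no principled way to choose among the plural allomorphs.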

Key Concepts B: Suppletion and Writing Systems (#26–#27)

Continue with the remaining core concepts for this unit.

Suppletion is the extreme case of morphologically conditioned allomorphy: the allomorph for a particular morphological combination is phonologically unrelated to the base form. It shares no phonological material with the stem and cannot be derived from it by any rule. Suppletive forms must be stored as lexical irregularities.

English suppletive paradigms:

Lexical item	Base	Suppletive form(s)	Function
go	go	went	Past tense
good	good	better, best	Comparative, superlative
bad	bad	worse, worst	Comparative, superlative
be	be	am, is, are, was, were	Person/number/tense agreement

Latin provides a classic example: fero (I carry) has the perfect tuli (I carried). The two forms share no phonological material; the perfect stem is entirely suppletive.

Suppletion is found cross-linguistically, particularly in high-frequency vocabulary — the verbs go, be, and have, and adjectives like good and bad, are suppletive in many languages. This is not a coincidence: high frequency preserves irregular forms because speakers encounter them often enough to memorise them; rare items regularise over time.

NLP implication: suppletive forms cannot be generated by rule; they must appear in a lexicon. A morphological generator that produces past tenses by appending -ed will generate *goed instead of went. Language models learn suppletive paradigms implicitly from data, and they typically do so successfully for common suppletive forms — but they may overgeneralise to regularised forms for rare or novel lexical items.
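The contrast between a rule-only generator and one with an exception lexicon can be sketched as follows. The lexicon holds only the irregular verbs mentioned in this unit, the function names are invented, and the regular rule is a naive string append that ignores spelling adjustments such as consonant doubling.

```python
# Stored irregular past tenses (toy lexicon with forms from this unit).
IRREGULAR_PAST = {"go": "went", "hold": "held", "ring": "rang"}

def past_rule_only(verb):
    """Rule-only generator: always append -ed."""
    return verb + "ed"

def past_with_lexicon(verb):
    """Consult the exception lexicon first, then fall back to the rule."""
    if verb in IRREGULAR_PAST:
        return IRREGULAR_PAST[verb]
    return past_rule_only(verb)

for verb in ["walk", "jump", "go", "ring"]:
    print(verb, past_rule_only(verb), past_with_lexicon(verb))
```

The lookup has to come before the rule: a stored form blocks the regular form, which is how went pre-empts *goed.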

Writing systems encode language in a particular way, and their relationship to the spoken language is never perfect. Alphabetic and syllabic scripts tend to reflect the broad outlines of a language's phonology, but they systematically fail to represent certain phonological processes — particularly morphophonological alternations.

English: spelling preserves morphological identity over phonological accuracy. The plural is always written with -s, regardless of whether it is pronounced /s/, /z/, or /ɪz/. The past tense is always written with -ed, regardless of whether it is pronounced /t/ (as in walked), /d/ (as in jogged), or /ɪd/ (as in wanted). The phonological alternation is hidden by the spelling.

French liaison: in spoken French, a final consonant that is normally silent is pronounced when it precedes a word beginning with a vowel. In les amis ("the friends"), the s of les is pronounced and links to amis. The spelling writes that s whether or not it is pronounced, so standard orthography does not show where liaison occurs.

Arabic: standard written Arabic (Modern Standard Arabic) omits short vowels. Since vowel patterns are a central component of Arabic morphology (#8), the written form underspecifies morphological structure. A human reader with knowledge of Arabic grammar resolves the ambiguity from context; an NLP system must do the same.

Japanese kanji: Chinese characters borrowed into Japanese (kanji) have multiple readings (on-yomi from Chinese, kun-yomi native Japanese) that reflect historical phonological layers, not the current phonological system.

NLP implication: Grapheme-to-phoneme (G2P) conversion — converting written text to pronunciation — is non-trivial in deep orthographies like English because the mapping is not consistent. A G2P system for English must handle a large number of irregularities and context-dependent rules. For shallow orthographies (Finnish, Spanish), G2P is much simpler. The choice of language therefore determines what phonological processing is required.

Worked Examples: Morphophonology in Action

Work through these examples to see how morphophonological processes operate in several languages, and consider the challenges they present for NLP.

English plural allomorphy provides the clearest illustration of phonologically conditioned allomorphy: three surface forms of one underlying morpheme.

Surface spelling	Pronunciation	Conditioning environment	Examples
-s	/s/	After voiceless non-sibilant consonants	cats, cups, rocks, graphs
-s	/z/	After voiced consonants and vowels	dogs, beds, cans, days
-es	/ɪz/	After sibilants (/s z ʃ ʒ tʃ dʒ/)	horses, judges, churches, mazes

Crucially, the spelling does not reflect the alternation: all three are written with -s (or -es), obscuring the phonological variation (#27).

NLP implication: a phonology-aware morphological analyser can predict the correct pronunciation of any novel plural from phonological rules alone (applying #23 and #24). A text-based system has no access to this alternation unless it is explicitly modelled — it sees only the orthographic -s and must learn the phonological implication indirectly.

Turkish vowel harmony is a systematic phonological process in which suffix vowels harmonise with the last vowel of the stem in up to two dimensions: frontness (front/back) and rounding (rounded/unrounded).

The plural suffix alternates only for frontness: it is -ler after front vowels and -lar after back vowels, whether or not the vowel is rounded. Suffixes with a high vowel, such as the genitive -in/-ın/-ün/-un, also alternate for rounding.

ev-ler

house-PL (root vowel e = front, unrounded → suffix -ler)

“Houses”

araba-lar

car-PL (root vowel a = back, unrounded → suffix -lar)

“Cars”

göl-ler

lake-PL (root vowel ö = front, rounded → suffix -ler)

“Lakes”

kol-lar

arm-PL (root vowel o = back, rounded → suffix -lar)

“Arms”

NLP implication: an NLP system that treats the suffix as a fixed string will misanalyse Turkish morphology. Both -ler and -lar (and their counterparts in other suffix slots) are allomorphs of the same underlying morpheme. A machine translation system must understand vowel harmony to generate correctly inflected Turkish target forms (#24).

Suppletive paradigms in English and other languages cannot be derived by any phonological or morphological rule — they must be memorised and stored lexically.

Language	Base	Suppletive form	Function
English	go	went	Past tense
English	good	better / best	Comparative / superlative
English	be	am / is / are / was / were	Person, number, tense
Latin	bonus (good)	melior / optimus	Comparative / superlative
Hindi	acchā (good)	behtar (better)	Comparative (borrowed from Persian)

Note that the comparative of "good" is suppletive in both Latin and English, but the suppletive forms themselves are not cognate: Latin melior and English better go back to different roots. What the two languages share is the pattern, not the forms: high-frequency items such as "good" tend to keep suppletive paradigms.

NLP implication: morphological generators must include a lexical exception list for suppletive forms. Language models learn suppletive paradigms from data; because suppletive forms like went and better are very frequent in training text, LLMs typically handle them well. However, for low-frequency verbs with irregular forms, or for morphologically rich languages with many suppletive paradigms, data coverage may be insufficient (#26).

Writing systems and pronunciation have a complex, language-specific relationship. Spelling is not pronunciation — and the gap between them is different in every language.

English — deep orthography:

Written	Pronounced	Notes
through	/θruː/	Same spelling -ough, four different pronunciations
tough	/tʌf/
though	/ðəʊ/
cough	/kɒf/

Written past tense	Pronunciation	Rule
walked (-ed)	/wɔːkt/ (/t/)	After voiceless consonant
jogged (-ed)	/dʒɒgd/ (/d/)	After voiced consonant
wanted (-ed)	/wɒntɪd/ (/ɪd/)	After alveolar stop
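The past tense rule in the table can be written as a short function over the stem's final sound. This is a sketch under the assumption that the final phoneme is already known; the voiceless set is a toy inventory and `past_allomorph` is an illustrative name.

```python
# Toy set of voiceless consonants (assumption; not a complete inventory).
VOICELESS = {"p", "t", "k", "f", "θ", "s", "ʃ", "tʃ", "h"}

def past_allomorph(final_phoneme):
    """Return the pronunciation of written -ed after the given final sound."""
    if final_phoneme in {"t", "d"}:
        return "ɪd"  # after an alveolar stop: wanted
    if final_phoneme in VOICELESS:
        return "t"   # after a voiceless consonant: walked
    return "d"       # elsewhere (voiced sounds): jogged

print(past_allomorph("k"))  # walk
print(past_allomorph("g"))  # jog
print(past_allomorph("t"))  # want
```

As with the plural, the most specific check comes first: /t/ is voiceless, so testing for alveolar stops second would give wanted the wrong ending.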

Arabic — vowel-sparse script:

Standard Arabic script represents consonants and long vowels, but omits short vowels. The written form ktb (representing the root k-t-b) could correspond to several different word forms, such as kataba (he wrote), kutiba (it was written), or kutub (books). By contrast, kitāb (book) is written with a letter for its long ā. A fully vowelled text removes the ambiguity, but most running text lacks full vowelling.
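The ambiguity can be simulated on romanised forms by deleting short vowels while keeping consonants and long vowels. This is only a sketch over transliterations, not over Arabic script; the word list and the function name `written_skeleton` are illustrative.

```python
import unicodedata
from collections import defaultdict

SHORT_VOWELS = set("aiu")  # long vowels such as ā are kept

def written_skeleton(translit):
    """Drop short vowels from a romanised form, as unvowelled script does."""
    # NFC keeps ā as one code point so it is not mistaken for short a.
    text = unicodedata.normalize("NFC", translit)
    return "".join(ch for ch in text if ch not in SHORT_VOWELS)

words = ["kataba", "kutiba", "kutub", "kitāb"]
by_skeleton = defaultdict(list)
for w in words:
    by_skeleton[written_skeleton(w)].append(w)

for skeleton, candidates in by_skeleton.items():
    print(skeleton, candidates)
```

Three of the four forms collapse to the same skeleton, so a reader or a system has to recover the intended word from context.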

NLP implication: grapheme-to-phoneme (G2P) conversion requires language-specific knowledge for every one of these cases. A G2P system built for one language cannot handle another without new rules or training data. Text-to-speech systems for deep orthographies like English must either memorise pronunciations or learn highly complex context-dependent rules (#27).
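To see why memorisation is needed for English, compare a single spelling rule with a lexicon lookup on the four -ough words from the table above. The transcriptions come from that table; the onset mapping, the rule, and the function names are toy assumptions that cover only these four words.

```python
# Pronunciations as transcribed in this unit's -ough table.
LEXICON = {"through": "θruː", "tough": "tʌf", "though": "ðəʊ", "cough": "kɒf"}

# Hypothetical onset readings, just enough for the four words.
ONSETS = {"thr": "θr", "th": "ð", "t": "t", "c": "k"}

def single_rule_g2p(word):
    """Naive rule: read every final -ough as /ʌf/ (the 'tough' pattern)."""
    onset = word[: -len("ough")]
    return ONSETS[onset] + "ʌf"

def lexicon_g2p(word):
    """Lexicon lookup: stored pronunciation, or None if the word is unknown."""
    return LEXICON.get(word)

for word, gold in LEXICON.items():
    print(word, single_rule_g2p(word), single_rule_g2p(word) == gold)
```

The single rule is right for tough only. The lexicon is right for all four words but returns `None` for any word it has not stored, which is the trade-off between memorising and generalising.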

Check Your Understanding

Select the best answer for each question.

English plurals are spelled with -s in all cases (cats, dogs, horses). Which concept explains why the spoken forms differ despite the uniform spelling?

#23: surface forms are related to underlying abstract morpheme sequences
#24: the form of a morpheme can be sensitive to its phonological context
#26: suppletive forms replace a stem+affix combination with a wholly different word
#27: alphabetic and syllabic writing systems reflect some but not all phonological processes

The English verb 'go' has past tense 'went', which shares no phonological material with 'go'. Which concept describes this?

#23: the morphophonology describes how surface forms relate to underlying morpheme sequences
#24: the form of a morpheme can be sensitive to its phonological context
#25: the form of a morpheme can be sensitive to its morphological context
#26: suppletive forms replace a stem+affix combination with a wholly different word

AI Dimension

NLP systems — including large language models — are trained primarily on written text, not on spoken language. This creates a systematic gap between what they learn and what the underlying linguistic structure actually is. Four issues follow directly from this unit:

Speech recognition and allomorphy: automatic speech recognition (ASR) systems must model phonological variation in morpheme forms. The plural morpheme, the past tense morpheme, and agreement morphemes all have multiple allomorphs (#24). A system that learns from audio must learn to map these distinct phonological surface forms to the same underlying morpheme, a non-trivial task complicated by accent and dialect variation.
Text-to-speech and grapheme-to-phoneme conversion: generating natural speech from text requires converting spellings to phonological representations. For deep orthographies like English, this conversion is far from trivial (#27). English G2P systems must handle thousands of irregular correspondences. For shallow orthographies (Finnish, Spanish), the task is simpler, but a system trained on one language cannot be transferred directly to another without language-specific phonological knowledge.
LLMs and surface form bias: LLMs are trained on written text and learn surface orthographic patterns, not underlying phonological representations. They see the uniform -s spelling of English plurals and must infer the phonological alternation indirectly from context. This works well for text generation tasks, but it means that LLMs have no direct model of the spoken language, which is a serious limitation for speech-based applications.
Dialect and accent variation: allomorphic variants differ by dialect, and some dialects have alternations that the standard variety lacks (e.g. in some Northern British dialects, the definite article has a dialect-specific allomorph before vowels). NLP systems trained on standard written English may fail on text that reflects non-standard pronunciations, because the surface realisations of morphemes differ from the standard allomorphs the system has learned (#23, #24).

The deepest lesson from morphophonology for AI is that written language is an imperfect proxy for the underlying linguistic system. Systems that treat orthographic form as the ground truth will systematically miss the phonological structure that motivates the patterns they observe.

Activities

Individual task — Allomorph classification

For each of the following English words, identify the surface form of the plural/past tense morpheme and answer the questions below:

brushes / leapt / mice / walked / oxen / begged

State the surface form of the plural or past tense morpheme in each word (write the phonological form, not just the spelling — e.g. /s/, /z/, /ɪz/ for plurals; /t/, /d/, /ɪd/ for past tense).
Classify each allomorph as phonologically conditioned (can be derived by rule from the phonological context) or morphologically conditioned (specific to the lexical item, not predictable from phonology).
For any morphologically conditioned allomorph, state whether a rule can derive the surface form or whether lexical storage is required. Where suppletion applies (#26), identify it explicitly.

Pair task — Turkish vowel harmony investigation

Using an online Turkish dictionary or language resource, find three examples of Turkish nouns with their plural forms, choosing roots with different vowel qualities (aim for at least one front-vowel root and one back-vowel root).

For each example:

State the root form and its vowel(s), and identify the plural suffix allomorph used (-ler or -lar).
Write out the underlying suffix form (the abstract morpheme) and the surface form (what actually appears), following the format of the Turkish worked examples above (e.g. ev-ler, house-PL).
Discuss: how would a machine translation system that translates from English to Turkish need to handle vowel harmony? What information does it need about the noun root, and when in the generation process does it need that information? Draw on concept #24.

Group task — Writing systems and phonological depth

As a group, compare how three writing systems relate to their phonology. Assign one language per group member (or choose your own):

English (alphabetic, deep orthography)
Finnish (alphabetic, shallow orthography)
Arabic (abjad: consonants and long vowels are written; short vowels are omitted in standard script)

For your assigned language:

Classify the orthography as shallow (close grapheme–phoneme correspondence) or deep (imperfect correspondence with many irregularities). Provide two specific examples of the correspondence (or its failure).
Describe what this means for a grapheme-to-phoneme converter for that language: is it a simple lookup, a rule-based system, or something more complex?
Link your analysis to concept #27 and, where relevant, to the writing systems table from Unit 1.

After each member has presented their language, discuss as a group: which orthographic type presents the greatest challenge for NLP applications that bridge text and speech?

Review

#23 — Morphophonology describes how surface forms are related to underlying abstract sequences of morphemes; the same underlying morpheme may have multiple surface realisations.
#24 — The form of a morpheme can be sensitive to its phonological context (phonologically conditioned allomorphy); e.g. English plural /s~z~ɪz/, Turkish vowel harmony.
#25 — The form of a morpheme can be sensitive to its morphological context (morphologically conditioned allomorphy); e.g. English past tense ring → rang, German noun plural classes.
#26 — Suppletive forms replace a stem+affix combination with a wholly different word (e.g. go → went, good → better); suppletion requires lexical storage, not rule computation.
#27 — Alphabetic and syllabic writing systems reflect some but not all phonological processes; English spelling preserves morphological identity at the expense of phonological accuracy.

Allomorphy is the phenomenon in which a single underlying morpheme is realised as two or more distinct surface forms (allomorphs), depending on context. It is the empirical signature of the distinction between underlying and surface representations.

Phonologically conditioned allomorphy — the allomorph is determined by the surrounding sounds; it follows a rule and can be computed from phonological properties. Examples: English plural /s~z~ɪz/; English indefinite article a~an; Turkish vowel harmony.
Morphologically conditioned allomorphy — the allomorph is determined by the specific lexical item; it cannot be computed from phonology and must be stored. Examples: English ring → rang; German noun plural classes.
Suppletion — the extreme case: the allomorph is phonologically unrelated to the base. Examples: English go → went; good → better; Latin fero → tuli.
Writing systems hide alternations — orthographic uniformity (e.g. always spelling the plural as -s) conceals the phonological variation from text-based NLP systems. Speech processing requires phonological knowledge that text processing does not.

For NLP, the practical consequence is a three-way architecture requirement: phonological rules (for phonologically conditioned allomorphy), lexical storage (for morphologically conditioned allomorphy and suppletion), and orthographic models (for the gap between spelling and sound).

Proceed to Unit 4: Syntax when ready.
