Unit 2: Morphology
Learning Objectives
- Define morpheme and identify examples of free, bound, inflectional, and derivational morphemes
- Explain how non-contiguous, null, and suprasegmental morpheme forms challenge NLP tokenisation
- Distinguish between isolating, agglutinative, and fusional languages and describe their NLP implications
- Analyse morphological ambiguity and its consequences for automatic text processing
Reading
Read Chapter 2 (Morphology) of Bender, E. M. (2013). Linguistic Fundamentals for Natural Language Processing: 100 Essentials from Morphology and Syntax. Morgan & Claypool. Use the course materials below to activate and consolidate the concepts from that chapter.
Core Input
Take notes on the key ideas in each part before moving to the activities below.
A morpheme is the smallest meaningful unit of language — a pairing of form (a sequence of sounds or phones) with meaning. Words are built from morphemes, and understanding this internal structure is essential for many NLP tasks.
Morphemes are classified by their distribution:
- Free morphemes can stand alone as words: cat, run, happy. They carry core lexical meaning.
- Bound morphemes must attach to another morpheme: -s (plural), un- (negation), -ing (progressive). They cannot occur independently.
Morphemes are not always the neat sequences of phonemes that English examples suggest. The form of a morpheme can be non-contiguous (spread across a word, as in Arabic roots), suprasegmental (expressed through tone, as in Mandarin), or even null (a zero form that nonetheless signals grammatical information, as in English sheep/sheep for the plural). These properties are covered in the concepts below.
NLP implication: a tokeniser that simply splits on whitespace treats each space-separated string as an atomic unit. It misses the internal structure of words — and therefore misses grammatically and semantically relevant distinctions encoded in that structure. Morphological analysis is often a prerequisite for downstream tasks such as parsing, machine translation, and information extraction.
Morphemes that attach to roots can be classified into two major functional types:
Derivational morphemes (concepts #12–#13) change the lexical meaning of a word, and often its grammatical category:
- happy (adjective) → happiness (noun) via -ness
- teach (verb) → teacher (noun) via -er
- possible (adjective) → impossible (adjective) via im-
Derivationally complex words can have idiosyncratic meanings — the meaning of blackbird cannot be fully predicted from black + bird (#13). NLP systems that attempt to decompose such words compositionally will generate incorrect representations.
Inflectional morphemes (concept #14) add grammatically or semantically relevant features without changing the core lexical meaning or category:
- English -ed marks past tense: walked, jumped
- English -s marks third person singular present: she walks
- Spanish hablo / hablas / habla — verb agrees with subject in person and number
NLP implication: Lemmatisation (reducing a word to its dictionary form) requires stripping inflectional affixes: walked → walk. Stemming (cruder, rule-based reduction) strips affixes mechanically. Both tasks require an implicit model of inflectional morphology. Errors in lemmatisation propagate to every downstream task that uses lemmas as features.
Languages vary dramatically in how much morphological information they pack into a single word (#20), and how that information is structured (#21, #22). Four broad morphological types are commonly recognised:
- Isolating languages (e.g. Mandarin Chinese, Thai) — words tend to consist of a single morpheme; grammatical information is carried by separate words or word order. NLP challenge: word segmentation (no whitespace), but morphology itself is relatively simple.
- Agglutinative languages (e.g. Turkish, Finnish, Swahili) — words are built from sequences of morphemes with clear boundaries; each affix encodes one grammatical feature. NLP challenge: a single word may encode what English expresses as a full clause; fixed-vocabulary models face severe sparsity.
- Fusional languages (e.g. Russian, Latin, Arabic) — affixes fuse multiple grammatical features into a single form; boundaries between morphemes are often blurred. NLP challenge: morpheme segmentation is non-trivial; the same form may encode several features simultaneously.
- Polysynthetic languages (e.g. Inuktitut) — a single word may incorporate what would be a full sentence in other languages; morpheme-per-word counts can exceed ten. NLP challenge: existing tokenisation and vocabulary strategies are essentially inapplicable.
Overarching NLP lesson: NLP tools designed for English — a mildly fusional, moderately analytic language — do not transfer without modification to highly agglutinative or polysynthetic languages. Typological awareness is a prerequisite for multilingual NLP.
Key Concepts A: Morphemes (#7–#15)
Expand each concept. Consider the NLP implication before reading the explanation.
A morpheme is a form–meaning pairing. The word cats contains two morphemes: cat (the animal) and -s (plurality). Neither can be divided further without destroying the form–meaning relationship.
Note the qualification: a morpheme's form is usually, but not always, a sequence of phones. The definition accommodates edge cases (non-contiguous roots, tonal contrasts, null forms) that the simple phone-sequence prototype does not cover; these are addressed in #8–#10.
NLP implication: NLP systems that treat cats and cat as completely unrelated tokens miss the systematic relationship encoded by morphology. Recognising morpheme boundaries is the first step toward lemmatisation, morphological tagging, and linguistically informed feature extraction.
In some languages, a single morpheme is realised as a discontinuous sequence of sounds distributed across a word. The classic example is the Arabic triliteral root system.
The root k-t-b carries the abstract meaning "write/writing". Vowels (and sometimes consonants) are interleaved to create distinct word forms:
| Form | Gloss | Root consonants |
|---|---|---|
| kataba | he wrote | k_t_b |
| kitāb | book | k_t_b |
| kātib | writer | k_t_b |
| maktaba | library | _kt_b_ |
NLP implication: morphological analysers designed for concatenative languages (where morphemes are simple linear sequences) cannot handle non-contiguous roots. Segmenting Arabic requires a fundamentally different architecture that recognises the root–pattern template system.
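The root-and-pattern idea can be made concrete with a small sketch. It is illustrative only: the template notation (`C` marks a slot for a root consonant) and the function names `interdigitate` and `extract_root` are assumptions introduced here, not part of any Arabic NLP library, and the forms are the romanised ones from the table above rather than Arabic script.

```python
import unicodedata
from typing import Optional

def interdigitate(root: str, template: str) -> str:
    """Fill each 'C' slot in the template with the next root consonant."""
    consonants = iter(root)
    template = unicodedata.normalize("NFC", template)
    return "".join(next(consonants) if ch == "C" else ch for ch in template)

def extract_root(form: str, template: str) -> Optional[str]:
    """Return the consonants in the 'C' slots, or None if the form does not fit."""
    form = unicodedata.normalize("NFC", form)
    template = unicodedata.normalize("NFC", template)
    if len(form) != len(template):
        return None
    root = []
    for f, t in zip(form, template):
        if t == "C":
            root.append(f)      # a root consonant sits in this slot
        elif f != t:
            return None         # the fixed pattern material does not match
    return "".join(root)

# Hypothetical template inventory covering the four forms in the table.
TEMPLATES = ["CaCaCa", "CiCāC", "CāCiC", "maCCaCa"]

for template in TEMPLATES:
    form = interdigitate("ktb", template)
    print(template, "->", form, "| root recovered:", extract_root(form, template))
```

An analyser that only strips material from the edges of a string has no operation equivalent to `extract_root`: the root consonants here are never adjacent to one another as a removable prefix or suffix.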
In some languages, morphological distinctions are encoded not in a sequence of phones but in suprasegmental features such as tone. Mandarin Chinese provides the canonical example: the same segmental sequence ma has four distinct lexical meanings depending on which of the four tones is used:
| Form | Tone | Meaning |
|---|---|---|
| mā | Tone 1 (high-level) | mother |
| má | Tone 2 (rising) | hemp |
| mǎ | Tone 3 (dipping) | horse |
| mà | Tone 4 (falling) | scold |
Here the tonal contour is itself a morpheme (or rather, a component of lexical form) — and it is not a sequence of phones.
NLP implication: text-based NLP for Mandarin typically uses characters, and tonal diacritics (pīnyīn romanisation) are rarely present in natural running text. The tonal distinction is therefore usually invisible to standard text-processing pipelines. Speech processing requires explicit tonal modelling.
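To see why the contrast disappears in toneless text, the following sketch strips the tone diacritics from the four pīnyīn syllables in the table using Python's standard `unicodedata` module. The function name `strip_tone_marks` is made up for this illustration, and the approach is deliberately crude: it removes every combining mark, so it would also turn ü into u.

```python
import unicodedata

def strip_tone_marks(pinyin: str) -> str:
    """Decompose each character and drop all combining marks (tone diacritics)."""
    decomposed = unicodedata.normalize("NFD", pinyin)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

syllables = {"mā": "mother", "má": "hemp", "mǎ": "horse", "mà": "scold"}

# Four different lexical items collapse onto one toneless string.
collapsed = {}
for syllable, meaning in syllables.items():
    collapsed.setdefault(strip_tone_marks(syllable), []).append(meaning)

print(collapsed)
```

In character-based text the four words are written with different characters, so the ambiguity shown here arises mainly when working from toneless romanisation or from speech.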
A zero morpheme (or null morpheme) is a morpheme with no phonological form — it encodes grammatical information, but nothing is pronounced or written in its place.
English examples:
- sheep (singular) / sheep (plural) — the plural is marked by a zero morpheme, not by the addition of -s. The plural meaning is present; the phonological form is absent.
- I run — English first-person singular present tense has no agreement suffix (contrast third-person she runs). The absence of a suffix is the zero morpheme.
NLP implication: a system that learns morphological patterns from observed forms will not learn zero morphemes — there is nothing to observe. Yet the grammatical information they encode must still be recovered. Rule-based morphological analysers handle this explicitly; neural systems must learn it implicitly from contextual patterns.
The root is the morpheme that carries the central, irreducible lexical content of a word. All other morphemes attach to it (or are interleaved with it, in non-concatenative systems).
Examples: cat in cat / cats / catfish / catlike; Turkish git- (go) in gidiyorum / gitmedim / gitmek.
NLP implication: identifying the root of a word is the goal of stemming — reducing all surface forms of a word to a common representation. Stemming is used in information retrieval to improve recall: a search for run should also retrieve documents containing runs and running.
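A minimal sketch of how stemming raises recall in retrieval, under strong simplifying assumptions: `crude_stem` is a hypothetical three-suffix stripper written for this example (it is not the Porter stemmer or any library function), and the three documents are invented.

```python
from collections import defaultdict

SUFFIXES = ("ing", "ed", "s")

def crude_stem(word: str) -> str:
    """Strip one inflectional suffix; undo simple consonant doubling."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)]
            # running -> runn -> run
            if stem[-1] == stem[-2] and stem[-1] not in "aeiouls":
                stem = stem[:-1]
            return stem
    return word

def build_index(docs: dict[str, str]) -> dict[str, set[str]]:
    """Map each stem to the set of documents that contain a form of it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            index[crude_stem(token.strip(".,;:!?"))].add(doc_id)
    return index

docs = {
    "d1": "She runs every morning.",
    "d2": "Running is fun.",
    "d3": "They ran home.",
}
index = build_index(docs)
print(sorted(index[crude_stem("run")]))  # ['d1', 'd2']
```

The query for run now finds runs and running, but not the irregular form ran in d3: suffix stripping alone cannot relate forms that differ inside the root.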
Derivational morphology creates new lexical items from existing ones. It frequently changes the grammatical category of the base, as well as its meaning:
| Base | Category | Derived form | Category | Affix |
|---|---|---|---|---|
| happy | Adjective | happiness | Noun | -ness |
| teach | Verb | teacher | Noun | -er |
| possible | Adjective | impossible | Adjective | im- |
NLP implication: part-of-speech tagging and lemmatisation must model derivational morphology. A word-form lookup approach will fail for novel derivationally complex forms, which can in principle be created productively (a speaker can coin unhappifiable and be understood). Morphological analysers must generalise across productive derivational patterns.
Even when every component morpheme is known, the meaning of a derived word or compound may not be compositionally predictable:
- blackbird ≠ a bird that is black (it is a specific species, Turdus merula)
- hotdog ≠ a dog that is hot (it is a type of food)
- understanding ≠ the act of standing under something
These forms are lexicalised: their meaning has diverged from what strict compositional analysis would predict.
NLP implication: a purely compositional morphological analyser will generate wrong glosses for lexicalised compounds and derivatives. NLP systems need access to lexical entries for such forms, rather than relying on rule-based decomposition alone.
Inflectional morphology adds features such as tense, aspect, number, gender, case, person, or mood to an existing word, without creating a new lexical item. The resulting forms all share the same lemma and basic meaning.
English examples: walk / walks / walked / walking — all forms of the verb walk. Spanish: hablo / hablas / habla / hablamos / habláis / hablan — all forms of hablar (to speak), inflected for person and number.
Inflectional features are syntactically required by the grammar: a finite English verb must be inflected for tense; a Spanish verb must agree with its subject in person and number.
NLP implication: morphological tagging assigns inflectional feature values to word tokens. This is a prerequisite for syntactic parsing in morphologically rich languages, where inflectional information determines grammatical role. Lemmatisation must strip inflectional affixes correctly to group related forms under a single representation.
A single morpheme form can encode multiple, distinct grammatical meanings. A form is ambiguous when it corresponds to several distinct morphemes and context must select among them; it is underspecified when it genuinely does not differentiate between possible values.
The English suffix -s is famously ambiguous across three distinct morphemes:
- cats — plural noun marker
- she walks — third person singular present tense
- John's cat — genitive/possessive marker
Mandarin Chinese illustrates underspecification rather than ambiguity: there is no tense morphology, so verbs do not indicate whether an event is past, present, or future. The temporal interpretation must be inferred from context.
NLP implication: morphological ambiguity must be resolved before many downstream tasks can proceed correctly. Part-of-speech disambiguation and morphological tagging perform this resolution using contextual information. Underspecification in the source language can cause problems for machine translation: translating from Mandarin to English requires choosing a tense for every verb — information that is simply absent from the source.
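The ambiguity of -s can be illustrated with a toy analyser that looks at one token in isolation. The function `s_analyses` and its gloss labels are assumptions made for this sketch. It has no lexicon and no context, so all it can do is list candidates, which is exactly the situation a tagger has to resolve.

```python
def s_analyses(token: str) -> list[str]:
    """List candidate analyses for an English token ending in -s or 's."""
    if token.endswith("'s"):
        base = token[:-2]
        return [f"{base}+GEN", f"{base} + is", f"{base} + has"]
    if token.endswith("s") and len(token) > 2:
        base = token[:-1]
        return [f"{base}+PL (noun)", f"{base}+3SG.PRES (verb)"]
    return [token]

for token in ["walks", "John's", "cat"]:
    print(token, "->", s_analyses(token))
```

For walks both candidates are real English readings (long walks, she walks), so only the surrounding words can decide between them.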
Key Concepts B: Words, Boundaries, and Typology (#16–#22)
Continue with the remaining core concepts for this unit.
In English, the word seems like a natural unit: it is a string of characters between spaces in the written language. But this orthographic convention does not reflect a universal linguistic reality.
- Turkish gidiyorum ("I am going") encodes subject, tense, aspect, and the verb in a single orthographic word — it could reasonably be analysed as a phrase.
- French je t'aime ("I love you") involves a clitic t' that some analyses treat as a separate word, others as a prefix on aime.
- Mandarin Chinese is written with no whitespace between characters; "word" boundaries must be inferred.
NLP implication: word tokenisation — the first step in virtually every NLP pipeline — is a non-trivial task in many languages. The unit of analysis (character, morpheme, word, phrase) must be chosen deliberately for each language and task.
Within a word, morpheme order is typically fixed by the grammar: you cannot permute the affixes on a Turkish verb without producing an ungrammatical or meaningless form. Between words in a sentence, ordering constraints are often more flexible (particularly in case-marking languages where role is indicated by the morpheme, not the position).
English example: adjective order within a noun phrase is relatively fixed (big old red Italian sports car: a specific ordering must be respected), yet adverbials can move more freely within a sentence. At the morpheme level, order is stricter still: in teach-er-s the derivational suffix -er must precede the inflectional suffix -s, and *teach-s-er is not a possible English word.
NLP implication: a language model that learns word-level co-occurrence patterns does not automatically learn morpheme-level ordering constraints within words. Sub-word models (BPE, WordPiece) approximate morpheme boundaries but do not enforce morphotactic constraints.
Over time, free words can become bound morphemes — a process called grammaticalisation. What starts as an independent word becomes phonologically reduced and grammatically dependent.
Examples:
- English -ly (adverb suffix in quickly, happily) derives from Old English -līce, itself from līc (meaning "body, form"). A free noun has become a bound morpheme over several centuries.
- English going to → gonna: a phrase is in the process of grammaticalising into a tense/aspect marker. In spoken English, this is already functionally a bound morpheme in informal registers.
NLP implication: NLP tools trained on contemporary data may behave differently on historical texts, where a form that is now a bound morpheme was still a free word. Historical NLP and digital humanities applications must account for diachronic morphological change.
A clitic occupies an intermediate position between a free morpheme (an independent word) and a bound morpheme (an affix). It has its own syntactic function and meaning, but it cannot bear stress and must attach phonologically to an adjacent word — its host.
English examples:
- 've in I've — a clitic form of have
- 's in John's — a clitic marking genitive (but also a form of is/has)
- n't in don't, can't — a clitic form of not
Spanish object clitics are another example: lo veo ("I see him/it") and véalo ("see him/it!") — the clitic lo attaches in different positions in declarative vs imperative contexts.
NLP implication: clitics challenge tokenisation. The form don't must be split into do + n't for correct morphological and syntactic analysis, but a simple whitespace tokeniser will treat it as a single token. Clitic handling is a standard preprocessing step in many NLP pipelines.
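A sketch of clitic splitting as a preprocessing step, assuming a small hand-written list of English clitics and straight ASCII apostrophes. The names `split_clitics` and `tokenise` are invented for this example; production tokenisers use larger rule sets and exception lists.

```python
import re

# Clitic at the end of a token: n't, or an apostrophe plus ve/s/re/ll/d/m.
CLITIC = re.compile(r"(n't|'(?:ve|s|re|ll|d|m))$", re.IGNORECASE)

def split_clitics(token: str) -> list[str]:
    """Split a token into host + clitic if it ends in a known clitic."""
    match = CLITIC.search(token)
    if not match or match.start() == 0:
        return [token]
    return [token[: match.start()], match.group(1)]

def tokenise(text: str) -> list[str]:
    """Whitespace tokenisation followed by clitic splitting."""
    tokens = []
    for token in text.split():
        tokens.extend(split_clitics(token))
    return tokens

print(tokenise("I've seen John's cat but don't ask"))
```

Note that the split says nothing about which 's was found: John's could be genitive or a reduced is/has, so the ambiguity is passed on to the tagger. With these rules can't comes out as ca + n't, the same convention used by Penn Treebank-style tokenisers.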
The morpheme-per-word ratio is a key typological parameter that strongly influences how many distinct word forms appear in a language's text:
| Language | Type | Avg. morphemes/word | Example |
|---|---|---|---|
| Mandarin Chinese | Isolating | ~1 | wǒ (I) — 1 morpheme |
| English | Fusional | ~1.5 | walked — 2 morphemes |
| Turkish | Agglutinative | ~4 | evlerinizden (from your houses) — 4 morphemes |
| Inuktitut | Polysynthetic | 10+ | Single words equivalent to English sentences |
Turkish evlerinizden = ev-ler-iniz-den: house-PL-2PL.POSS-ABL "from your houses"
NLP implication: a fixed-vocabulary model trained on Turkish text will encounter vast numbers of unseen word forms — because any verb or noun root can in principle combine with dozens of suffixes, producing thousands of distinct forms. Subword models (BPE, WordPiece) mitigate this by splitting unknown words into known subword units, but they approximate — not replicate — morphological structure.
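The combinatorics can be sketched for the single noun root ev. The suffix inventory below is a deliberately tiny, hand-picked assumption: it hard-codes the suffix shapes that fit this particular root, ignores vowel harmony, and omits third-person possessives and many other slots. The real number of forms is far larger.

```python
from itertools import product

ROOT = "ev"  # "house"

# One optional suffix per slot; the empty string means the slot is unfilled.
NUMBER = {"": "", "ler": "PL"}
POSSESSIVE = {"": "", "im": "1SG.POSS", "in": "2SG.POSS",
              "imiz": "1PL.POSS", "iniz": "2PL.POSS"}
CASE = {"": "", "i": "ACC", "e": "DAT", "de": "LOC", "den": "ABL", "in": "GEN"}

def generate_forms(root: str) -> dict[str, list[str]]:
    """Map every surface form to the list of glosses that produce it."""
    forms: dict[str, list[str]] = {}
    slots = product(NUMBER.items(), POSSESSIVE.items(), CASE.items())
    for (num, num_g), (poss, poss_g), (case, case_g) in slots:
        form = root + num + poss + case
        gloss = "-".join(g for g in ("house", num_g, poss_g, case_g) if g)
        forms.setdefault(form, []).append(gloss)
    return forms

forms = generate_forms(ROOT)
print(len(forms), "surface forms,", sum(len(g) for g in forms.values()), "analyses")
print(forms["evlerinizden"])
print(forms["evin"])
```

Three small slots already yield 58 distinct surface forms from one root, and two of them (evin, evlerin) receive two analyses each, a reminder that sparsity and ambiguity (#15) arise together.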
Cross-linguistically, suffixing is far more common than prefixing as the dominant mode of affixation; this is one of the most robust typological tendencies. Most languages that use affixes at all attach them predominantly after the root.
Examples:
- Suffixing majority: Turkish, Finnish, Japanese
- Prefixing tendency: many Bantu languages (e.g. Swahili verb prefixes for subject agreement: a-na-soma = 3SG-PRES-read, "he/she reads")
- Mixed: English uses both (prefix un-, suffix -ed); Arabic uses both prefixes and suffixes alongside its non-concatenative root system
NLP implication: morphological analysers must be configured for the directionality of the target language. A suffix-stripping stemmer designed for English will not work for a Bantu language where grammatical information is primarily prefixal.
The clarity of morpheme boundaries varies substantially across morphological types:
- Agglutinative (clear boundaries): Turkish ev-ler-im-den (house-PL-1SG.POSS-ABL) — each suffix encodes one feature; boundaries are unambiguous.
- Fusional (fused forms): Latin -at in amat ("he/she loves") — this single suffix encodes third person, singular, present tense, active voice, and indicative mood simultaneously. There is no way to separate these features into distinct segments.
- Non-concatenative (no linear boundaries): Arabic roots interleaved with vowel patterns — the "boundary" between root and pattern does not exist in linear sequence.
NLP implication: morpheme segmentation algorithms tend to perform better on agglutinative languages than on fusional and non-concatenative ones. BPE and other subword algorithms learn from data and therefore approximate only the boundaries that are actually present; they work best when those boundaries are consistent and learnable.
Worked Examples: Morphology across Languages
Examine how morphological processes work in four typologically different languages, and consider the challenge each poses for NLP.
English morphology is a mixture of regular (rule-derivable) and irregular (lexically stored) forms. NLP must handle both.
| Process | Regular examples | Irregular examples |
|---|---|---|
| Past tense | walked, jumped, played | went (go), sang (sing), held (hold) |
| Plural | cats, dogs, horses | children (child), sheep (sheep), mice (mouse) |
| Comparative | taller, faster | better (good), worse (bad) |
NLP approach: regular forms can be handled by rule (or learned rule); irregular forms must be stored in a lexicon. Most production lemmatisers combine both strategies. A purely rule-based stemmer will generate wrong lemmas for irregular forms; a purely lookup-based system will fail on novel derived words.
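The combined strategy can be sketched as a lexicon lookup followed by fallback rules. Everything here is a simplified assumption: the `IRREGULAR` dictionary holds only the irregular forms from the table above, the three suffix rules ignore spelling changes (silent e, consonant doubling, -ies), and regular comparatives are left out.

```python
# Irregular forms must be listed; they cannot be derived by rule.
IRREGULAR = {
    "went": "go", "sang": "sing", "held": "hold",
    "children": "child", "sheep": "sheep", "mice": "mouse",
    "better": "good", "worse": "bad",
}

# Fallback rules for regular inflection, tried in order.
SUFFIX_RULES = ("ed", "ing", "s")

def strip_suffix(word: str) -> str:
    """Rule component: remove one regular inflectional suffix."""
    for suffix in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def lemmatise(word: str) -> str:
    """Lexicon first, rules second."""
    word = word.lower()
    if word in IRREGULAR:
        return IRREGULAR[word]
    return strip_suffix(word)

for word in ["walked", "horses", "went", "mice", "children"]:
    print(word, "| rules only:", strip_suffix(word), "| combined:", lemmatise(word))
```

The rules-only column leaves went, mice, and children unchanged, while the lexicon alone would have nothing to say about a form it has never seen. Each component covers the other's gap.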
Turkish is a prototypical agglutinative language. Suffixes are added in a fixed order, each encoding a single grammatical feature, with clear morpheme boundaries.
Example 1 — Negated past tense verb:
git-me-di-m
go-NEG-PAST-1SG
“I did not go.”
Example 2 — Noun with case and possessive:
ev-ler-im-den
house-PL-1SG.POSS-ABL
“From my houses.”
NLP implication: the open-ended combinatorial productivity of Turkish morphology means that vocabulary size is effectively unbounded (#20). A neural model with a fixed vocabulary of, say, 50,000 tokens will encounter vast numbers of out-of-vocabulary forms. BPE (Byte Pair Encoding) subword tokenisation handles this by splitting unknown words into known subword units — but BPE splits are learned statistically, not morphologically, so they approximate rather than replicate morpheme boundaries (#22).
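To make "learned statistically, not morphologically" concrete, here is a toy byte-pair-encoding learner. It is a teaching sketch under stated assumptions, not the implementation used by any particular library: the corpus is a handful of invented Turkish forms, ties between equally frequent pairs are broken alphabetically, and there are no end-of-word markers.

```python
from collections import Counter

def merge_pair(symbols: tuple, pair: tuple) -> tuple:
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)

def learn_bpe(words: list, num_merges: int) -> list:
    """Repeatedly merge the most frequent adjacent symbol pair."""
    vocab = Counter(tuple(word) for word in words)  # start from characters
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges

def segment(word: str, merges: list) -> list:
    """Apply the learned merges, in order, to a new word."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)

corpus = ["ev", "evim", "evde", "evden", "evler", "evlerim", "evlerden"] * 3
merges = learn_bpe(corpus, num_merges=8)
print(merges)
print(segment("evlerinizden", merges))
```

Compare the printed segmentation with ev-ler-iniz-den. Nothing in the algorithm refers to morphemes: the splits depend only on which character pairs happen to be frequent in the training data, so changing the corpus or the number of merges changes the segmentation.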
Arabic morphology is non-concatenative: words are formed by interleaving a consonantal root with a vowel pattern, rather than attaching affixes linearly (#8).
Root d-r-s (related to study/teaching):
darasa
d_r_s + a-a-a pattern
“He studied.”
darrasa
d_rr_s + a-a-a pattern (with gemination)
“He taught (others).”
dirāsa
d_r_s + i-ā-a pattern
“Study (noun).”
madrasa
prefix ma- + d_r_s + a-a pattern
“School.”
Additional challenge: standard written Arabic typically omits short vowels. The root consonants appear, but the vowel pattern — which distinguishes the word forms above — is absent. A reader (or NLP system) must infer the correct form from context. This makes Arabic NLP especially sensitive to morphological ambiguity (#15).
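The effect of omitting diacritics can be shown directly on two of the forms above. The sketch assumes fully vowelled spellings of darasa and darrasa, written as Unicode escapes so the code points are explicit, and it treats the gemination mark (shadda) like the short-vowel marks because it is likewise usually left out of unvowelled text. The helper name `strip_diacritics` is hypothetical.

```python
import unicodedata
from collections import defaultdict

def strip_diacritics(word: str) -> str:
    """Drop all non-spacing marks (short vowels, shadda), keeping the letters."""
    return "".join(ch for ch in word if unicodedata.category(ch) != "Mn")

# dal, reh, seen with fatha (U+064E) and, for darrasa, shadda (U+0651).
VOWELLED = {
    "\u062F\u064E\u0631\u064E\u0633\u064E": "darasa 'he studied'",
    "\u062F\u064E\u0631\u0651\u064E\u0633\u064E": "darrasa 'he taught'",
}

readings = defaultdict(list)
for form, gloss in VOWELLED.items():
    readings[strip_diacritics(form)].append(gloss)

for skeleton, glosses in readings.items():
    print(skeleton, "->", glosses)
```

Both words reduce to the same three-letter string, so a system reading ordinary unvowelled text sees one token where the vowelled forms distinguish two verbs.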
Hindi marks grammatical gender (masculine/feminine) on nouns, adjectives, and verbs. Unlike biological sex, grammatical gender is a lexically assigned category that must be learned for each noun individually.
Adjective–noun agreement in gender:
baṛā laṛkā
big.MASC boy.MASC
“A big boy.”
baṛī laṛkī
big.FEM girl.FEM
“A big girl.”
The adjective baṛā/baṛī ("big") changes form to agree with the gender of the noun. Gender is lexically assigned to nouns (laṛkā "boy" is masculine, laṛkī "girl" is feminine), and agreement is obligatory throughout the noun phrase and into the verb phrase.
NLP implication: machine translation from English to Hindi must assign a grammatical gender to every noun, but English nouns have no grammatical gender. The MT system must infer gender from world knowledge or context, and generate correctly agreeing adjectives and verb forms. Gender agreement errors are a recurring source of translation failures in English-to-Hindi MT.
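The agreement step that a generation system has to get right can be sketched for the two noun phrases in this example. The gender lexicon `NOUN_GENDER` and the function `agree` are assumptions for illustration; the rule covers only singular, direct-case adjectives ending in -ā and says nothing about plural or oblique forms.

```python
import unicodedata

A_LONG = "\u0101"  # ā
I_LONG = "\u012b"  # ī

def nfc(text: str) -> str:
    """Normalise so that accented letters are single code points."""
    return unicodedata.normalize("NFC", text)

# Gender is stored per noun because it cannot be computed from meaning.
NOUN_GENDER = {nfc("laṛkā"): "MASC", nfc("laṛkī"): "FEM"}

def agree(adjective: str, noun: str) -> str:
    """Return the adjective form that agrees with the noun's gender."""
    adjective = nfc(adjective)
    gender = NOUN_GENDER[nfc(noun)]  # KeyError if the noun is not in the lexicon
    if gender == "FEM" and adjective.endswith(A_LONG):
        return adjective[:-1] + I_LONG
    return adjective

for noun in ["laṛkā", "laṛkī"]:
    print(agree("baṛā", noun), noun)
```

The lookup fails for any noun missing from the lexicon, which mirrors the underlying problem: without a stored gender for the target noun there is no principled way to choose the adjective form.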
Check Your Understanding
Select the best answer for each question.
A standard whitespace tokeniser splits text at spaces. For which language would this be most problematic as a first processing step?
A neural language model uses a fixed vocabulary and treats 'walk', 'walks', 'walked', and 'walking' as four separate tokens. Which concept best explains why this approach fails for morphologically rich languages like Turkish?
Large language models must tokenise text before processing it. The tokenisation strategy chosen — and the assumptions embedded in it — determines what morphological information the model can and cannot capture. Four failure patterns follow directly from this unit:
- Subword tokenisation as morphological approximation — BPE (Byte Pair Encoding) and WordPiece learn frequent subword units from training data. For English, this often aligns reasonably well with morpheme boundaries. For Turkish and Finnish (#20, #22), it frequently does not: the statistically learned splits diverge from linguistically meaningful ones, creating tokens that do not correspond to any morpheme.
- Vocabulary sparsity in morphologically rich languages — a fixed vocabulary of 50,000–100,000 tokens is adequate for English; for Turkish or Finnish, with morphologically open-ended word formation, the same vocabulary size leaves enormous numbers of valid word forms out-of-vocabulary (#20, #21). Models trained on such vocabularies systematically underrepresent morphologically rich languages.
- Hallucination from morphological complexity — when an LLM encounters a morphologically complex form it has not seen, it may confidently produce an incorrect morphological decomposition or an incorrect translation (#7, #15). Because morphological ambiguity is high (especially in vowel-sparse scripts like Arabic), the model lacks the contextual grounding to disambiguate reliably.
- Cross-linguistic portability — models trained primarily on English learn English morphological assumptions implicitly. Applying them to typologically different languages without modification reproduces the failures described in concept #6 (Unit 1): incorporating morphological knowledge improves cross-linguistic portability.
The practical lesson is not that neural models cannot handle morphology — they can, and do, remarkably well in well-resourced settings. The lesson is that knowing what they assume about morphological structure allows you to predict where they will fail, and to design better evaluation, preprocessing, and fine-tuning strategies.
Activities
Individual task — Morpheme segmentation and labelling
For each of the following English words, segment into morphemes and answer the questions below:
unbreakable / teachers / unhappiness / internationally
- Write out each morpheme separately and label it as free or bound.
- For each bound morpheme, state whether it is derivational or inflectional, and describe the grammatical or semantic function it adds.
- Check whether any of the root + affix combinations produce an idiosyncratic meaning that cannot be predicted compositionally (#13). Note any cases you find.
Compare your segmentation with a classmate's. If you disagree, discuss which analysis is better supported by the linguistic concepts from this unit.
Pair task — Subword tokenisation vs morphological segmentation
Using an online subword tokeniser (e.g. the Hugging Face tokeniser demo for a multilingual model), investigate how a BPE-based tokeniser handles the Turkish word:
evlerinizden (ev-ler-iniz-den: house-PL-2PL.POSS-ABL, "from your houses")
- Record the subword units that the tokeniser produces. Do they align with the morpheme boundaries ev / ler / iniz / den?
- Now try several related forms: evim (my house), evden (from the house), evlerimizden (from our houses). How consistent are the tokeniser's splits?
- Discuss: what are the consequences of mis-aligned subword splits for a machine translation system translating Turkish to English? Consider agreement, case, and possessive information. Draw on concepts #20 and #22.
Group task — Morphological typology comparison table
Each group member should take one of the following languages (or choose your own): Mandarin Chinese, Turkish, Finnish, Russian, or Swahili.
For your language, find or construct one clear morphological example and provide:
- The word form with a morpheme-by-morpheme segmentation and gloss (in the format: form / MORPH-LABEL-MORPH-LABEL / "translation")
- Identification of the morphological type: isolating, agglutinative, fusional, or polysynthetic
- One specific NLP challenge that arises from this language's morphological properties, drawing on at least one of concepts #20, #21, or #22
Compile your individual examples into a group comparison table. As a group, discuss: which morphological type is hardest for current NLP tools, and why?
Review
- #7 — Morphemes are the smallest meaningful units of language, consisting of a form–meaning pair.
- #8 — The phones making up a morpheme don't have to be contiguous (e.g. Arabic roots).
- #9 — The form of a morpheme doesn't have to consist of phones (e.g. tonal morphemes in Mandarin).
- #10 — The form of a morpheme can be null (e.g. zero plural in English sheep).
- #11 — Root morphemes convey core lexical meaning.
- #12 — Derivational affixes can change lexical meaning (and often grammatical category).
- #13 — Root + derivational affix combinations can have idiosyncratic, non-compositional meanings.
- #14 — Inflectional affixes add syntactically or semantically relevant features without changing the core lexical item.
- #15 — Morphemes can be ambiguous and/or underspecified in their meaning (e.g. English -s).
- #16 — The notion 'word' can be contentious in many languages (e.g. Mandarin, Turkish).
- #17 — Constraints on order operate differently between words than between morphemes.
- #18 — The distinction between words and morphemes is blurred by grammaticalisation over time.
- #19 — A clitic is syntactically independent but phonologically dependent on a host.
- #20 — Languages vary in how many morphemes they have per word (isolating to polysynthetic).
- #21 — Languages vary in whether they are primarily prefixing or suffixing; suffixing is cross-linguistically dominant.
- #22 — Languages vary in how easy it is to find morpheme boundaries (clear in agglutinative; fused in fusional).
- Isolating languages (Mandarin, Thai): morphology is minimal; the main NLP challenge is word segmentation, since standard orthography uses no whitespace.
- Agglutinative languages (Turkish, Finnish): morphology is highly productive; vocabulary is open-ended; fixed-vocabulary models face severe sparsity; subword tokenisation approximates but does not replicate morpheme structure.
- Fusional languages (Russian, Latin, Arabic): affixes fuse multiple features; morpheme boundaries are blurred; morphological analysis requires language-specific resources beyond simple suffix-stripping.
- Non-concatenative morphology (Arabic, Hebrew): the root–pattern template system requires fundamentally different processing architectures; standard concatenative models cannot handle it.
- Overarching lesson: NLP tools designed for English generalise poorly to typologically different languages. Morphological awareness — knowing what type of language you are processing — is a prerequisite for designing appropriate tokenisation, lemmatisation, and feature extraction strategies.
Proceed to Unit 3: Morphophonology when ready.