Unit 1: Introduction to Linguistics for NLP
Learning Objectives
- Explain why linguistic knowledge improves NLP system design and error analysis
- Define morphosyntax and distinguish it from a bag-of-words representation
- Describe at least three strategies languages use to express grammatical relationships
- Identify the main approaches to NLP and what linguistic information each relies on
Reading
Read Chapter 1 (Introduction) of Bender, E. M. (2013). Linguistic Fundamentals for Natural Language Processing: 100 Essentials from Morphology and Syntax. Morgan & Claypool. Use the course materials below to activate and consolidate the concepts from that chapter.
Core Input
Read through each tab. Take notes on the key ideas before moving to the activities below.
Language is not a list of words. It is a structured system in which meaning is encoded through form, and form is governed by rules. The same words in a different order can mean something different, or fail to form a sentence at all:
- The dog bit the man. — the dog is the biter
- The man bit the dog. — the man is the biter
- *Bit man the dog the. — ungrammatical in English (asterisk marks ungrammaticality)
Linguists describe language at several levels of analysis, each building on the one before it:
- Phonology — the sound system; how sounds contrast and combine
- Morphology — internal word structure; how meaningful units (morphemes) combine to form words
- Syntax — sentence structure; how words combine into phrases and clauses
- Semantics — meaning; what words, phrases, and sentences denote
- Pragmatics — language in use; how context shapes interpretation
NLP systems operate across all of these levels, whether or not they are designed with them explicitly in mind. A system that ignores linguistic structure is, at best, incomplete.
Natural Language Processing (NLP) is the computational analysis, generation, and transformation of human language. NLP tasks include machine translation, sentiment analysis, named entity recognition, question answering, and speech recognition.
Three broad approaches have shaped the field:
- Rule-based systems — encode linguistic knowledge explicitly as formal rules (grammars, lexicons, pattern matchers). Precise but labour-intensive and brittle when language varies from the assumed norm.
- Statistical systems — learn patterns from large corpora without explicit rules. Flexible and scalable, but dependent on data distribution. A rare but grammatical construction may be invisible to such a system.
- Neural systems — use deep learning to learn representations directly from data. Currently dominant. Highly capable on average-case inputs, but prone to systematic failures that reflect gaps in their implicit linguistic models.
Linguistic knowledge is valuable across all three approaches: for building rules in rule-based systems, designing features in statistical systems, and diagnosing failure modes in neural systems. Concept #0 makes this explicit (concept numbers in this unit follow Bender's numbered essentials): knowing about linguistic structure is important for feature design and error analysis in NLP.
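To make the brittleness of rule-based systems concrete, here is a minimal Python sketch built around one hypothetical hand-written rule: treat any word ending in -ed as a past-tense verb. The rule and the function name are assumptions chosen for illustration; they are not taken from the reading.

```python
import re

# Hypothetical rule for illustration: "a word ending in -ed is a past-tense verb".
PAST_TENSE_RULE = re.compile(r"\w+ed")

def looks_past_tense(word: str) -> bool:
    """Apply the single hand-written rule to one word (case-insensitive)."""
    return PAST_TENSE_RULE.fullmatch(word.lower()) is not None

for word in ["chased", "walked", "bit", "bed"]:
    print(word, looks_past_tense(word))
```

The rule accepts chased and walked, misses the irregular past form bit, and wrongly accepts the noun bed. Each repair needs another rule or a lexicon entry, which is why such systems are precise but labour-intensive.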
There are approximately 7,000 known living languages distributed across 128 language families (#5). The major families include Indo-European (English, Hindi, Urdu, French, German), Sino-Tibetan (Mandarin Chinese, Tibetan), Afro-Asiatic (Arabic, Amharic), Dravidian (Tamil, Telugu), Japonic (Japanese), and Kra-Dai (Thai, Lao), among many others.
Languages can be classified in three ways (#4):
- Genetically — by shared ancestry. English and Hindi are both Indo-European, descended from a common ancestral language.
- Areally — by geographic proximity and contact. Languages in the same region often share features regardless of family.
- Typologically — by structural properties such as word order (SVO, SOV, VSO) or morphological type. This classification is directly relevant to NLP system design.
Most NLP research and most training data are concentrated in a small number of languages, primarily English. Systems designed with only English in mind often fail on other languages, because the structural properties of English are not universal (#6).
Key Concepts: Structure and Meaning
Expand each concept. Consider the implications for NLP before reading the explanation.
Concept #0, that knowing about linguistic structure is important for feature design and error analysis, is the founding premise of the course. NLP engineers who understand linguistic structure make better decisions at every stage of system development:
- Feature design — knowing that morphology encodes tense, number, and case means you can design features that capture these distinctions, rather than treating all word forms as unrelated tokens.
- Error analysis — when a system fails, linguistic knowledge tells you why. A system that confuses bank (financial) with bank (river) is failing at word sense disambiguation. A system that misattaches a prepositional phrase is failing at syntactic parsing. Without linguistic concepts, errors look random; with them, patterns emerge and can be addressed.
This concept applies regardless of whether the system is rule-based, statistical, or neural. Even when a neural model learns its representations automatically, designers who understand the underlying linguistic structure are better placed to choose training data, design evaluations, and interpret the model's failures.
A bag of words is an unordered collection of tokens (strictly, a multiset: it keeps counts but not positions). It records which words appeared and how often, but nothing about how they relate to each other. The sentences The dog bit the man and The man bit the dog have identical bags of words, yet they have opposite meanings.
Morphosyntax is the system of constraints that governs how words combine — both their physical form (morphology) and their arrangement (syntax). It is what transforms a bag of words into a sentence with a determinate meaning.
Any NLP task that requires understanding who did what to whom — relation extraction, question answering, machine translation, semantic role labelling — requires access to morphosyntactic information, whether explicitly represented or implicitly learned.
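The contrast between a bag of words and a sentence can be checked directly. The following is a minimal Python sketch; the helper names `bag_of_words` and `bigrams` are assumptions for illustration, and adjacent-word pairs are used only as a very weak stand-in for word-order information, not as a model of morphosyntax.

```python
from collections import Counter

def bag_of_words(sentence: str) -> Counter:
    """Lower-case, strip the final full stop, and count tokens, discarding order."""
    return Counter(sentence.lower().rstrip(".").split())

def bigrams(sentence: str) -> list[tuple[str, str]]:
    """Adjacent token pairs, which preserve a little local word-order information."""
    tokens = sentence.lower().rstrip(".").split()
    return list(zip(tokens, tokens[1:]))

a = "The dog bit the man."
b = "The man bit the dog."

print(bag_of_words(a) == bag_of_words(b))  # True: the bags are identical
print(bigrams(a) == bigrams(b))            # False: order-sensitive features differ
```

Any representation that compares equal for these two sentences cannot support a task that depends on who bit whom.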
Morphosyntactic constraints operate at two levels simultaneously:
- Form constraints: rules about which combinations of words are grammatical. In English, attributive adjectives precede nouns (the big dog) rather than follow them (*the dog big). In French, most adjectives follow the noun (le chien noir, literally "the dog black"), although a small set of common adjectives such as grand precede it (le grand chien). These are language-specific form constraints.
- Meaning constraints: grammatical form signals meaning. In English, word order determines who is the agent and who is the patient. In German and Urdu, case morphology does the same work, allowing more flexible word order without losing the meaning distinction.
An NLP system that ignores these constraints will misinterpret grammatical structure and generate ungrammatical output. Machine translation systems that treat source and target languages as having the same morphosyntax produce systematically wrong translations.
The central task of sentence-level meaning is establishing predicate-argument structure: identifying the event or state described, and the participants in it. Languages achieve this through several strategies:
- Word order: English relies heavily on word order. In an active clause, the noun phrase before the verb is typically the agent and the one after it is the patient: The cat chased the mouse vs The mouse chased the cat.
- Case marking: Urdu, Hindi, German, Japanese, and many other languages use case marking on noun phrases to identify grammatical roles. In Japanese, -ga marks the subject and -wo marks the object, allowing relatively free word order without ambiguity.
- Verb agreement: verbs in many languages agree with their subjects (and sometimes objects) in person, number, and gender, providing redundant cues to grammatical structure that can aid parsing.
- Topic marking: languages like Japanese and Mandarin Chinese use topic particles or fronting to signal discourse topic, which does not always align with the grammatical subject.
No single strategy is universal. NLP systems designed around English's reliance on word order will fail on languages that use other strategies as their primary mechanism.
Key Concepts: Language Diversity
Continue with the remaining core concepts for this unit.
Each classification scheme reveals different things about languages and their relationships:
Genetic classification groups languages by descent from a common ancestor. The Indo-European family includes English, French, German, Russian, Hindi, Urdu, and Persian — all descended from Proto-Indo-European. Languages within a family often share structural features, but these features can diverge substantially over time.
Areal classification groups languages by geographic contact, regardless of ancestry. The Balkan sprachbund groups Bulgarian, Romanian, Albanian, and Greek, which belong to four different branches of Indo-European (Slavic, Romance, Albanian, and Hellenic), because centuries of contact produced shared grammatical features. South Asian languages from multiple families share retroflex consonants and postpositions.
Typological classification groups languages by structural properties. Word order typology classifies languages as SVO (English, Mandarin Chinese, Thai), SOV (Japanese, Urdu, Hindi, Turkish, Korean), VSO (Welsh, Classical Arabic), and other patterns. This is directly relevant to NLP: a machine translation system translating between an SVO and an SOV language must restructure the argument order of every clause.
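The restructuring step mentioned above can be sketched in a few lines of Python. This is a toy under strong assumptions: the clause has already been analysed into subject, verb, and object (the hard part), and the `clause` dictionary and `linearise` function are hypothetical names introduced here. Real translation also needs lexical choice, case marking, and much more.

```python
# Hypothetical clause representation: roles already identified by an upstream analysis.
clause = {"S": "the cat", "V": "chased", "O": "the mouse"}

def linearise(clause: dict[str, str], order: str) -> list[str]:
    """Return the constituents in the order given by a typology label such as 'SVO'."""
    return [clause[role] for role in order]

print(linearise(clause, "SVO"))  # English-like order
print(linearise(clause, "SOV"))  # Japanese- or Urdu-like order
```

The point of the sketch is that reordering is only possible once the roles are known; a system that never identifies S, V, and O has nothing to reorder.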
The scale of linguistic diversity is a practical challenge for NLP:
- The top 10 languages by speaker count account for roughly half the world's speakers. The remaining half speak one of approximately 6,990 other languages.
- The vast majority of these languages have no significant digital presence — no large text corpora, no annotated training data, no developed NLP tools. These are low-resource languages.
- Even among well-resourced languages, most NLP research focuses on English. The next most-studied languages (Mandarin Chinese, German, French, Spanish, Arabic) trail far behind.
This matters ethically as well as technically. NLP systems that work well only for speakers of a small number of dominant languages reproduce and amplify existing inequalities in access to technology. Linguistically informed NLP design is also a question of fairness.
A system designed with only one language's structural properties in mind will generalise poorly. Common English-specific assumptions that fail cross-linguistically include:
- Whitespace as a word boundary — reliable in English, but Chinese, Japanese, and Thai use no spaces between words in standard orthography.
- Tokenisation by splitting on spaces — fails for agglutinative languages like Turkish or Finnish, where a single written word may encode what English expresses as a full clause.
- Fixed SVO word order for parsing — incorrect for SOV languages like Japanese, Korean, Urdu, and Turkish, where the verb appears at the end of the clause.
- Small vocabulary assumptions — morphologically rich languages have far larger surface vocabularies than English; fixed-vocabulary models handle them poorly.
NLP systems that incorporate typological knowledge — knowing that the language being processed is SOV, or highly agglutinative, or uses a logographic writing system — can be designed to handle structural variation more robustly from the outset.
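The whitespace assumption is easy to test. The sketch below applies Python's built-in `str.split()` to one English sentence and to the Japanese and Mandarin Chinese example sentences used in this unit, written without added spaces as in standard orthography. The `whitespace_tokens` helper and the `samples` dictionary are names assumed for illustration.

```python
def whitespace_tokens(text: str) -> list[str]:
    """Naive tokeniser: split on whitespace only."""
    return text.split()

samples = {
    "English": "The cat chased the mouse.",
    "Japanese": "猫がネズミを追いかけた。",
    "Mandarin Chinese": "那本书，我已经看完了。",
}

for language, text in samples.items():
    tokens = whitespace_tokens(text)
    print(language, len(tokens), tokens)
```

The English sentence yields five tokens (with the full stop still attached to the last one), while each of the other two sentences comes back as a single unsegmented token. Handling them requires a dedicated word segmentation step.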
Worked Examples: Language in Action
The same meaning can be expressed through radically different structures across languages. Work through these examples before completing the quiz.
Word order encodes meaning in English. The position of noun phrases relative to the verb determines who is the agent and who is the patient.
| Sentence | Agent (biter) | Patient (bitten) |
|---|---|---|
| The dog bit the man. | dog | man |
| The man bit the dog. | man | dog |
English has almost no case morphology on nouns (only pronouns such as he/him retain it), so with full noun phrases word order is the main cue to grammatical role. Swap the noun phrases and the meaning reverses entirely.
NLP implication: any model processing English must be sensitive to word order. A bag-of-words model assigns identical representations to both sentences above — which is a fundamental failure for any task requiring role identification.
Case marking allows flexible word order. In Japanese and Urdu, grammatical role is marked on the noun phrase itself, so word order can vary without changing meaning.
Japanese (default SOV, but order is flexible with case particles):
猫が ネズミを 追いかけた。
Neko-ga nezumi-wo oikaketa.
cat-NOM mouse-ACC chased
“The cat chased the mouse.”
The particles -ga (nominative/subject) and -wo (accusative/object) mark grammatical roles regardless of word order. The same particles on reversed noun phrases would reverse the meaning.
Urdu:
بلی نے چوہے کو پکڑا۔
Billī-ne cūhē-ko pakṛā.
cat-ERG mouse-ACC caught
“The cat caught the mouse.”
NLP implication: a parser trained on English word-order rules will systematically misanalyse Japanese and Urdu sentences. Grammatical role must be read from the case marker, not from position.
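The difference between reading roles from position and reading them from case markers can be sketched as follows. This is a toy over the romanised Japanese example, assuming hyphen-separated particles and a two-entry particle table; `CASE_ROLES`, `roles_by_case`, and `roles_by_position` are hypothetical names, and mapping nominative to agent and accusative to patient is a simplification that holds for this sentence but not in general.

```python
# Toy particle table for the romanised example (an assumption for illustration).
CASE_ROLES = {"ga": "agent", "wo": "patient"}

def roles_by_case(tokens: list[str]) -> dict[str, str]:
    """Read each noun's role from its case particle, ignoring position."""
    roles = {}
    for token in tokens:
        stem, _, particle = token.partition("-")
        if particle in CASE_ROLES:
            roles[CASE_ROLES[particle]] = stem
    return roles

def roles_by_position(tokens: list[str]) -> dict[str, str]:
    """English-style heuristic: first noun phrase is the agent, second the patient."""
    nouns = [t.partition("-")[0] for t in tokens if "-" in t]
    return {"agent": nouns[0], "patient": nouns[1]}

canonical = ["neko-ga", "nezumi-wo", "oikaketa"]   # cat-NOM mouse-ACC chased
scrambled = ["nezumi-wo", "neko-ga", "oikaketa"]   # mouse-ACC cat-NOM chased

print(roles_by_case(canonical), roles_by_case(scrambled))
print(roles_by_position(scrambled))
```

Both orders give the cat as agent when roles are read from the particles, while the positional heuristic wrongly makes the mouse the agent in the scrambled order.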
Topic and subject are not the same thing. In topic-prominent languages, a noun phrase can be fronted to signal discourse topic — independently of its grammatical role.
Mandarin Chinese:
那本书，我已经看完了。
Nèi běn shū, wǒ yǐjīng kàn-wán le.
that CL book I already read-finish PERF
“That book, I’ve already finished reading [it].”
Nèi běn shū ("that book") is the topic — fronted for discourse reasons — but it is not the grammatical subject. The subject is wǒ ("I").
Japanese は (wa) vs が (ga):
魚は 鯛が おいしい。
Sakana-wa tai-ga oishii.
fish-TOP sea.bream-NOM delicious
“As for fish, sea bream is delicious.”
NLP implication: conflating topic with subject — a common error in systems built on English assumptions — leads to incorrect semantic role assignment in Chinese and Japanese.
Writing systems vary radically across languages, affecting tokenisation, segmentation, and every step of text preprocessing in NLP.
| Language | Script type | NLP challenge |
|---|---|---|
| English | Alphabetic (Latin) | Whitespace tokenisation mostly works; punctuation handling needed |
| Mandarin Chinese | Logographic (Hanzi) | No spaces between words; word segmentation is a non-trivial task |
| Japanese | Mixed (kanji + hiragana + katakana) | Three scripts in one text; script type signals word class |
| Thai | Syllabic abugida | No spaces between words; spaces mark phrases not word boundaries |
| Urdu | Perso-Arabic (Nastaliq) | Right-to-left; short vowels often omitted in script |
| Hindi | Devanagari (abugida) | Conjunct consonants; sandhi rules affect written word boundaries |
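NLP implication: a preprocessing pipeline designed only for English will mishandle the other languages in this table in some way, whether through failed word segmentation (Mandarin Chinese, Japanese, Thai) or through script-specific issues such as text direction, omitted vowels, and conjunct characters (Urdu, Hindi). Multilingual NLP requires script-aware processing from the very first step.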
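As one small example of script-aware processing, the sketch below labels each character of the Japanese example sentence with a rough script name and groups consecutive characters that share a label. It relies on the standard-library `unicodedata.name` function and takes the first word of each Unicode character name as the label, which is a crude heuristic assumed here for illustration; `script_of` and `script_runs` are hypothetical helper names.

```python
import unicodedata
from itertools import groupby

def script_of(char: str) -> str:
    """Rough script label: the first word of the Unicode character name."""
    return unicodedata.name(char, "UNKNOWN").split()[0]

def script_runs(text: str) -> list[tuple[str, str]]:
    """Group consecutive characters that share a script label."""
    return [(script, "".join(chars)) for script, chars in groupby(text, key=script_of)]

text = "猫がネズミを追いかけた。"
for script, run in script_runs(text):
    print(script, run)
```

The katakana run ネズミ happens to coincide with a word, but the verb 追いかけた is split between a kanji run and a hiragana run. Script changes are a useful cue for segmentation, not a substitute for it.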
Check Your Understanding
Select the best answer for each question.
A bag-of-words model assigns the same representation to 'The dog bit the man' and 'The man bit the dog'. Which Bender concept directly explains why this is a problem for NLP?
An NLP parser is trained on English and applied to Japanese text. It identifies noun phrases as subjects based on their position before the verb. Which concept best explains this failure?
Large language models (LLMs) are trained on vast quantities of text to predict the next token in a sequence. They produce fluent output — but fluency is not understanding. Three failure patterns follow directly from the concepts in this unit:
- English-centric bias — most training data is English. LLMs learn English morphosyntactic patterns implicitly and apply them cross-linguistically, producing errors in languages with different word order, case systems, or morphological type (#6).
- Structural blindness — despite their sophistication, LLMs sometimes behave as if sensitive only to word co-occurrence rather than structure. They can confuse agent and patient in sentences that differ only in word order, particularly in lower-resource languages (#1, #3).
- Surface fluency without structural grounding — a model can generate a perfectly fluent sentence with incorrect predicate-argument structure. The output sounds right; the meaning is wrong. Without linguistic evaluation criteria, this error goes undetected (#0).
Understanding these failure modes is the first step towards building more robust systems — and towards evaluating AI outputs critically rather than accepting fluency as a proxy for correctness.
Activities
Individual task — Ambiguity analysis
Each of the following sentences has more than one possible interpretation. For each one:
- State both interpretations in plain English.
- Identify what type of linguistic information is needed to resolve the ambiguity — word meaning, word order, pronoun reference, or sentence structure.
- Describe in one sentence how an NLP system might fail on this input.
- I saw the man with the binoculars.
- The chicken is ready to eat.
- She told her sister she had won.
Pair task — Comparing NLP system behaviour
With a partner, choose one NLP application you have both used — for example, a machine translation tool, a voice assistant, or a search engine.
Find at least two examples where the system produced an incorrect or unexpected output. For each example:
- Describe the input and the system's output.
- Identify which linguistic level the failure involves — morphology, syntax, semantics, or pragmatics.
- Suggest what linguistic information the system would need to handle this input correctly.
Be prepared to share your examples in the group discussion.
Group task — Mapping language diversity
As a group, select five languages from five different language families. For each language, research and record:
- Its language family and approximate number of speakers
- Its basic word order (SVO, SOV, VSO, or other)
- Whether it uses case marking, agreement, or word order as its primary strategy for indicating grammatical roles (#3)
- One feature of the language that you think would challenge an NLP system designed primarily for English
Compile your findings into a brief comparison table. This exercise builds the cross-linguistic awareness that runs through the rest of the course.
Review
- #0 — Knowing about linguistic structure is important for feature design and error analysis in NLP.
- #1 — Morphosyntax is the difference between a sentence and a bag of words.
- #2 — The morphosyntax of a language is the constraints it places on how words combine in form and meaning.
- #3 — Languages use morphology and syntax to indicate who did what to whom, and use a range of strategies to do so.
- #4 — Languages can be classified genetically, areally, or typologically.
- #5 — There are approximately 7,000 known living languages distributed across 128 language families.
- #6 — Incorporating information about linguistic structure and variation can make for more cross-linguistically portable NLP systems.
- Rule-based — linguistic knowledge is encoded explicitly as grammars and lexicons. Requires deep expertise; fragile to variation.
- Statistical — patterns are learned from corpora. Linguistic knowledge guides feature design and evaluation.
- Neural — representations are learned from data. Linguistic knowledge is essential for error analysis and for understanding systematic failure modes.
Across all three, concept #0 holds: knowing about linguistic structure makes you a more effective NLP engineer.
Proceed to Unit 2: Morphology when ready.