Learning Objectives

  • Explain what morphosyntax describes and distinguish it from morphology and syntax separately
  • Identify the major categories of morphological features (TAM, PNG, case) and give examples from multiple languages
  • Describe agreement systems and explain the concept of agreement controller and target
  • Analyse how cross-linguistic variation in morphosyntactic categories affects NLP system design

Reading

Read Chapter 4 (Morphosyntax) of Bender, E. M. (2013). Linguistic Fundamentals for Natural Language Processing: 100 Essentials from Morphology and Syntax. Morgan & Claypool. Use the course materials below to activate and consolidate the concepts from that chapter.

1

Core Input

Read through each topic below. Take notes on the key ideas before moving to the activities.

Morphosyntax is the study of how the morphemes in a word affect its combinatoric potential — what it can combine with in a sentence. It sits at the interface between morphology (word structure) and syntax (sentence structure): the grammatical features encoded in word forms constrain and enable syntactic combinations.

The key insight is that words are not syntactically neutral. A verb inflected for third person singular present tense in Spanish (habla) cannot appear with a first person subject — the morphological form determines what the word can combine with. Case marking on a German noun determines which syntactic positions it can occupy.

A simple illustration of cross-linguistic difference in morphosyntactic patterning:

  • In English, adjectives appear before the noun they modify: the big dog. The ordering is a syntactic fact, but it is not enforced by morphological agreement — English adjectives do not agree with nouns.
  • In French, many adjectives appear after the noun: le chien noir (the black dog). Whatever their position, adjectives must also agree with the noun in gender and number: le grand chien (M SG) vs la grande chienne (F SG). Here morphological features constrain syntactic distribution.

NLP implication: morphosyntactic features are the mechanism by which word-level grammar and sentence-level grammar interact. A parser that ignores morphosyntax will miss the constraints that rule out many ill-formed strings, and a generator that ignores them will produce such strings.
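The constraint on habla can be sketched as a compatibility check between feature bundles. This is a minimal illustration, not a real parser component: the dictionary keys and the `compatible` function are hypothetical names chosen for this example.

```python
# Hypothetical feature bundles; the attribute names are illustrative only.
YO = {"person": 1, "number": "sg"}
ELLA = {"person": 3, "number": "sg", "gender": "f"}
HABLO = {"person": 1, "number": "sg", "tense": "pres", "mood": "ind"}
HABLA = {"person": 3, "number": "sg", "tense": "pres", "mood": "ind"}


def compatible(subject: dict, verb: dict) -> bool:
    """Return True when no feature shared by both bundles has conflicting values."""
    shared = subject.keys() & verb.keys()
    return all(subject[name] == verb[name] for name in shared)


print(compatible(ELLA, HABLA))  # True: ella habla
print(compatible(YO, HABLA))    # False: *yo habla
```

Features present on only one element (tense on the verb, gender on the pronoun) are ignored, so only the features that both words carry can cause a clash.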

Verbs in many languages carry morphological features encoding information about the temporal and modal structure of the situation they describe. The standard grouping is TAM — tense, aspect, and mood:

  • Tense locates a situation in time relative to a reference point (usually utterance time): past, present, future. English encodes tense morphologically: walk / walked.
  • Aspect encodes the internal structure of the event — how it unfolds over time. The key distinction is between perfective (the event is viewed as a complete whole) and imperfective (the event is viewed as ongoing or habitual). English has limited aspect morphology (-ing for progressive); many other languages are more elaborate.
  • Mood encodes the speaker's epistemic or deontic stance toward the situation: indicative (asserted), subjunctive (hypothetical, subordinate), imperative (command). Romance languages have rich subjunctive morphology; English has almost none.

Languages differ sharply in which TAM distinctions they encode morphologically:

  • English is tense-prominent: tense is morphologically obligatory; aspect is optional and partly periphrastic.
  • Arabic has aspect (perfect/imperfect) but does not grammatically encode tense in the same way — temporal location is often inferred from context or adverbs.
  • Turkish encodes both tense and evidentiality as obligatory verb morphology — a speaker must grammatically mark whether they witnessed the event directly or inferred it.

NLP implication: machine translation between languages that encode TAM differently must restructure the temporal framework of the source; it is not simply a word-for-word substitution. A system translating from Mandarin Chinese (which marks aspect with particles but has no tense morphology) into Turkish (which requires tense on finite verbs and an evidential choice in the past tense) must infer and generate information with no direct source counterpart.

Agreement is the phenomenon in which a morphological feature is marked on multiple elements within a phrase or clause, even though the feature semantically belongs to only one of them. Concept #38 captures this: the feature belongs to one element — the controller — and the other elements — the targets — carry agreement morphology that copies the controller's feature value.

Two major agreement systems are found cross-linguistically:

  • Verb–subject agreement — the verb agrees with its subject in person and/or number and/or gender. In Spanish, hablo (I speak) vs habla (he/she speaks): the verb form changes according to the subject's person and number. The subject is the controller; the verb is the target.
  • Adjective–noun agreement — adjectives agree with the noun they modify in number, gender, and sometimes case. In French, un grand livre (M SG, a big book) vs une grande table (F SG, a big table): the adjective grand takes different forms depending on the gender and number of the noun. The noun is the controller; the adjective is the target.

Agreement can also track abstract features not visible on the controller's surface form (#41): in British English, The committee have decided treats a formally singular noun as semantically plural. The target (the verb) agrees with a covert plural feature.

NLP implication: agreement errors — verbs that do not agree with their subjects, adjectives that do not agree with their nouns — are among the most frequent grammatical errors produced by NLP generation systems, particularly in morphologically complex languages. Tracking agreement across long spans of text is a non-trivial computational problem.

2

Key Concepts A: Morphological Features (#28–#37)

For each concept, consider the NLP implication before reading the explanation.

Morphosyntax is the interface between morphology and syntax: the features encoded in word forms determine what those words can combine with. The word's internal structure (its morphemes) constrains its external syntactic behaviour (what it combines with).

A clear English example: verbal inflection affects combinatoric potential in the auxiliary system. The past participle walked combines with perfect have, and the -ing form combines with progressive be: had walked and is walking are grammatical; *had walking is not, and is walked is possible only as a passive. The form of the verb determines which auxiliaries can combine with it.

In case-marking languages, this is even more explicit. A German noun in the nominative case signals that it is the subject; a noun in the accusative signals the object. The case morpheme on the noun determines which syntactic position it fills — and which verbs or prepositions it can combine with.

NLP implication: a system that ignores morphosyntax cannot correctly determine what words can combine with. This affects every downstream task: parsing, semantic role labelling, machine translation, and grammatical error correction all depend on knowing the combinatoric potential of each word form.

TAM features encode temporal and modal information about the situation described. Languages differ substantially in how they encode these categories:

  • English tense: walk / walked / will walk. Past and present are encoded morphologically; future is periphrastic (auxiliary will).
  • Turkish TAM: Turkish encodes both tense and evidentiality as obligatory verb morphology. Gitti (he went — direct witness past) vs gitmiş (he went — hearsay/inference past). The morphology encodes whether the speaker witnessed the event or is reporting it secondhand.
  • Japanese aspect: the -te iru construction (the -te form of the verb plus the auxiliary iru) marks ongoing events, and tense is marked on the auxiliary: tabete iru (is eating, ongoing) vs tabete ita (was eating, past ongoing).

NLP implication: MT between languages with different TAM systems is not a word-for-word problem. A system must transfer temporal and modal information across radically different grammatical frameworks — inferring what the source language leaves implicit when the target language requires it to be explicit.

Nouns and noun phrases carry features about the entities they refer to:

  • Number: English cat / cats (singular/plural). Arabic has three-way number: singular, dual, and plural — kitāb (one book), kitābān (two books), kutub (books).
  • Gender: French le chat (M, the cat) vs la chatte (F, the female cat). German has three genders: der (M), die (F), das (N). Crucially, grammatical gender is not biological sex — French la sentinelle (sentry) is grammatically feminine regardless of the biological sex of the person.
  • Person: pronouns encode person (I / you / he / she / we / they). Some languages also encode person on nouns via possessive morphology.

NLP implication: grammatical gender is a formal morphosyntactic feature that is largely arbitrary for inanimate nouns; it is not a reliable guide to biological sex. NLP systems that conflate grammatical gender with biological sex produce systematic errors. MT systems must track noun gender to generate correct agreement on adjectives and verbs in the target language.

Case marking indicates the grammatical function of a noun phrase in the sentence — whether it is the subject (nominative), object (accusative), indirect object (dative), possessor (genitive), and so on. Case-marking languages include:

  • German: four cases — Nominativ, Akkusativ, Dativ, Genitiv. The definite article changes form: der Hund (the dog, NOM) vs den Hund (ACC) vs dem Hund (DAT) vs des Hundes (GEN).
  • Japanese: case is marked by postpositional particles: -ga (nominative subject), -wo (accusative object), -ni (dative/locative), -no (genitive). A parser reading Japanese reads the case particle — not the word order — to determine grammatical function.
  • Turkish: six cases: nominative (-Ø), accusative (-i), dative (-e), locative (-de), ablative (-den), genitive (-in). Case suffixes undergo vowel harmony with the root.

NLP implication: in case-marking languages, word order is freer than in English because grammatical function is marked morphologically rather than positionally. Parsers for these languages must read from case markers, not from position; parsers trained on English-like word order constraints will fail.
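Reading grammatical function from case markers rather than from position can be sketched as follows. The sketch assumes romanised input with particles attached by a hyphen, as in the examples above; the `PARTICLE_ROLE` table and `assign_roles` function are hypothetical names, and a real system would need a tokenizer and a much fuller particle inventory.

```python
# Particle-to-function table based on the Japanese examples in the text.
# Hyphen-attached romanised tokens are an assumption made for this sketch.
PARTICLE_ROLE = {
    "ga": "subject",
    "wo": "object",
    "ni": "dative_or_locative",
    "no": "possessor",
}


def assign_roles(tokens: list[str]) -> dict[str, str]:
    """Map each case-marked token to a grammatical function, ignoring position."""
    roles = {}
    for token in tokens:
        stem, sep, particle = token.rpartition("-")
        if sep and particle in PARTICLE_ROLE:
            roles[PARTICLE_ROLE[particle]] = stem
    return roles


sov = ["inu-ga", "neko-wo", "mita"]  # 'the dog saw the cat'
osv = ["neko-wo", "inu-ga", "mita"]  # same roles, different order
print(assign_roles(sov) == assign_roles(osv))  # True
```

Both orders yield the same role assignment because the function never consults token position, which is the point the paragraph above makes about parsers for case-marking languages.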

Many languages encode negation morphologically, either as an inflectional feature on the verb or through derivational affixes:

  • English derivational negation: un- (unhappy, unkind), -less (harmless, careless), in-/im-/ir- (inefficient, impossible, irregular). The prefixes attach to adjectives; the suffix -less attaches to nouns to form adjectives.
  • Turkish verbal negation: the negative morpheme -me/-ma is inserted into the verb complex between the root and TAM suffixes: git-me-di (did not go). Negation is morphologically inside the verb, not a separate syntactic word.
  • Japanese verbal negation: -nai is a negative suffix: ika-nai (does not go), tabe-nai (does not eat).
  • Swahili: the negative prefix ha- attaches to the verb, the present tense marker -na- is dropped, and the final vowel changes to -i: anaenda (he/she is going) vs haendi (he/she is not going).

NLP implication: negation scope is one of the hardest problems in NLP sentiment analysis and natural language inference. A system that processes words without tracking morphological negation markers will systematically misjudge polarity. In morphologically complex languages, negation is fused with the verb — the system must parse the morphological structure to detect it.
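Once a morphological analyser has segmented the verb, detecting verb-internal negation becomes a lookup over the morphemes. The sketch below assumes hyphen-segmented input like the glosses above (git-me-di, tabe-nai); the language codes, the `NEG_MORPHEMES` table, and `is_negated` are hypothetical names for illustration.

```python
# Negative morphemes from the examples in the text, keyed by an
# illustrative language code. Input must already be segmented.
NEG_MORPHEMES = {
    "tur": {"me", "ma"},  # Turkish -me/-ma
    "jpn": {"nai"},       # Japanese -nai
}


def is_negated(segmented: str, lang: str) -> bool:
    """Return True if any suffix slot after the root holds a negative morpheme."""
    root, *suffixes = segmented.split("-")
    return any(morpheme in NEG_MORPHEMES[lang] for morpheme in suffixes)


print(is_negated("git-me-di", "tur"))  # True: 'did not go'
print(is_negated("git-ti", "tur"))     # False: 'went'
print(is_negated("tabe-nai", "jpn"))   # True: 'does not eat'
```

The lookup is deliberately naive: Turkish also has a homophonous nominalising suffix -me/-ma, so a real analyser would need to disambiguate by slot position and context.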

Evidentiality is the grammatically obligatory marking of the speaker's source of information — whether they witnessed the event directly, inferred it, or heard it from another person. In languages with grammatical evidentiality, the verb morphology encodes this distinction on every utterance.

  • Turkish: -di marks direct witness past; -miş marks hearsay or inference past. Ahmet geldi (Ahmet came — I saw it) vs Ahmet gelmiş (Ahmet came — I was told / I infer from evidence).
  • Bulgarian: has a grammatical reportative form used for events the speaker did not witness.
  • English: has no grammatical evidentiality. Source of information is expressed lexically if at all: apparently, reportedly, I heard that. A speaker can assert any proposition as fact, regardless of evidence.

NLP implication: LLMs generate statements with equal linguistic confidence whether the proposition is well-established, inferred, or fabricated. Evidentiality offers a grammatical lens on the problem of hallucination: in a language with grammatical evidentiality, an assertion must be marked for its information source. English imposes no such grammatical obligation, and this structural fact is arguably part of what makes confident-sounding confabulation read so naturally in LLM output.
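The Turkish contrast between -di (witnessed) and -miş (reported or inferred) can be sketched as a small generator. This is a simplified illustration covering only third person singular forms of regular stems; it applies four-way vowel harmony and the d/t alternation after voiceless consonants. The function names are hypothetical.

```python
# Four-way harmony: the suffix vowel is chosen by the last vowel of the stem.
HIGH_VOWEL = {"a": "ı", "ı": "ı", "e": "i", "i": "i",
              "o": "u", "u": "u", "ö": "ü", "ü": "ü"}
VOICELESS = set("çfhkpsşt")


def harmonic_vowel(stem: str) -> str:
    """Return the high vowel that harmonises with the stem's last vowel."""
    for char in reversed(stem):
        if char in HIGH_VOWEL:
            return HIGH_VOWEL[char]
    raise ValueError(f"no vowel found in stem {stem!r}")


def past_3sg(stem: str, witnessed: bool) -> str:
    """Build the 3sg past form: -DI if witnessed, -mIş if reported or inferred."""
    vowel = harmonic_vowel(stem)
    if witnessed:
        consonant = "t" if stem[-1] in VOICELESS else "d"
        return stem + consonant + vowel
    return stem + "m" + vowel + "ş"


print(past_3sg("gel", witnessed=True))   # geldi
print(past_3sg("gel", witnessed=False))  # gelmiş
print(past_3sg("git", witnessed=True))   # gitti
```

The `witnessed` argument has no default on purpose: a generator targeting Turkish cannot produce a past-tense form without committing to an evidential value, which is exactly the information an English or Mandarin source does not supply.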

Definiteness encodes whether the referent of a noun phrase is assumed to be identifiable by the hearer (definite) or not (indefinite). Languages encode it in different ways:

  • English: a free article system — a/an (indefinite) vs the (definite). Articles are separate words.
  • Romanian: definiteness is a suffix: om (a man) → omul (the man); casă (a house) → casa (the house). The definite article is morphologically integrated into the noun.
  • Arabic: definiteness is a prefix: al-bayt (the house) vs bayt (a house). The l of the prefix assimilates to certain following consonants, as in ash-shams (the sun).
  • Languages without articles: Japanese, Mandarin, and Russian have no grammatical articles; definiteness is inferred from context and word order.

NLP implication: article generation is a known difficulty for NLP systems. Speakers of article-free languages learning English frequently omit or misplace articles; NLP grammatical error correction must model definiteness to correct these errors. MT from article-free languages into English must generate articles from contextual inference.

Honorifics encode the social relationship between the speaker, the hearer, and/or the referent — respect, familiarity, formality, status. In some languages this is obligatory morphological marking on the verb:

  • Japanese keigo (polite/formal speech): iku (go, plain form) vs irassharu (go, honorific form used when referring to the actions of a person the speaker defers to). Multiple levels of keigo encode distinct social relationships. Using the wrong level is a serious social error.
  • Korean: verb endings encode the speaker-hearer relationship across six speech levels. The choice of ending signals the relative status of interlocutors.
  • Thai: politeness particles (khráp for male speakers, khâ for female speakers) are appended to utterances as obligatory politeness markers.

NLP implication: NLP systems that generate text in Japanese or Korean must track the social register of the interaction and select the appropriate honorific forms throughout the response. Mistaken honorific use is socially significant: it can be perceived as insulting or inappropriate, not merely as grammatically wrong. Many NLP systems default to a single register, missing part of the communicative function of the language.

Possessive relationships — ownership, belonging, association — can be encoded morphologically on the noun that is possessed:

  • Turkish possessive suffixes: ev (house) → evim (my house) / evin (your house, sg.) / evi (his/her house) / evimiz (our house) / eviniz (your house, pl.) / evleri (their house). The possessor's person and number are encoded as a suffix on the possessed noun — no separate possessive pronoun is needed.
  • Hungarian: a similar system of possessive suffixes on the possessed noun.
  • English: the possessive clitic -'s attaches to the possessor noun phrase: the dog's bone, the teacher's pen.

NLP implication: information extraction systems looking for ownership and relational information must recognise morphological possessive marking. In Turkish, the possessive suffix is fused with the noun — a morphological analyser is required to segment and interpret the possessive relationship encoded in the surface form.

The categories in concepts #29–#36 are not an exhaustive list. Languages encode a wide range of additional grammatical notions morphologically, many of which have no equivalent category in English:

  • Directional morphemes: some Indigenous American languages mark whether motion is toward or away from the speaker as obligatory verb morphology. English expresses this lexically (come vs go), but the grammatical obligation is absent.
  • Switch reference: found in several Australian Aboriginal and Indigenous American languages, this morphological system marks whether the subject of the next clause is the same as or different from the subject of the current clause. It is an elaborate morphological anaphora-tracking system.
  • Mirativity: some languages mark grammatically whether the speaker considers the information new or surprising.
  • Telicity: the distinction between events with a natural endpoint (telic: build a house) and events without one (atelic: run) can be morphologically marked in some languages.

NLP implication: English morphosyntax is not a universal template. NLP systems built around English grammatical categories will lack the representations needed for these language-specific features. This is a fundamental obstacle to genuinely multilingual NLP — the categories themselves differ, not just the forms.

3

Key Concepts B: Agreement and Cross-linguistic Variation (#38–#43)

Continue with the remaining core concepts for this unit.

Agreement is not the duplication of information — it is the marking of a feature on a controller (the element that semantically "owns" the feature) and the copying of that feature value onto one or more targets. The feature belongs to the controller; the targets merely carry agreement morphology.

A classic example from Latin: bonus vir (a good man, nominative masculine singular). The noun vir (man) is the controller; it is nominative, masculine, and singular because of its grammatical role and inherent gender. The adjective bonus agrees with it — it carries nominative masculine singular endings not because it is nominative, masculine, and singular in its own right, but because the noun it modifies is. The accusative form of the same phrase is bonum virum: the case marker appears on both noun and adjective, but it belongs to the noun.

NLP implication: a parser must identify agreement relationships to correctly assign grammatical structure. Tracking which element is controller and which is target requires knowledge of morphosyntactic feature hierarchies — which elements own features, and which merely copy them.
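The controller and target relationship can be sketched in code: the noun stores the features, and the adjective form is looked up from them. The `Noun` class, the partial paradigm for bonus, and the `agree` function are hypothetical, and only the two cells discussed above are filled in.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Noun:
    """The controller: it owns case, gender, and number."""
    form: str
    case: str
    gender: str
    number: str


# Partial paradigm for bonus 'good': only the cells used in the text.
BONUS = {
    ("nom", "m", "sg"): "bonus",
    ("acc", "m", "sg"): "bonum",
}


def agree(adjective_paradigm: dict, controller: Noun) -> str:
    """Select the target form by copying the controller's feature values."""
    key = (controller.case, controller.gender, controller.number)
    return adjective_paradigm[key]


vir = Noun("vir", "nom", "m", "sg")
virum = Noun("virum", "acc", "m", "sg")
print(agree(BONUS, vir), vir.form)      # bonus vir
print(agree(BONUS, virum), virum.form)  # bonum virum
```

The adjective has no features of its own in this sketch; its form is entirely determined by the noun passed in, which mirrors the claim that the features belong to the controller.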

Verb–argument agreement is found in the vast majority of the world's languages. The typical pattern is verb–subject agreement, but many languages extend this:

  • Spanish verb–subject agreement: como (I eat) / comes (you eat, sg.) / come (he/she eats) / comemos (we eat) / coméis (you eat, pl.) / comen (they eat). The verb encodes person and number; the subject pronoun is often omitted because the verb form already identifies the subject.
  • Basque: verbs agree with both the subject AND the object — the verb form encodes the person and number of all core arguments simultaneously. This is cross-linguistic evidence that agreement is not restricted to the subject.
  • Swahili: both subject and object agreement are marked as prefixes on the verb, from a system of noun classes (not just person/number): ni-na-ki-soma (I am reading it — subject prefix ni-, tense na-, object prefix ki- for a class-7 noun).

NLP implication: verb–argument agreement must be tracked across the full span of the sentence, not just at the immediately adjacent subject. In pro-drop languages (like Spanish) where the subject pronoun is frequently omitted, the parser must recover the subject from the verb's agreement morphology — not from an overt NP.
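The Swahili form ni-na-ki-soma shows how several arguments can be recovered from the verb alone. The sketch below assumes pre-segmented input and a template of subject, tense, optional object, and root; the lookup tables contain only the morphemes glossed above, and `parse_verb` is a hypothetical name.

```python
# Only the prefixes glossed in the text; a real analyser needs full tables.
SUBJECT = {"ni": "1sg"}
TENSE = {"na": "present"}
OBJECT = {"ki": "class 7"}


def parse_verb(segmented: str) -> dict:
    """Read subject, tense, optional object, and root from a segmented verb."""
    parts = segmented.split("-")
    if len(parts) == 4:
        subj, tense, obj, root = parts
    elif len(parts) == 3:
        subj, tense, root = parts
        obj = None
    else:
        raise ValueError(f"unexpected template: {segmented!r}")
    return {
        "subject": SUBJECT[subj],
        "tense": TENSE[tense],
        "object": OBJECT[obj] if obj else None,
        "root": root,
    }


print(parse_verb("ni-na-ki-soma"))
```

No overt noun phrase is consulted at any point: the subject and object features come entirely from the agreement prefixes.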

Within the noun phrase, agreement cascades from the noun (as controller) to all dependents — determiners, adjectives, numerals, and demonstratives:

  • French noun-adjective agreement: le grand chat (M SG, the big cat) vs la grande chatte (F SG, the big female cat) vs les grands chats (M PL). The adjective grand takes four forms depending on gender and number: grand / grande / grands / grandes.
  • Russian: adjectives agree with nouns in number, gender, and case, yielding a paradigm of 24 cells (6 cases × 4 gender/number combinations: masculine, feminine, and neuter singular, plus a plural that does not distinguish gender). Syncretism means there are fewer than 24 distinct forms, but Russian NLP generation must still select the correct cell for every adjective in the text.

NLP implication: AI-generated French text frequently contains agreement errors — an adjective in the wrong gender or number form. This is a documented failure mode of large language models. Because LLMs learn statistics over surface forms rather than tracking agreement features structurally, they misapply agreement in less common contexts (unusual noun gender, long distance between noun and adjective, complex NP structure).

Agreement tracks grammatical and conceptual features that may not be visible on the controller's surface form. This is called agreement with a covert feature or notional agreement:

  • English collective nouns: The committee has decided (singular — treating the group as a unit) vs The committee have decided (British English — treating the group as a collection of individuals). The noun committee is morphologically singular, but the verb can agree with a conceptual plural interpretation. The agreement target (the verb) is sensitive to an abstract feature, not the overt form of the controller.
  • Gender agreement with common-gender nouns: in Spanish, el/la estudiante (the student) has the same form for both genders. The gender of the article, of agreeing adjectives, and of any referring pronoun must track the gender of the individual referred to, which is not overtly marked on the noun.

NLP implication: agreement cannot be resolved by surface pattern-matching alone. Tracking covert features requires richer linguistic representations — including conceptual and social information that is not recoverable from the text surface. This is a deep challenge for NLP systems that rely only on observable token sequences.

Every language marks some grammatical information morphologically and expresses other information lexically, syntactically, or not at all. The selection differs dramatically across languages:

  • Turkish marks evidentiality morphologically; English does not. A Turkish speaker must grammatically encode the source of their information on every past tense verb; an English speaker need not.
  • Japanese marks politeness register morphologically; English does not.
  • English marks tense morphologically; Mandarin Chinese does not. In Mandarin, temporal information is conveyed through time adverbials and aspect particles, not tense morphemes.
  • German and Japanese mark case morphologically; English does so only for pronouns.

NLP implication: when translating from a language that morphologically marks feature X into one that does not (or vice versa), the translation system must either infer and generate information that was absent in the source, or accept that information present in the source must be lost or demoted. Cross-linguistic NLP cannot assume that the source and target languages share a set of morphological categories.

Even when two languages both mark the same category, they may draw different numbers of distinctions within it:

  • Number: English marks singular vs plural. Arabic marks singular / dual / plural, requiring a separate form for exactly two of something. Slovenian also has a dual number. An MT system from Slovenian to English must collapse the dual into the plural and lose the two-ness information.
  • Gender: English has no grammatical gender (only pronoun gender). French has two genders (masculine/feminine). German has three (masculine/feminine/neuter). Some languages have ten or more noun classes (sometimes called genders), each requiring its own agreement paradigm.
  • Person: English distinguishes first/second/third person. Some languages also distinguish inclusive vs exclusive first person plural — we including the hearer vs we excluding the hearer.

NLP implication: the finer the distinctions within a morphological category, the larger the inflectional paradigm and the more data an NLP system needs to learn each form reliably. Languages with large case inventories and complex verbal morphology (such as Finnish or Georgian) or with ten or more noun classes (such as Swahili) typically require substantially more annotated training data than English for the same task performance. This contributes to the performance gap between high-resource and low-resource languages in NLP.
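The growth of a paradigm with the number of distinctions is simple multiplication, sketched below. The counts come from the examples in this unit (two vs three number values; six cases and four gender/number combinations for Russian adjectives); the function and variable names are hypothetical.

```python
from math import prod


def paradigm_cells(distinctions: dict[str, int]) -> int:
    """Number of paradigm cells: the product of the values per category."""
    return prod(distinctions.values())


english_noun = {"number": 2}                      # singular, plural
arabic_noun_number = {"number": 3}                # singular, dual, plural
russian_adjective = {"case": 6, "gender_number": 4}

print(paradigm_cells(english_noun))        # 2
print(paradigm_cells(arabic_noun_number))  # 3
print(paradigm_cells(russian_adjective))   # 24
```

This counts cells, not distinct surface forms: syncretism usually makes the number of distinct forms smaller, but a generator still has to choose the right cell each time.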

4

Worked Examples: Morphosyntax in Action

Work through these examples to see how morphosyntactic features operate across several languages and what challenges they present for NLP.

Languages differ fundamentally in which TAM features are grammatically obligatory. The table below compares four languages across tense, aspect, and evidentiality marking:

Language | Tense morphology? | Aspect morphology? | Evidentiality? | Notes
English | Yes (past/non-past) | Partial (-ing progressive) | No (lexical only) | Tense-prominent
Turkish | Yes (past/present/future) | Yes (progressive/habitual/perfect) | Yes (direct vs hearsay) | Evidential choice obligatory in the past tense
Mandarin | No (time adverbs used) | Yes (particles: -le, -guo, -zhe) | No | Aspect-prominent; no tense morpheme
Arabic | No (perfect/imperfect are aspect) | Yes (perfect / imperfect) | No | Temporal location inferred from context

NLP challenge — MT between these languages: translating from Mandarin to Turkish requires a system to (1) infer temporal location from adverbs and context (not present in Mandarin morphology), (2) determine whether the speaker has direct or indirect evidence for the proposition (not present anywhere in the Mandarin source), and (3) generate the correct Turkish suffix encoding all three pieces of information. This is not a word-alignment problem — it is an inference and generation problem.

Case systems vary in size and in how they encode grammatical function. Here are Japanese particles and German case declension side by side:

Japanese case particles:

Particle | Case/Function | Example | Gloss
-ga | Nominative (subject) | inu-ga hashitta | The dog ran
-wo | Accusative (object) | neko-wo mita | Saw the cat
-ni | Dative / Locative | gakkō-ni itta | Went to school
-no | Genitive (possessor) | sensei-no hon | The teacher's book
-de | Instrumental / Locative | basu-de kita | Came by bus
-kara | Ablative (from) | tōkyō-kara | From Tokyo

German definite article declension:

Case | Masculine | Feminine | Neuter | Plural
Nominativ | der | die | das | die
Akkusativ | den | die | das | die
Dativ | dem | der | dem | den
Genitiv | des | der | des | der

Contrast with English: English has almost no nominal case morphology — only pronouns retain case (I / me / my; he / him / his). English uses word order to indicate grammatical function: subject precedes verb, object follows. Japanese and German use case markers; word order is freer. Parsers for Japanese and German must read case markers; English parsers can rely on position.
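Reading case from a marker is not always a one-to-one lookup. The sketch below inverts the German definite article table above to list every case and gender/number cell that each article form could realise; `analyses` and the column labels are hypothetical names.

```python
from collections import defaultdict

COLUMNS = ["masc", "fem", "neut", "pl"]

# The German definite article declension from the table above.
ARTICLES = {
    "nom": ["der", "die", "das", "die"],
    "acc": ["den", "die", "das", "die"],
    "dat": ["dem", "der", "dem", "den"],
    "gen": ["des", "der", "des", "der"],
}


def analyses(table: dict[str, list[str]]) -> dict[str, list[tuple[str, str]]]:
    """Invert the paradigm: map each surface form to its possible cells."""
    readings = defaultdict(list)
    for case, row in table.items():
        for column, form in zip(COLUMNS, row):
            readings[form].append((case, column))
    return dict(readings)


READINGS = analyses(ARTICLES)
print(len(READINGS))    # 6 distinct forms for 16 cells
print(READINGS["der"])  # four possible readings
```

Because six forms cover sixteen cells, an article alone often leaves the case ambiguous; a parser typically has to combine it with the noun's gender and other cues in the clause.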

Spanish verb agreement — hablar (to speak), present indicative:

Form | Person/Number | Subject pronoun (optional)
hablo | 1st sg. | yo
hablas | 2nd sg. | tú
habla | 3rd sg. | él/ella/usted
hablamos | 1st pl. | nosotros
habláis | 2nd pl. | vosotros
hablan | 3rd pl. | ellos/ellas/ustedes

Because each verb form encodes person and number, the subject pronoun is typically omitted in natural Spanish — the verb form alone identifies the subject. This is pro-drop: subject pronouns are optional because the agreement morphology makes them redundant.
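Recovering a dropped subject from the verb ending can be sketched as a lookup over the paradigm above. The sketch assumes the stem is already known and handles only the regular present indicative of hablar; `ENDINGS` and `subject_features` are hypothetical names.

```python
# Present indicative endings of hablar, taken from the table above.
ENDINGS = {
    "o": (1, "sg"),
    "as": (2, "sg"),
    "a": (3, "sg"),
    "amos": (1, "pl"),
    "áis": (2, "pl"),
    "an": (3, "pl"),
}


def subject_features(form: str, stem: str = "habl") -> tuple[int, str]:
    """Return (person, number) of the subject implied by the verb ending."""
    if not form.startswith(stem):
        raise ValueError(f"{form!r} does not start with stem {stem!r}")
    return ENDINGS[form[len(stem):]]


print(subject_features("hablo"))     # (1, 'sg')
print(subject_features("hablamos"))  # (1, 'pl')
print(subject_features("hablan"))    # (3, 'pl')
```

The result gives person and number only. For third person forms the referent (él, ella, or usted) still has to be resolved from context, since the verb does not mark gender.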

French noun–adjective agreement:

Noun | Gender/Number | Adjective form | Full phrase
livre (book) | M SG | grand | un grand livre
table (table) | F SG | grande | une grande table
livres (books) | M PL | grands | de grands livres
tables (tables) | F PL | grandes | de grandes tables

AI-generated French agreement errors: this is a documented and recurring failure mode of large language models. Models frequently generate forms like *un grande livre (feminine adjective with masculine noun) or *les grand livres (singular adjective with plural noun). The errors cluster around: (a) nouns with non-transparent gender; (b) long distance between noun and adjective; (c) coordinated NPs. The statistical tendency of the model overrides the structural agreement requirement in complex contexts.
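An explicit agreement check of the kind that could flag such errors can be sketched with a toy lexicon. The two dictionaries below contain only the forms from the table above, and `agrees` is a hypothetical function; a real checker would need a full lexicon and a parser to find which adjective modifies which noun.

```python
# Toy lexicon: gender and number for the forms in the table above.
NOUNS = {
    "livre": ("m", "sg"),
    "table": ("f", "sg"),
    "livres": ("m", "pl"),
    "tables": ("f", "pl"),
}
ADJECTIVES = {
    "grand": ("m", "sg"),
    "grande": ("f", "sg"),
    "grands": ("m", "pl"),
    "grandes": ("f", "pl"),
}


def agrees(adjective: str, noun: str) -> bool:
    """Return True if the adjective form matches the noun's gender and number."""
    return ADJECTIVES[adjective] == NOUNS[noun]


print(agrees("grand", "livre"))   # True: un grand livre
print(agrees("grande", "livre"))  # False: *un grande livre
print(agrees("grand", "livres"))  # False: *les grand livres
```

The check is trivial once the controller is identified; the hard part in running text is identifying it, which is where long distances and coordinated NPs cause trouble.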

The following table shows which morphological features each of six languages marks grammatically. A "yes" means the feature is obligatorily and morphologically marked on at least one word class; "partial" means it is marked in some constructions; "no" means it is expressed lexically or not at all.

Feature | English | French | German | Turkish | Japanese | Mandarin
Tense | Yes | Yes | Yes | Yes | Yes (past/non-past) | No
Grammatical gender | No | Yes (2) | Yes (3) | No | No | No
Case | Partial (pronouns) | Partial (pronouns) | Yes (4) | Yes (6) | Yes (particles) | No
Evidentiality | No | No | No | Yes | Partial | No
Honorifics | No | Partial (tu/vous) | Partial (du/Sie) | No | Yes | No
Definiteness | Yes (articles) | Yes (articles) | Yes (articles) | No | No | No

Key observation: no single language encodes all features. English has articles and tense but only residual case (on pronouns) and no evidentiality or honorifics. Turkish has case and evidentiality but no articles or grammatical gender. An NLP system designed around English grammatical categories will lack representations for the features English omits, and this is not merely a data problem but an architectural one.
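The mismatch between two languages' feature inventories can be sketched with set difference. The sets below record only the "Yes" cells for English, Turkish, and Mandarin from the table above, treating "Partial" as unmarked (a simplifying assumption); `MARKS` and `transfer_gap` are hypothetical names.

```python
# Features marked "Yes" in the table above; "Partial" is treated as
# unmarked here, which is a simplification made for this sketch.
MARKS = {
    "English": {"tense", "definiteness"},
    "Turkish": {"tense", "case", "evidentiality"},
    "Mandarin": set(),
}


def transfer_gap(source: str, target: str) -> dict[str, set[str]]:
    """Features a translator must infer, and features with no target slot."""
    return {
        "must_infer": MARKS[target] - MARKS[source],
        "lost_or_demoted": MARKS[source] - MARKS[target],
    }


print(transfer_gap("Mandarin", "Turkish"))
print(transfer_gap("Turkish", "English"))
```

For Mandarin to Turkish every obligatory Turkish feature lands in `must_infer`; for Turkish to English, definiteness must be inferred while case and evidentiality have no grammatical slot and must be dropped or expressed lexically.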

5

Check Your Understanding

Answer each question before reading the explanation that follows it.

In Spanish, the verb form 'habla' encodes third person, singular, present tense, indicative mood. Which concept states that this morphological marking on the verb determines what it can combine with in a sentence?

Answer: Concept #28. Morphosyntax describes how the morphemes in a word affect its combinatoric potential. The form 'habla' can only appear with a third-person singular subject; *yo habla is ungrammatical because the person/number morpheme conflicts with the first-person subject. The morphological features encoded in the verb form directly constrain which subjects it can combine with.

A machine translation system translates from Mandarin Chinese into Turkish. The source language has no tense morphology, but Turkish requires both tense and evidentiality marking on every verb. Which two concepts best explain this translation challenge?

Answer: Concepts #29 and #42. Concept #29 states that verbs can encode TAM morphologically. Concept #42 states that languages vary in which kinds of information they mark morphologically. Mandarin marks aspect but not tense; Turkish marks both tense and evidentiality obligatorily. The translation system must infer temporal and evidential information that is not present in the Mandarin source and generate the appropriate Turkish morphology: a problem of inference and generation, not merely transfer.

AI Dimension

Four issues from this unit have direct bearing on how AI language systems work and where they fail:

  • Agreement consistency in generated text — LLMs frequently produce agreement errors in morphologically complex languages such as French, Russian, and Spanish. They learn statistics over surface forms rather than tracking agreement features structurally across a phrase (#38, #39, #40). The errors cluster in contexts where agreement requires long-distance feature tracking or where noun gender is non-transparent. This is a known, documented failure mode — not a peripheral edge case.
  • Feature representation in neural models — neural NLP systems represent morphosyntactic features implicitly, as distributed vectors, rather than as discrete, trackable feature bundles (#28, #29). Rule-based and structured prediction systems model features explicitly and are more reliable for morphosyntactically demanding tasks — but they require extensive hand-engineering. The trade-off between implicit learned representations and explicit feature-tracking is a live issue in NLP system design.
  • Cross-linguistic transfer — an LLM trained primarily on English will lack rich representations for morphological features that English does not have. Evidentiality (#33), honorifics (#35), and large case paradigms (#31) are underrepresented in English-centric training data. Multilingual models partially address this, but the performance gap between English and low-resource morphologically complex languages reflects, in part, the mismatch between assumed and actual morphological categories (#42, #43).
  • Hallucination and evidentiality — LLMs generate statements with identical linguistic confidence regardless of whether the proposition is known, inferred, or fabricated. In a language with grammatical evidentiality (Turkish, Bulgarian), every assertion would be obligatorily marked for its epistemic status (#33). The absence of grammatical evidentiality in English — and hence in the training text of English-centric LLMs — is part of the structural explanation for why confident-sounding hallucination is so natural in these systems. The form of the language enables it.
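The inventory mismatch behind the Mandarin-to-Turkish translation question and the cross-linguistic transfer point (#42) can be pictured as a set difference. The sketch below is a toy illustration only: the inventories list just the categories this unit mentions for each language and are deliberately incomplete, and the function name `features_to_infer` is a hypothetical helper, not part of any library.

```python
# Toy feature inventories, restricted to what this unit states about each
# language. They are deliberately incomplete (assumption for illustration).
MANDARIN_MARKS = {"aspect"}
TURKISH_REQUIRES = {"tense", "evidentiality"}


def features_to_infer(source_marks, target_requires):
    """Return the categories the target obliges the translator to express
    but which the source does not mark morphologically."""
    return sorted(target_requires - source_marks)


missing = features_to_infer(MANDARIN_MARKS, TURKISH_REQUIRES)
print(missing)
```

Everything in `missing` has to be recovered from context before the target verb can be generated, which is why the task counts as inference and generation rather than transfer.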
6

Activities

Individual task — Spanish morphosyntactic feature analysis

Identify the morphosyntactic features (tense, person, number, gender, mood) encoded in each of the following Spanish verb forms. For each form, state what combinatoric constraints follow from those features — that is, which subject pronouns are compatible with the verb form, and which are not.

  • hablamos
  • hablarías
  • hablaron
  • hablaré
  • hablaba

For each form, structure your answer as: (a) person and number; (b) tense; (c) mood; (d) compatible subject pronouns; (e) incompatible subject pronouns, with a brief explanation of why they are incompatible.

Pair task — Agreement errors in AI-generated text

Use an AI language assistant to generate a short passage (8–10 sentences) in French, Spanish, or German on any topic of your choice. Then examine the generated text carefully for agreement errors.

For each agreement error you find:

  1. Identify which agreement relationship is violated — is it verb–subject agreement (#39) or determiner/adjective–noun agreement (#40)?
  2. State the controller (the element that owns the feature) and the target (the element that should be agreeing with it).
  3. Provide the correct form that should have been generated and explain what features it encodes.

If you find no errors, generate a more complex passage — try writing about an event in the past with multiple noun phrases and adjectives, or prompting the system to use a wider range of vocabulary. Discuss why agreement errors are more likely in some constructions than others.

Group task — Morphosyntactic feature matrix

Create a morphosyntactic feature matrix for five languages from different families. Your selection must include at least one SOV language, one VSO language, and one topic-prominent language.

For each language, research and record which of the following features are marked morphologically (and if so, on which word class):

  • Tense
  • Aspect
  • Evidentiality
  • Grammatical gender
  • Case
  • Definiteness
  • Honorifics / social register
  • Possessives

Once your matrix is complete, discuss as a group:

  1. Which English-centric NLP assumptions are challenged by your matrix? For example, which categories are assumed to be universal but are in fact absent in some of your languages?
  2. Which language in your matrix would be most difficult for a neural MT system trained predominantly on English data to handle? Justify your answer with reference to concepts #42 and #43.
  3. What would a truly language-neutral NLP architecture need to represent that current English-centric systems do not?

Review

  • #28 — Morphosyntax describes how morphemes in a word affect its combinatoric potential; morphological features constrain syntactic distribution.
  • #29 — Verbs (and adjectives) can morphologically encode tense (location in time), aspect (internal structure of event), and mood (speaker's stance).
  • #30 — Nouns can morphologically encode person, number, and gender; grammatical gender is not biological sex.
  • #31 — Nouns can morphologically encode case; case indicates grammatical function (subject, object, possessor, etc.).
  • #32 — Negation can be morphologically marked; in Turkish and Japanese, negation is a verbal suffix; correct negation scope is critical for NLP sentiment and inference tasks.
  • #33 — Evidentiality can be morphologically marked (Turkish, Bulgarian); English has no grammatical evidentiality, which is structurally related to the hallucination problem in LLMs.
  • #34 — Definiteness can be morphologically marked; expressed as free articles (English, French), suffixes (Romanian), or prefixes (Arabic); absent in Japanese, Mandarin, Russian.
  • #35 — Honorifics can be morphologically marked (Japanese keigo, Korean speech levels); mistaken register selection is socially significant.
  • #36 — Possessives can be morphologically marked on the possessed noun (Turkish, Hungarian); English uses the possessive clitic -'s.
  • #37 — Many further grammatical notions can be morphologically marked (directionality, switch reference, mirativity, telicity); English morphosyntax is not a universal template.
  • #38 — When an inflectional category is marked on multiple elements, it belongs to one (the controller) and expresses agreement on the others (the targets).
  • #39 — Verbs commonly agree with one or more arguments in person, number, and/or gender; agreement can track subject, object, or both (Basque, Swahili).
  • #40 — Determiners and adjectives commonly agree with nouns in number, gender, and case; AI-generated text in French and Russian frequently contains such agreement errors.
  • #41 — Agreement can track a feature not overtly marked on the controller (notional agreement); e.g. collective nouns in British English.
  • #42 — Languages vary in which kinds of information they mark morphologically; MT between languages with different feature inventories requires inference and generation, not just transfer.
  • #43 — Languages vary in how many distinctions they draw within each morphological category; finer distinctions require larger paradigms and more training data for NLP.

Agreement is the morphological phenomenon in which a feature is marked on multiple elements within a phrase or clause, where the feature semantically belongs to only one — the controller — and is copied to the others — the targets.
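One way to make the controller/target relationship concrete is to store features as explicit bundles and compare them. The following is a minimal sketch, assuming a hypothetical `Word` class and `agreement_mismatches` helper; the French forms are the ones used earlier in this unit (le grand chien, la grande chienne).

```python
from dataclasses import dataclass


@dataclass
class Word:
    form: str
    features: dict  # e.g. {"gender": "M", "number": "SG"}


def agreement_mismatches(controller, target, categories):
    """List the categories in which the target's value differs from the
    controller's value. An empty list means the target agrees."""
    return [
        cat for cat in categories
        if target.features.get(cat) != controller.features.get(cat)
    ]


chien = Word("chien", {"gender": "M", "number": "SG"})    # controller
grand = Word("grand", {"gender": "M", "number": "SG"})    # agreeing target
grande = Word("grande", {"gender": "F", "number": "SG"})  # non-agreeing target

ok = agreement_mismatches(chien, grand, ("gender", "number"))
bad = agreement_mismatches(chien, grande, ("gender", "number"))
print(ok, bad)
```

The check is asymmetric by design: the controller's values are taken as given, and only the target is tested against them.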

The main types of agreement are:

  • Verb–subject agreement (#39): the verb agrees with its subject in person, number, and sometimes gender. In Spanish, the six-way person/number paradigm is obligatory on every finite verb. In Basque, the verb agrees with both subject and object simultaneously.
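  • Adjective–noun agreement (#40): adjectives must carry the gender, number, and case features of the noun they modify. In Russian, this yields a paradigm of 24 cells for the full (long-form) adjective (six cases by three singular genders plus the plural), realised by fewer distinct forms because of syncretism.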
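  • Determiner–noun agreement (#40): determiners also agree with their noun head; in German, the definite article fills 16 paradigm cells (four cases by three singular genders plus a single plural column), although syncretism reduces these to six distinct forms (der, die, das, den, dem, des).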
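The German definite-article paradigm can be laid out as a lookup table, which makes it easy to count paradigm cells separately from distinct surface forms. The article forms are the standard ones; the table layout, the column labels, and the name `definite_article` are illustrative assumptions.

```python
CASES = ("NOM", "ACC", "DAT", "GEN")
# Gender is only distinguished in the singular; the plural is one column.
COLUMNS = ("M.SG", "F.SG", "N.SG", "PL")

FORMS_BY_CASE = {
    "NOM": ("der", "die", "das", "die"),
    "ACC": ("den", "die", "das", "die"),
    "DAT": ("dem", "der", "dem", "den"),
    "GEN": ("des", "der", "des", "der"),
}

# Flatten into one entry per paradigm cell: (case, column) -> form.
PARADIGM = {
    (case, column): form
    for case in CASES
    for column, form in zip(COLUMNS, FORMS_BY_CASE[case])
}


def definite_article(case, column):
    """Look up the article for a given case and gender/number column."""
    return PARADIGM[(case, column)]


cell_count = len(PARADIGM)
distinct_forms = sorted(set(PARADIGM.values()))
print(cell_count, distinct_forms)
```

Because several cells share a form, a surface form such as der is ambiguous on its own: it could be nominative masculine singular, dative or genitive feminine singular, or genitive plural. A system has to use the noun and the syntactic context to work out which cell is intended.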

Agreement is challenging for NLP for three reasons:

  1. Long-distance tracking — agreement must be maintained between controller and target even when other material intervenes. A relative clause between a noun and its predicate adjective can cause a system to lose track of the gender that must be copied.
  2. Covert features (#41) — the feature being agreed with may not be visible on the controller's surface form, requiring access to lexical or conceptual information beyond the text surface.
  3. Cross-linguistic variation in agreement systems (#42, #43) — different languages have different controllers, different targets, and different feature inventories for agreement. A system that models English agreement (minimal) cannot straightforwardly transfer to Russian or Basque.
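The long-distance tracking problem can be illustrated with a small contrast between a surface heuristic and a structurally informed choice. This is a sketch under stated assumptions: the nearest-noun heuristic is a deliberately crude stand-in for reliance on surface proximity, not a claim about how any particular neural model works, and `SUBJECT_INDEX` is a hand-written annotation standing in for the output of a parser. The example sentence is La maison que les voisins ont construite est ..., where the predicate adjective must agree with maison.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Token:
    form: str
    pos: str
    gender: Optional[str] = None
    number: Optional[str] = None


# Forms of the French adjective "grand" by gender and number.
ADJ_FORMS = {
    ("M", "SG"): "grand", ("F", "SG"): "grande",
    ("M", "PL"): "grands", ("F", "PL"): "grandes",
}

TOKENS = [
    Token("La", "DET"),
    Token("maison", "NOUN", "F", "SG"),   # true controller
    Token("que", "PRON"),
    Token("les", "DET"),
    Token("voisins", "NOUN", "M", "PL"),  # intervening noun
    Token("ont", "AUX"),
    Token("construite", "VERB"),
    Token("est", "AUX"),
]
SUBJECT_INDEX = 1  # hand annotation (assumption), not computed


def nearest_noun(tokens, position):
    """Surface heuristic: pick the closest noun to the left of position."""
    for token in reversed(tokens[:position]):
        if token.pos == "NOUN":
            return token
    raise ValueError("no noun precedes this position")


def inflect(controller):
    """Copy the controller's gender and number onto the adjective."""
    return ADJ_FORMS[(controller.gender, controller.number)]


surface_guess = inflect(nearest_noun(TOKENS, len(TOKENS)))
structural_choice = inflect(TOKENS[SUBJECT_INDEX])
print(surface_guess, structural_choice)
```

The heuristic copies features from voisins and yields the ungrammatical grands, while the structurally identified controller yields grande. The more material intervenes, the more opportunities a proximity-based strategy has to select the wrong controller.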

The practical consequence is that agreement errors are among the most frequent grammatical failures of NLP generation systems in morphologically complex languages — and they are a direct reflection of the structural limitations of systems that learn surface statistics rather than tracking grammatical features explicitly.

Proceed to Unit 5: Syntax when ready.