Unit 9: Semantic and Syntactic Mismatches
Learning Objectives
- Describe the main types of syntactic-semantic mismatch — passive, raising, control, expletives, long-distance dependencies
- Explain how passive and related constructions rearrange the mapping between semantic roles and grammatical functions
- Identify raising and control verbs and distinguish their different argument structure properties
- Analyse how long-distance dependencies and argument drop challenge NLP parsing and reference resolution
Reading
Read Chapter 9 (Semantic and Syntactic Mismatches) of Bender, E. M. (2013). Linguistic Fundamentals for Natural Language Processing: 100 Essentials from Morphology and Syntax. Morgan & Claypool. Use the course materials below to activate and consolidate the concepts from that chapter.
Core Input
Read through each tab. Take notes on the key ideas before moving to the activities.
The preceding units showed that syntactic structure provides scaffolding for semantic interpretation. But the mapping is not always direct: the same syntactic position can bear different semantic roles in different constructions, and the same semantic participant can appear in different syntactic positions. These mismatches arise from a set of well-studied grammatical constructions:
- Passive and related constructions — demote the Agent; promote the Patient
- Causatives — add an argument and shift positions of others
- Expletives — fill syntactic argument positions with semantically empty elements
- Raising verbs — provide a syntactic argument position with no local semantic role
- Control verbs — bind an argument syntactically and semantically across clause boundaries
- Long-distance dependencies — separate arguments from their associated heads
- Argument drop — omit arguments that are contextually recoverable
Understanding these constructions is essential for NLP systems that aim to interpret meaning, not just parse structure. A system that assumes subject = Agent and object = Patient will fail on all of them.
Passive (#84) — the core mismatch construction. In the active, Agent = subject and Patient = object. In the passive, Patient = subject; Agent = oblique (by-phrase) or omitted.
English: "The dog bit the man" → "The man was bitten by the dog." The man moves to subject position; the dog demotes to the by-phrase.
Related constructions (#85):
- Impersonal passive — in some languages, passives can be formed from intransitive verbs with no subject. German: "Es wurde getanzt" (it was danced = "there was dancing").
- Middle — "This bread cuts easily." No agent is expressed; the subject is Patient; the predication is about a property of the subject, not a specific event.
NLP: Passive voice accounts for 25–35% of main-clause verbs in some academic genres. Systems that read only surface structure will misidentify the Patient as the Agent.
Expletives (#88, #89) — semantically empty syntactic subjects.
- English "it" in weather constructions: "It is raining." — it has no referent; it merely fills the subject position required by English syntax.
- English "there" in existential constructions: "There is a problem." — there fills subject position; the logical subject is "a problem."
Raising verbs (#90): "She seems to know the answer." Seem does not take a semantic subject. She is raised from the embedded clause, where she is the subject of know. Syntax has a subject for seem; semantics does not assign that subject a role with respect to seem. Evidence: expletive it can serve as the subject of seem instead ("It seems that she knows the answer"), confirming that seem does not require a meaningful subject.
Control verbs (#91) — "She tried to leave." Try takes a subject (she) and an infinitival complement (to leave). She is both the syntactic and semantic subject of tried AND the understood subject of leave. Contrast with raising: with control, the matrix subject has a semantic role with respect to the matrix predicate; with raising it does not.
Key Concepts A (Concepts #83–#91)
Expand each concept. Think about your answer before reading the explanation.
The phenomena covered in this chapter each obscure the surface-to-semantics mapping in a different way:
- Passive — Agent is not the subject; Patient is
- Causatives — a new Causer argument displaces the original subject
- Dative shift — the same participant appears in two different syntactic positions depending on construction
- Raising — the syntactic subject of the matrix clause has no semantic role in that clause
- Control — an implicit argument is bound across clause boundaries
- Expletives — syntactic subject positions are filled by semantically empty elements
- Long-distance dependencies — arguments are displaced from their associated heads
- Argument drop — arguments are omitted from the surface entirely
- Coordinated structures — one argument is shared by multiple predicates
NLP implication: a system that assumes subject = Agent and object = Patient will fail on all of these constructions. Robust semantic interpretation requires construction-aware processing.
Formal description of the English passive:
- The subject of the active (typically the Agent) is demoted to an oblique by-phrase or omitted entirely.
- The direct object of the active (typically the Patient) is promoted to subject position.
- English morphology: be + past participle; Agent (if expressed) marked by by.
Agentless passive: "The window was broken." — The Agent is omitted entirely and cannot be recovered from the sentence alone. This construction is extremely common in formal written text where the agent is unknown, unimportant, or deliberately concealed.
NLP: Passive is frequent in formal written text (roughly 25–35% of main-clause verbs in some academic genres). SRL systems must identify the Agent even when it is expressed as a by-phrase oblique rather than as the subject.
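The role swap in the formal description can be sketched in a few lines of Python. This is a minimal illustration, not a real SRL component: the `Clause` record and the `semantic_roles` function are hypothetical names, and the grammatical functions and the voice label are assumed to come from an upstream parser.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical clause record: grammatical functions are assumed to be supplied
# by an upstream parser; nothing here analyses raw text.
@dataclass
class Clause:
    subject: str
    verb: str                         # lemma of the main verb
    voice: str                        # "active" or "passive"
    obj: Optional[str] = None         # direct object (active clauses)
    by_phrase: Optional[str] = None   # oblique Agent (passive clauses)

def semantic_roles(clause: Clause) -> dict:
    """Map grammatical functions to Agent/Patient, undoing the passive."""
    if clause.voice == "passive":
        # The Patient was promoted to subject; the Agent is oblique or missing.
        return {"predicate": clause.verb,
                "agent": clause.by_phrase,    # None marks an agentless passive
                "patient": clause.subject}
    return {"predicate": clause.verb,
            "agent": clause.subject,
            "patient": clause.obj}

active = Clause("the dog", "bite", "active", obj="the man")
passive = Clause("the man", "bite", "passive", by_phrase="the dog")
agentless = Clause("the window", "break", "passive")
```

Under these assumptions, `semantic_roles(active)` and `semantic_roles(passive)` return the same dictionary, while the agentless clause yields `agent: None`, which a downstream system can read as "Agent not expressed" rather than guessing one.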
Anti-passive (common in ergative languages): the Patient is demoted; the Agent remains as subject; the construction signals that the event lacks a specific, affected Patient. Found in Chukchi (Chukotko-Kamchatkan) and many other ergative languages, and also reported for Kinyarwanda (Bantu), which is not ergative.
Impersonal passive: the Agent is demoted but no argument is promoted; subject position is filled by an expletive or left empty. German example:
Hier wird getanzt
here is-AUX danced-PASS
“There is dancing here.”
Middle: "This cloth washes well." The subject is the Patient; the construction expresses a general property of the subject (it is easy to wash), not a specific event with an identified Agent.
NLP: each construction produces a different surface argument pattern for the same underlying event type. SRL systems must handle all three variants.
English ditransitive verbs permit two surface realisations:
- "She gave the book to him." — PP dative: him is Oblique (Goal, inside PP); book is Direct Object (Theme).
- "She gave him the book." — Double-object construction: him is Indirect Object (Goal, bare NP); book is Direct Object (Theme).
Both describe the same event with the same semantic roles. But not all verbs allow both constructions:
- "She explained the theory to him." ✓
- *"She explained him the theory." ✗
NLP: Both constructions should extract the same relation (give; giver; given-thing; recipient), but surface argument positions differ. Relation extraction systems that read surface syntax without normalisation will assign different roles to the same participant.
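As a rough sketch of the normalisation step, the function below maps both surface frames of give onto one relation tuple. The function name and its parameters are illustrative assumptions; the arguments are taken to be already identified by a parser.

```python
from typing import Optional, Tuple

def normalise_give(subject: str,
                   first_object: str,
                   second_object: Optional[str] = None,
                   to_phrase: Optional[str] = None) -> Tuple[str, str, str, str]:
    """Return (predicate, giver, given_thing, recipient) for either frame."""
    if to_phrase is not None:
        # PP dative: "She gave the book to him."
        return ("give", subject, first_object, to_phrase)
    if second_object is None:
        raise ValueError("double-object frame needs two objects")
    # Double object: "She gave him the book." The first bare NP is the Goal.
    return ("give", subject, second_object, first_object)

pp_dative = normalise_give("she", "the book", to_phrase="him")
double_object = normalise_give("she", "him", second_object="the book")
```

Both calls produce `("give", "she", "the book", "him")`. A lexicon would still be needed to record which verbs allow the double-object frame at all (give does, explain does not).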
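Causative constructions add a Causer argument as the new subject; the original subject (the causee) is demoted, typically to object when the base verb is intransitive and to a dative or oblique when it is transitive (compare the dative -ni on gakusei in the Japanese sentence below):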
- Turkish: gel- (come) → gel-dir- (cause to come / bring)
- Japanese: 食べる (taberu, eat) → 食べさせる (tabe-saseru, cause to eat)
先生が学生に本を読ませた
Sensei-ga gakusei-ni hon-wo yom-ase-ta
teacher-NOM student-DAT book-ACC read-CAUS-PAST
“The teacher made the student read the book.”
Lexical causative in English: "The doctor grew the bacteria" — a causative use of an otherwise intransitive verb. NLP: causative constructions add an argument and shift the position of the original subject; SRL must detect this alternation and correctly assign the Causer role.
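One way to picture the added argument is as a wrapper around the base event. The sketch below is a toy representation under assumed names (`causativise`, and the dictionary keys), not a standard SRL format.

```python
def causativise(base_event: dict, causer: str) -> dict:
    """Wrap a base event in a CAUSE predicate that introduces a new Causer."""
    return {
        "predicate": "CAUSE",
        "causer": causer,                   # new subject of the causative
        "causee": base_event["subject"],    # original subject, now demoted
        "caused_event": base_event,         # the base event is kept intact
    }

# "The student read the book." -> "The teacher made the student read the book."
read_event = {"predicate": "read", "subject": "student", "object": "book"}
made_read = causativise(read_event, causer="teacher")
```

The base predicate keeps its own arguments inside `caused_event`, so the student is still recoverable as the reader even though the teacher is now the surface subject.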
Expletives, pleonastic pronouns, and dummy auxiliaries fill syntactic positions that have no semantic content:
- "It is easy to leave." — it is semantically empty; to leave is the logical subject.
- "There are three solutions." — there is empty; three solutions is the logical subject.
- French: "Il pleut" (it-rains) — expletive subject.
Mandarin Chinese has relatively few such constructions, but some discourse-initial particles serve a similar structural function.
NLP: Empty elements occupy syntactic argument positions and must be identified as non-referential to avoid incorrect entity extraction. Coreference resolution systems must not link expletive it or there to prior entities in the discourse.
Key properties of expletives:
- Syntactically present, semantically empty
- Cannot be questioned: *"What rained?"
- Cannot be replaced by a referential expression: *"That rained."
Examples: "It rained." / "It seems that she left." / "There is a problem." / "There appear to be errors."
NLP applications affected by expletive detection:
- Coreference resolution — expletive it must not be linked to a prior referent.
- SRL — the expletive position should not receive a semantic role label.
- Machine translation — many languages do not use expletives; the logical subject must be identified and translated appropriately.
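A first-pass expletive filter can be approximated with a few lexical patterns. The sketch below is a deliberately crude heuristic under stated assumptions: the word lists are tiny illustrative samples, the input is pre-tokenised, and the function name `is_expletive` is hypothetical. It covers only weather verbs, "it seems that", and existential or raising "there".

```python
WEATHER = {"rain", "rains", "rained", "raining",
           "snow", "snows", "snowed", "snowing"}
RAISING = {"seem", "seems", "seemed", "appear", "appears", "appeared"}
BE = {"is", "are", "was", "were", "be"}

def is_expletive(tokens, i):
    """Guess whether tokens[i] is a non-referential 'it' or 'there'."""
    word = tokens[i].lower()
    rest = [t.lower() for t in tokens[i + 1:i + 4]]   # small look-ahead window
    if word == "it":
        if any(t in WEATHER for t in rest):           # "It is raining."
            return True
        # "It seems that she left."
        return len(rest) >= 2 and rest[0] in RAISING and rest[1] == "that"
    if word == "there":
        # "There is a problem." / "There appear to be errors."
        return bool(rest) and (rest[0] in BE or rest[0] in RAISING)
    return False

weather_it = is_expletive(["It", "rained", "heavily", "."], 0)
referential_it = is_expletive(["It", "ruined", "the", "crops", "."], 0)
```

The window-based weather rule over-generates (for example on "It stopped the rain"), and extraposition cases such as "It is easy to leave" are not covered, which is why practical systems rely on parsed structure or a trained classifier rather than patterns like these.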
"She seems to know the answer."
Seem does not assign a semantic role to its syntactic subject. She is raised from the embedded clause (she = subject of know). Evidence for raising:
- "It seems that she knows the answer" — same semantics; the expletive it can freely replace she as the syntactic subject of seem, confirming that seem does not require a meaningful subject.
- "There seem to be problems": the semantically empty there appears as subject of seem because the embedded predicate licenses it, showing that seem imposes no semantic restriction of its own on its subject.
NLP: raising verbs create a mismatch — the syntactic subject of the matrix clause is actually the semantic argument of the embedded predicate. SRL systems must detect this and assign the semantic role to the embedded predicate, not the matrix one.
"She tried to leave." — try is a subject-control verb.
- She is the syntactic and semantic subject of tried (she is performing the trying).
- She is also the understood subject of leave (she is the one who would leave).
- The two subjects are coreferential.
Contrast with raising: with a control verb, the matrix subject has a semantic role with respect to the matrix predicate (try entails that its subject is attempting something). Diagnostic: *"It tried that she left" is ungrammatical, because try requires a meaningful subject.
Object control: "She persuaded him to leave" — him is the object of persuaded AND the understood subject of leave.
NLP: control relations create implicit argument bindings — the understood subject of the embedded predicate must be recovered for full semantic interpretation. This is distinct from explicit coreference; it is licensed by the lexical properties of the control verb.
Key Concepts B (Concepts #92–#97)
Continue with the remaining concepts from Chapter 9.
In some languages, multiple predicates combine to license a single argument structure:
- Japanese compound verbs: two verbs fuse into a unit. 食べ始める (tabehajimeru, "start eating") — the arguments of the whole complex are distributed across both verbal elements.
- Urdu/Hindi light verb constructions: a noun or adjective combines with a light verb (do, give, take) to express an event: kām karnā (work do-INF = "to work"). The semantic content is in the noun; the argument licensing is shared.
NLP: argument extraction from complex predicate constructions requires joint processing of multiple predicates. Sequential independent extraction of one predicate at a time will miss the full argument structure.
Coordination creates non-standard dependency patterns:
- One-to-many: "John and Mary left." — John is subject of left AND Mary is subject of left; two subjects share one verb.
- Many-to-one: "She sang and danced." — she is subject of both sang and danced; one subject is shared across two verbs.
- Gapping: "She visited Paris, and he, London." — the verb visited is omitted from the second conjunct; its presence must be inferred.
NLP: coordination creates shared-argument dependencies (one argument with several heads, or one argument slot with several fillers) that a strict one-head-per-word dependency tree cannot encode directly. Gapping requires dedicated parser handling, and many state-of-the-art parsers still struggle with it.
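The three coordination patterns can be made explicit by expanding them into plain predicate-argument pairs. The helpers below are a minimal sketch with hypothetical names; they assume the conjuncts have already been identified.

```python
def expand_coordination(subjects, verbs):
    """Pair every coordinated subject with every coordinated verb."""
    return [(verb, subject) for verb in verbs for subject in subjects]

def restore_gapped_verb(first, second):
    """Copy the verb of the first conjunct into a verbless second conjunct.

    Each conjunct is a (subject, verb, object) triple; a gapped verb is None.
    """
    subject, verb, obj = second
    return (subject, first[1] if verb is None else verb, obj)

# "John and Mary left."
shared_verb = expand_coordination(["John", "Mary"], ["left"])
# "She sang and danced."
shared_subject = expand_coordination(["she"], ["sang", "danced"])
# "She visited Paris, and he, London."
gapped = restore_gapped_verb(("she", "visited", "Paris"), ("he", None, "London"))
```

After expansion every predicate has its own complete argument list, which is the form most relation-extraction steps expect.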
In questions, relative clauses, and topicalisations, an argument is displaced from its canonical position to a position earlier in the sentence — potentially far away:
- Wh-question: "What₁ did the dog chase t₁?" — what is the object of chase but appears sentence-initially.
- Relative clause: "The cat₁ [that the dog chased t₁]": the gap is inside the relative clause.
- Topicalisation: "This book₁ I read t₁ last year."
- Deeply embedded: "What₁ did she say [that she believed [that he thought [that she had done t₁]]]?"
NLP: long-distance dependencies require unbounded dependency grammars or equivalent mechanisms. Transformer attention can in principle relate a filler to a gap at any distance within the context window, but LLMs can still underperform on deeply nested long-distance dependencies, where the filler has to be linked to a gap several clauses down.
Most languages keep modifiers adjacent to their head noun. Some allow or require separation:
- German extraposition: a relative clause may be moved to the end of the main clause even when its head noun is earlier: "Ein Mann ist gekommen, [den niemand kannte]" (A man has come who nobody knew). The relative clause is extraposed away from Mann.
- Dutch: PP modifiers of NPs can appear clause-finally, separated from their head.
- English extraposition: "A man arrived [who nobody knew]" — relative clause is separated from head noun man.
- Japanese (for contrast): relative clauses are pre-nominal and stay adjacent to the noun they modify, but they can be very long, so the head noun may arrive far from the start of its modifier.
NLP: parsers that assume adjacency for modifier attachment will fail on these constructions. Long-distance modifier attachment is a known source of parser error.
Languages vary considerably in how freely arguments can be omitted:
- Japanese: subjects and objects are freely dropped when contextually recoverable. 行った (itta, "went") — who went? Recoverable from context.
- Spanish pro-drop: subject pronouns are omitted because verb agreement encodes person and number. "Habla bien" (speaks well = "he/she speaks well").
- English object drop: "She was eating" — implicit object; the thing being eaten is unspecified.
- Italian: "Piove" (rains = "it is raining") — impersonal construction without an expletive subject.
NLP: argument drop requires systems to recover the referent of the dropped argument. This is zero-anaphora resolution, a major challenge in Japanese, Chinese, and pro-drop languages generally — even state-of-the-art models perform significantly below human levels on these tasks.
Not all dropped arguments resolve to a specific contextual referent:
- Definite drop (contextually recoverable): in Japanese subject/object drop, the referent is typically definite — the hearer knows who is meant from the preceding discourse. Zero-anaphora resolution aims to identify this antecedent.
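- Indefinite drop (generic/unknown referent): the omitted argument has no specific referent. In English "She was eating", the thing eaten is left unspecified and there is no antecedent to resolve. Languages can also express a generic participant overtly, as with French on, German man, and English one ("One should always be careful"); those are overt pronouns rather than dropped arguments, but they likewise have no antecedent.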
NLP: distinguishing definite drop (resolve to a specific antecedent) from indefinite drop (treat as generic/unknown) affects both coreference resolution and semantic role assignment. A coreference system that attempts to resolve every dropped argument to a prior entity will fail on impersonal indefinite constructions.
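The decision described above can be sketched as a small dispatch function. Everything here is an illustrative assumption: the name `resolve_drop`, the `"GENERIC"` label, the idea that an upstream classifier has already labelled the drop as definite or indefinite, and the naive recency heuristic used to pick an antecedent.

```python
from typing import List, Optional

def resolve_drop(kind: str, prior_entities: List[str]) -> Optional[str]:
    """Decide how to treat one omitted argument.

    kind is "definite" or "indefinite", assumed to come from an upstream step.
    """
    if kind == "indefinite":
        return "GENERIC"              # no antecedent search at all
    if kind == "definite":
        # Naive recency heuristic (assumption): most recently mentioned entity.
        return prior_entities[-1] if prior_entities else None
    raise ValueError(f"unknown drop kind: {kind}")

definite_case = resolve_drop("definite", ["Tanaka", "the book"])
indefinite_case = resolve_drop("indefinite", ["Tanaka", "the book"])
```

The point of the sketch is the branching, not the heuristic: an indefinite drop never triggers an antecedent search, so it cannot be wrongly linked to a prior entity.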
Worked Examples
Study each tab carefully. Make sure you can explain the NLP relevance of each example.
The following table shows active and passive equivalents in English, Japanese, and German, with the Agent's position marked in each case.
| Language | Active | Passive | Agent position in passive |
|---|---|---|---|
| English | The police arrested him. | He was arrested by the police. | By-phrase (by the police) |
| Japanese | Keisatsu-ga kare-wo taiho-shita (police-NOM he-ACC arrested) | 彼は警察に逮捕された Kare-wa keisatsu-ni taiho-sareta (he-TOPIC police-by arrested-PASSIVE) | Dative -ni phrase (keisatsu-ni) |
| German | Die Polizei verhaftete ihn. | Er wurde von der Polizei verhaftet. | Von-phrase (von der Polizei) |
| English (agentless) | — | He was arrested. | Agent omitted; must be inferred from world knowledge or context |
In all three languages, the passive moves the Patient to subject position (topic-marked with -wa in the Japanese example). The marking of the demoted Agent differs (English by, Japanese -ni, German von), but the underlying operation is the same (#84). Agentless passive is available in all three and is common in formal registers.
NLP implication: Passive voice must be detected and the Agent must be recovered from the oblique phrase (or flagged as absent). Surface-only SRL will misassign the Patient as the Agent in all passive sentences across all three languages.
Minimal pairs that distinguish raising from control verbs:
| Construction | Raising: seem | Control: want / try |
|---|---|---|
| Matrix sentence | She seems to know the answer. | She wants to leave. |
| Expletive substitution | It seems that she knows the answer. ✓ (same meaning) | *It wants that she leaves. ✗ (different meaning/ungrammatical) |
| Expletive subject | There seem to be problems. ✓ | *There wants to be a party. ✗ |
| Matrix subject role | she has NO semantic role with respect to seem | she IS the semantic subject of want (she desires something) |
| Embedded subject | she is the subject of know — raised | she is the understood subject of leave — controlled |
Object control: "She persuaded him to leave." — him is the object of persuade and also the understood subject of leave.
NLP implication: Distinguishing raising from control is essential for correct SRL. The syntactic subject of a raising verb is the semantic argument of the embedded predicate, not the matrix predicate. SRL systems that assign a role to the matrix predicate for raising verbs introduce a systematic error.
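The contrast in the table can be encoded as a small lexicon lookup. This is a toy sketch: the verb sets contain only the verbs discussed in this unit, the function name `predicate_arguments` is hypothetical, and only nominal participants are listed (the embedded clause itself, which is also an argument of the matrix verb, is left out for brevity).

```python
RAISING_VERBS = {"seem", "appear"}    # matrix subject gets no role from the matrix verb
SUBJECT_CONTROL = {"try", "want"}     # matrix subject is also the embedded subject
OBJECT_CONTROL = {"persuade"}         # matrix object is the embedded subject

def predicate_arguments(matrix, embedded, subject, obj=None):
    """Return {predicate: [nominal semantic arguments]} for 'SUBJ V (OBJ) to V2'."""
    if matrix in RAISING_VERBS:
        return {matrix: [], embedded: [subject]}
    if matrix in SUBJECT_CONTROL:
        return {matrix: [subject], embedded: [subject]}
    if matrix in OBJECT_CONTROL:
        return {matrix: [subject, obj], embedded: [obj]}
    raise ValueError(f"verb not in toy lexicon: {matrix}")

raising = predicate_arguments("seem", "know", "she")
control = predicate_arguments("try", "leave", "she")
object_control = predicate_arguments("persuade", "leave", "she", obj="him")
```

With seem, she appears only under know; with try, she appears under both predicates; with persuade, him is shared between the two.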
Long-distance dependencies (LDDs) are created when an argument is moved to the front of the sentence while a gap (t) marks its canonical position:
Simple question:
"What₁ did the researcher claim [that the model had predicted t₁]?"
what is the object of predicted — inside a complement clause of claim
Deeply nested:
"What₁ did she say [that she believed [that he thought [that she had done t₁]]]?"
The gap is three embedding levels deep; what must be linked back across all intervening clauses
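Embedding depth can be read directly off the bracketed notation used in these examples. The helper below is a small sketch that assumes the square-bracket and t₁ conventions of this unit; `gap_depth` is a hypothetical name.

```python
def gap_depth(bracketed: str, gap: str = "t₁") -> int:
    """Count how many clause brackets enclose the first gap marker."""
    depth = 0
    # Put spaces around brackets so each one becomes its own token.
    spaced = bracketed.replace("[", " [ ").replace("]", " ] ")
    for token in spaced.split():
        if token == "[":
            depth += 1
        elif token == "]":
            depth -= 1
        elif token.strip('?.,"') == gap:   # whole-token match, so "What₁" is skipped
            return depth
    raise ValueError("no gap marker found")

simple = gap_depth("What₁ did the researcher claim [that the model had predicted t₁]?")
nested = gap_depth("What₁ did she say [that she believed [that he thought [that she had done t₁]]]?")
```

Matching the gap as a whole token matters here: a plain substring search would stop at the "t₁" inside "What₁" and report depth 0.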
Subject vs object relative clauses:
- "The dog that chased the cat" — subject relative; gap is in subject position (easier)
- "The cat that the dog chased" — object relative; gap is in object position (harder for humans and NLP)
This asymmetry has been reported for many languages and is a recurring challenge for NLP systems. LLMs with transformer attention can in principle span these distances, but performance can degrade on deeply nested LDDs (#94).
Argument drop patterns across three languages:
Japanese — subject and object drop:
田中さんは知っています
Tanaka-san-wa shitte-imasu
Tanaka-TOP knows
“Tanaka knows [it / that].” — Object is dropped; recoverable from prior discourse.
Spanish — subject pro-drop:
"Habla bien."
speaks-3SG well
“He/she speaks well.” — Subject dropped; person/number encoded in the verbal suffix -a (3rd sg).
Mandarin Chinese — topic drop:
我买了一本书。很有意思。
Wǒ mǎi-le yī běn shū. Hěn yǒu yìsi.
I buy-PERF one CL book. Very interesting.
“I bought a book. [It] is very interesting.” — The topic (the book) is dropped in the second clause.
NLP implication (#96, #97): Zero-anaphora resolution for Japanese requires looking several sentences back in the discourse. Even state-of-the-art models perform significantly below human levels. For Spanish, verbal agreement provides a cue to person and number but not to the specific referent. For Mandarin, topic-drop requires discourse-level tracking across potentially long spans.
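The Spanish case can be made concrete with a toy suffix lookup. The sketch assumes a regular -ar verb in the present indicative and ignores ambiguity (habla is also an informal imperative form); the table and the function name `dropped_subject_features` are illustrative.

```python
# Present-indicative endings of regular -ar verbs, longest suffix first.
AR_PRESENT_ENDINGS = [
    ("amos", (1, "pl")),
    ("áis", (2, "pl")),
    ("as", (2, "sg")),
    ("an", (3, "pl")),
    ("o", (1, "sg")),
    ("a", (3, "sg")),
]

def dropped_subject_features(verb_form: str):
    """Recover person/number of a dropped subject from the verb ending."""
    form = verb_form.lower()
    for ending, (person, number) in AR_PRESENT_ENDINGS:
        if form.endswith(ending):
            # Agreement narrows the search but does not name the referent.
            return {"person": person, "number": number, "referent": None}
    return None

habla = dropped_subject_features("Habla")
hablamos = dropped_subject_features("hablamos")
```

For "Habla bien" the lookup yields third person singular with `referent` still `None`: the agreement suffix constrains the antecedent search, and the discourse has to supply the rest.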
Check Your Understanding
Select the best answer for each question.
The sentence 'It seems that the experiment failed' can be paraphrased as 'The experiment seems to have failed.' In the second sentence, 'the experiment' is the grammatical subject of 'seems.' Which concept describes this construction?
In Japanese, subjects and objects can be dropped when contextually recoverable, whereas the omitted object in English 'She was eating' is indefinite rather than contextually specific. Which concept captures this distinction?
Hallucination and passive voice
LLMs frequently produce passive constructions that omit agents: "the data was collected," "errors were introduced," "the decision was made." This conceals agency and accountability (#84, #82). When asked to identify who performed an action in an agentless passive, LLMs often hallucinate an Agent or incorrectly identify the Patient as the doer — a direct consequence of failing to distinguish surface subject from semantic Agent.
Pragmatic failure and expletives
LLMs sometimes treat expletive it or there as referential, linking them to prior entities in coreference chains. "It rained heavily. It ruined the crops." — the second it is referential (the rain); the first is expletive. Confusing the two leads to incorrect entity tracking and downstream errors in summarisation, QA, and relation extraction (#88, #89).
Context windows and long-distance dependencies
LLMs can underperform on sentences whose long-distance dependencies span many intervening clauses (#94). This is a known limitation: the model can lose track of the filler (what/who) by the time it reaches the gap position deep in the structure. Complex nested wh-questions and multi-clause relative clauses tend to reduce LLM accuracy relative to simpler sentences.
Argument drop and zero-anaphora
LLMs trained primarily on English text struggle with zero-anaphora in Japanese, Chinese, and Korean (#96, #97). The dropped argument may refer to an entity mentioned several sentences earlier in the discourse; without explicit pronouns, the model has no surface cue. This is a major cross-lingual NLP challenge where current models perform significantly below human levels — particularly for object drop, where the referent may need to be inferred from pragmatic context rather than grammatical agreement.
Activities
Individual task
For each sentence below, identify any syntactic-semantic mismatch and classify it using concepts #83–#97. For each, state: (a) what the surface syntactic structure is; (b) what the deep semantic structure is; and (c) which concept explains the mismatch.
- "It is raining heavily."
- "The bridge was built by a team of engineers."
- "She promised him to leave."
- "What did you say she claimed he had found?"
- "The contract seems to have been signed."
For sentence 5, note that it contains more than one mismatch-creating construction. Identify both and state which concept applies to each.
Pair task
Find five passive sentences in a news article or academic paper. For each sentence:
- Identify whether the Agent is expressed (by-phrase) or omitted.
- State the semantic roles of the surface subject and any other expressed arguments.
- Discuss whether the omission of the Agent is informative (e.g. Agent is unknown) or potentially problematic from an accountability perspective (e.g. deliberate concealment of responsibility).
Link your observations explicitly to #84 (passive as a grammatical process) and the AI Dimension discussion of agency in AI-generated text.
Group task
Compare argument drop in three languages: Japanese, Spanish, and one of Mandarin Chinese, Italian, or Turkish.
For each language, investigate:
- Which arguments can be dropped — subject only, object only, or both?
- What licensing conditions apply — agreement marking, contextual recoverability, construction type?
- How the definiteness of the dropped argument is determined (#97) — is the drop definite (recoverable antecedent) or indefinite (generic/unknown)?
Prepare a typological comparison table summarising your findings across the three languages. Then discuss: what specific challenges does each language's argument drop pattern pose for a cross-lingual coreference resolution system that must decide, for each dropped argument, whether to (i) search for a specific antecedent, (ii) flag as generic/indefinite, or (iii) treat as genuinely absent?
Review
- #83 — A variety of syntactic phenomena obscure the syntactic–semantic argument relationship
- #84 — Passive demotes the subject to oblique status; the next most prominent argument becomes subject
- #85 — Related constructions: anti-passive, impersonal passive, and middle
- #86 — English dative shift rearranges the syntactic positions of semantic arguments
- #87 — Morphological causatives add a Causer argument and shift the expression of the original subject
- #88 — Many (all?) languages have semantically empty words serving as syntactic glue
- #89 — Expletives fill syntactic argument positions with no associated semantic role
- #90 — Raising verbs provide a syntactic subject position with no local semantic role; the argument belongs to an embedded predicate
- #91 — Control verbs bind the subject (or object) as both a semantic argument and the understood subject of an embedded predicate
- #92 — Complex predicates distribute argument licensing across multiple predicates
- #93 — Coordination creates one-to-many and many-to-one dependency relations
- #94 — Long-distance dependencies separate arguments from their associated heads across arbitrary distances
- #95 — Some languages allow adnominal adjuncts to be separated from their head nouns
- #96 — Many (all?) languages allow argument drop; permissible drop varies by language and word class
- #97 — The referent of a dropped argument can be definite (contextually recoverable) or indefinite (generic/unknown)
Each mismatch type creates a different gap between surface form and deep meaning:
- Passive / impersonal passive / middle (#84, #85): the surface subject is the Patient, not the Agent; the Agent may be absent entirely. SRL systems must recover Agent and Patient regardless of voice.
- Expletives / raising / control (#88–#91): the surface subject may be semantically empty (expletives), may have no semantic role in the matrix clause (raising), or may bear roles in both the matrix and the embedded clause (control). Coreference systems must not link expletive pronouns to prior entities; SRL systems must assign roles to the correct predicate.
- Coordination and long-distance dependencies (#93, #94): shared arguments in coordination and displaced arguments in wh-constructions create non-local dependencies that simple positional heuristics cannot capture. Parsers require unbounded dependency mechanisms.
- Argument drop (#96, #97): absent surface arguments must either be resolved to a specific antecedent (definite drop) or recognised as generic (indefinite drop). Zero-anaphora resolution is a major unsolved challenge for Japanese, Chinese, and pro-drop languages.
NLP tasks affected: semantic role labelling, coreference resolution, machine translation, information extraction, question answering. All require processing that goes beyond surface syntactic analysis to recover the underlying semantic structure.
Proceed to Unit 10 when ready.