Unit 6: Parts of Speech

Learning Objectives

Define parts of speech distributionally and functionally, and explain why both definitions are needed
Identify open and closed class categories and describe their different behaviours in NLP
Explain why there is no universal set of parts of speech even among major categories
Analyse category ambiguity and its consequences for POS tagging systems

Reading

Read Chapter 6 (Parts of Speech) of Bender, E. M. (2013). Linguistic Fundamentals for Natural Language Processing: 100 Essentials from Morphology and Syntax. Morgan & Claypool. Use the course materials below to activate and consolidate the concepts from that chapter.

Core Input

Read through each tab. Take notes on the key ideas before moving to the activities.

Parts of speech (also called lexical categories or word classes) are grammatical categories that group words with similar syntactic behaviour. Two definitional approaches have been proposed, corresponding to Concepts #47 and #48.

Distributional definition (#47): Words belong to the same category if they appear in the same syntactic environments: they occupy the same positions in a sentence, take the same affixes, and appear with the same dependents. For English nouns, the distributional tests include: can follow a determiner (the ___), can take plural -s, can appear as the subject of a verb. For English verbs: can follow a modal auxiliary (can ___), can take past tense -ed, can appear with -ing.

Functional definition (#48): Nouns refer to entities or things; verbs refer to events or states; adjectives refer to properties. But functional definitions cannot be the whole story. Consider:

The arrival of the train was delayed. — arrival is a noun, but it names an event, not a thing.
His running away was unexpected. — running is a gerund (a noun), but it names an action.
She knows the answer. — knows is a verb, but it names a state, not an event.

Category is determined by distribution, not by content alone. Functional intuitions are useful heuristics but cannot replace distributional criteria.

Word classes are divided into open and closed classes based on whether new members can be added.

Open classes — nouns, verbs, adjectives, adverbs — can receive new members. New words enter these classes regularly: selfie (noun), vlog (noun/verb), tweet (verb), ghosting (noun/gerund). These are content words: they carry the primary semantic information in a sentence.

Closed classes — determiners, conjunctions, prepositions, pronouns, auxiliary verbs — have fixed membership. New members are rarely added. These are function words or grammatical words: they signal grammatical relationships rather than referential content.

The open/closed distinction matters for NLP in several ways:

Closed class words are highly frequent: function words dominate token counts even though the vocabulary of function words is tiny.
Open class words carry most of the semantic content, so removing function words often loses little information. This is not always safe: dropping a negator such as not reverses the meaning of a sentence.
The distinction underlies stop-word removal, TF-IDF weighting, and other information-retrieval techniques where high-frequency, low-content words are down-weighted or removed.
New words entering open classes (neologisms, named entities, social media slang) create out-of-vocabulary challenges for NLP systems trained on fixed corpora.
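The frequency and stop-word points above can be made concrete with a small Python sketch. The function-word list here is a deliberately tiny, hypothetical stand-in for a real stop-word list, and the sample sentences are illustrative only.

```python
from collections import Counter

# Hypothetical, deliberately tiny closed-class list; real stop-word lists are longer.
FUNCTION_WORDS = {"the", "a", "of", "and", "to", "in", "was", "is", "by", "not"}

def function_word_share(text):
    """Return (token_share, type_share) of function words in `text`."""
    tokens = text.lower().split()
    counts = Counter(tokens)
    function_tokens = sum(n for word, n in counts.items() if word in FUNCTION_WORDS)
    function_types = sum(1 for word in counts if word in FUNCTION_WORDS)
    return function_tokens / len(tokens), function_types / len(counts)

def remove_function_words(text):
    """Naive stop-word removal: keep only tokens outside the function-word list."""
    return [t for t in text.lower().split() if t not in FUNCTION_WORDS]

sample = "the arrival of the train was delayed by the storm and the driver was late"
token_share, type_share = function_word_share(sample)
print(token_share, type_share)   # 9 of 15 tokens, but only 5 of 11 distinct words

# Removing "not" along with the other function words drops the negation.
print(remove_function_words("the film is not good"))   # ['film', 'good']
```

In this toy sample, function words make up a larger share of tokens than of distinct word types, and the last line shows why stop-word removal is not always safe.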

POS tagging is the task of assigning a part-of-speech label to each token in a text. It is one of the earliest and most fundamental NLP preprocessing steps.

Major tagsets:

Penn Treebank — 36+ tags for English; widely used in classical NLP research. Distinguishes, for example, NN (noun singular), NNS (noun plural), NNP (proper noun singular), VB (verb base form), VBD (verb past tense), JJ (adjective).
Universal Dependencies (UD) defines 17 universal tags designed to work across 100+ languages: NOUN, PROPN, VERB, AUX, ADJ, ADV, DET, PRON, ADP, CCONJ, SCONJ, PART, NUM, INTJ, PUNCT, SYM, and X.
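To see how the two tagsets differ in granularity, here is a minimal Python sketch that collapses the six Penn Treebank tags named above into UD tags. The mapping is partial and illustrative, not an official conversion table. In particular, it ignores auxiliaries, which carry verb tags in the Penn Treebank but are tagged AUX in UD, so a real conversion cannot rely on the tag alone.

```python
# Partial, illustrative mapping covering only the Penn Treebank tags named above.
PTB_TO_UD = {
    "NN": "NOUN",    # noun, singular
    "NNS": "NOUN",   # noun, plural: UD POS tags do not encode number
    "NNP": "PROPN",  # proper noun, singular
    "VB": "VERB",    # verb, base form (assumes a lexical verb, not an auxiliary)
    "VBD": "VERB",   # verb, past tense (same assumption)
    "JJ": "ADJ",     # adjective
}

def to_ud(ptb_tag):
    """Return the UD tag for a Penn Treebank tag, or None if it is not in the sketch."""
    return PTB_TO_UD.get(ptb_tag)

tagged = [("dogs", "NNS"), ("barked", "VBD")]
coarse = [(word, to_ud(tag)) for word, tag in tagged]
print(coarse)   # [('dogs', 'NOUN'), ('barked', 'VERB')]
```

Several Penn Treebank tags map to one UD tag, so the conversion loses information such as number and tense that the finer tagset encodes in the label itself.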

Applications: parsing, named entity recognition, machine translation, information extraction, text-to-speech, and grammar checking.

Approaches:

Rule-based taggers — use explicit morphological and syntactic rules; highly accurate on expected input but brittle on novel text.
Statistical taggers — Hidden Markov Models (HMMs), Maximum Entropy Markov Models (MEMMs), Conditional Random Fields (CRFs); learn tag transition probabilities from annotated data.
Neural taggers — BiLSTM + CRF, transformer-based models; state of the art on standard benchmarks.
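As an illustration of the statistical approach, the following Python sketch runs Viterbi decoding over a toy Hidden Markov Model. All probabilities are hand-set assumptions chosen for the example, not values estimated from a corpus, and the three-tag inventory is a simplification.

```python
import math

TAGS = ["DET", "NOUN", "VERB"]

# Hand-set toy probabilities (assumptions for illustration only).
START = {"DET": 0.6, "NOUN": 0.2, "VERB": 0.2}
TRANS = {  # TRANS[previous_tag][current_tag]
    "DET":  {"DET": 0.05, "NOUN": 0.9, "VERB": 0.05},
    "NOUN": {"DET": 0.1,  "NOUN": 0.3, "VERB": 0.6},
    "VERB": {"DET": 0.6,  "NOUN": 0.3, "VERB": 0.1},
}
EMIT = {  # EMIT[tag][word]
    "DET":  {"the": 1.0},
    "NOUN": {"light": 0.3, "guests": 0.4, "candle": 0.3},
    "VERB": {"light": 0.6, "arrive": 0.4},
}
FLOOR = 1e-6  # small probability for word/tag pairs missing from EMIT

def emit(tag, word):
    return math.log(EMIT[tag].get(word, FLOOR))

def viterbi(words):
    """Return the most probable tag sequence for a non-empty list of words."""
    # Each column maps tag -> (log probability of best path ending here, previous tag).
    best = [{t: (math.log(START[t]) + emit(t, words[0]), None) for t in TAGS}]
    for word in words[1:]:
        column = {}
        for tag in TAGS:
            prev = max(TAGS, key=lambda p: best[-1][p][0] + math.log(TRANS[p][tag]))
            score = best[-1][prev][0] + math.log(TRANS[prev][tag]) + emit(tag, word)
            column[tag] = (score, prev)
        best.append(column)
    # Follow the back-pointers from the best final tag to the start.
    tag = max(TAGS, key=lambda t: best[-1][t][0])
    path = [tag]
    for column in reversed(best[1:]):
        tag = column[tag][1]
        path.append(tag)
    return path[::-1]

print(viterbi("the light".split()))                   # ['DET', 'NOUN']
print(viterbi("the guests light the candle".split())) # ['DET', 'NOUN', 'VERB', 'DET', 'NOUN']
```

Although the toy model makes light more likely as a verb than as a noun in isolation, the transition probabilities let the preceding tag decide: after a determiner it is tagged NOUN, after a noun it is tagged VERB.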

Accuracy: ~97% on well-formed English text; drops significantly for social media, historical text, and non-standard dialects. Ambiguity is the core challenge (#47): the same word can belong to different categories in different contexts. Tagsets designed for one language are not automatically transferable to another (#49).

Key Concepts: Distributional and Functional Definitions (#47–#48)

Expand each item. Think about your answer before reading the explanation.

The distributional definition holds that words belong to the same category if they occupy the same syntactic positions, take the same inflectional affixes, and co-occur with the same dependents. Category membership is determined by behaviour in context, not by content.

English test environments for nouns:

Can follow a determiner: the ___, a ___, this ___
Can take plural -s: dogs, ideas, arrivals
Can function as the subject or object of a verb

English test environments for verbs:

Can follow a modal auxiliary: can ___, will ___, must ___
Take past tense -ed: walked, arrived
Appear with -ing: walking, arriving

Crucially, the distributional definition works for language description without requiring us to know what things, events, or properties "are" in any metaphysical sense. This is why linguists can describe languages they do not fully understand culturally: they identify categories from patterns of occurrence.

Cross-linguistic application: Each language must be described on its own distributional terms. The test environments for "noun" in Japanese differ from those in English; Japanese nouns are followed by case particles, not preceded by articles. Importing English categories wholesale to describe another language is a methodological error.

NLP relevance: Statistical and neural POS taggers implicitly apply distributional criteria — they learn, from annotated data, which syntactic contexts are associated with which labels. A tagger trained on English distributional patterns will not transfer reliably to a language with different distributional patterns.
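The idea that category membership can be read off patterns of occurrence can be sketched in a few lines of Python. This is a toy illustration under strong assumptions: the four-sentence corpus is invented, a word's "environment" is reduced to its immediate left and right neighbours, and similarity is plain Jaccard overlap. Real taggers learn far richer contextual features.

```python
from collections import defaultdict

def context_sets(sentences):
    """Map each word to the set of (left neighbour, right neighbour) pairs it occurs in."""
    contexts = defaultdict(set)
    for sentence in sentences:
        padded = ["<s>"] + sentence.split() + ["</s>"]
        for left, word, right in zip(padded, padded[1:], padded[2:]):
            contexts[word].add((left, right))
    return contexts

def jaccard(a, b):
    """Overlap between two context sets: 1.0 means identical, 0.0 means disjoint."""
    return len(a & b) / len(a | b)

# Invented toy corpus; no meanings are supplied, only word order.
corpus = ["the dog can run", "the cat can sleep", "the dog can sleep", "the cat can run"]
ctx = context_sets(corpus)

print(jaccard(ctx["dog"], ctx["cat"]))   # 1.0: same environments (the ___ can)
print(jaccard(ctx["dog"], ctx["run"]))   # 0.0: no shared environments
```

Here dog and cat pattern together, as do run and sleep, without the code knowing anything about what the words mean.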

The functional definition holds that nouns name entities, verbs name events or states, and adjectives name properties. This is a useful heuristic for quick identification. But the qualifier "but not metaphysically" in the full statement of concept #48 is critical: Bender is rejecting the schoolroom definition of a noun as "a person, place, or thing."

Why the functional definition cannot be the whole story:

Gerunds are nouns that name actions: Swimming is fun. (swimming is a gerund — it is a noun grammatically, but it names an action.)
Event nouns are nouns that name events: the explosion, the arrival, the discovery. These are grammatically nouns (they can follow a determiner, take plural -s), but they name events, not things.
Abstract nouns name abstract properties: beauty, justice, freedom. Not persons, places, or things in any intuitive sense.
Stative verbs name states, not events: know, believe, contain, resemble. These are grammatically verbs (they follow modals and inflect for past tense: believed, contained, knew), but they do not name events in the dynamic sense.

The functional definition serves as a first approximation — it is useful for pedagogical explanations and annotation guidelines — but the distributional definition is the authoritative criterion.

NLP relevance: Annotation guidelines for POS tagging often invoke functional intuitions to help annotators make initial decisions, but edge cases are always resolved by distributional tests. An NLP system that relies solely on semantic content to assign POS labels will make systematic errors on event nouns, gerunds, and stative verbs.

Key Concepts: Universality and Phrasal Categories (#49–#50)

Expand each item.

Even noun and verb — the two most basic categories across the world's languages — are not universally present as distinct morphosyntactic classes.

Noun/verb distinction:

In Mandarin Chinese, Riau Indonesian, and some analyses of Lao, words can appear in positions that would be occupied by both nouns and verbs in English, without any morphological change. The same form can appear as the head of what looks like an NP in one sentence and the head of a predicate in another.
This does not mean these languages have no grammar — it means the noun/verb distinction may not be the right descriptive tool for them.

Adjective:

In many languages (e.g. Yoruba, classical Chinese), what English expresses with adjectives is expressed with stative verbs. There is no separate adjectival class: "the house is big" is expressed with a verbal predicate meaning roughly "the house bigs."
Japanese has two distinct adjectival classes: i-adjectives (keiyōshi) which inflect for tense and polarity, and na-adjectives (keiyōdōshi) which take the particle na before nouns. These two classes behave very differently and have no direct English equivalent.

Preposition:

Many languages use postpositions (Japanese, Turkish, Hindi) or case-marking affixes (Finnish, Russian, Latin) to express relations that English expresses with prepositions. The category "adposition" covers both prepositions and postpositions, but it is not universal: some languages use neither.

NLP implication: Universal tagsets such as Universal Dependencies must make principled design choices about which categories to include, which to treat as language-specific subtypes, and how to handle languages that lack certain distinctions. These choices affect annotation quality and the reliability of cross-lingual model transfer. A multilingual model that assumes a universal noun/verb distinction may behave unpredictably on languages where that distinction is absent or gradient.

Just as individual words are assigned POS categories, so are phrases. A phrase's category is determined by its head word (the concept developed further in Unit 7):

A noun phrase (NP) has a noun as its head and distributes like a noun: it can appear as a subject, as an object, or as the complement of a preposition.
A verb phrase (VP) has a verb as its head and distributes like a verb — it can serve as the predicate of a sentence.
A prepositional phrase (PP) has a preposition as its head and can function as an adjunct to nouns or verbs, in the same positions as adjectives or adverbs.
An adjective phrase (AdjP) has an adjective as its head: very proud of her achievement distributes like a simple adjective.

This insight underlies both X-bar theory (in generative grammar) and constituent categories in phrase structure grammars: the grammatical category of a phrase is not arbitrary — it is inherited from and projected by its head.

NLP relevance: Phrase-level categories appear in phrase structure trees and dependency graphs. A parser that identifies an NP can treat it as a unit regardless of its internal complexity — whether the NP is just the dog or the large spotted dog that knocked over the bin, it distributes like a noun. Understanding this projection principle is fundamental to parsing, coreference resolution, and information extraction. Concept #50 also connects directly to attachment ambiguity: knowing whether a PP distributes as an adverb (VP adjunct) or as an adjective (NP adjunct) requires understanding phrasal categories.
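The projection idea can be sketched as a small Python data structure in which a phrase has no category of its own and simply reads it off its head. The class names and the mapping from UD-style word tags to phrase labels are assumptions made for this example, covering only the four phrase types listed above.

```python
from dataclasses import dataclass, field

# Assumed mapping from the head's POS tag to the phrase label it projects.
PROJECTION = {"NOUN": "NP", "VERB": "VP", "ADP": "PP", "ADJ": "AdjP"}

@dataclass
class Word:
    form: str
    pos: str

@dataclass
class Phrase:
    head: Word
    dependents: list = field(default_factory=list)

    @property
    def category(self):
        # The category depends only on the head, never on the dependents.
        return PROJECTION[self.head.pos]

simple = Phrase(Word("dog", "NOUN"), [Word("the", "DET")])
complex_np = Phrase(
    Word("dog", "NOUN"),
    [Word("the", "DET"), Word("large", "ADJ"), Word("spotted", "ADJ")],
)
print(simple.category, complex_np.category)   # NP NP
```

Adding dependents changes the phrase's internal complexity but not its category, which is why a parser can treat both phrases as the same kind of unit.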

Worked Examples

Study each worked example. Connect each case to the relevant concept numbers.

Many English words belong to multiple categories. Category membership in a given token is determined by its distributional context — syntactic position, co-occurring words, morphological form — not by the word form itself (#47).

Selected examples:

Word	As noun	As verb	As adjective	As adverb
light	the light	to light the fire	a light jacket	—
run	a home run	to run a mile	—	—
fast	a three-day fast	to fast (abstain from food)	a fast car	run fast
round	a boxing round	to round the corner	a round table	come round
close	the close of the day	to close the door	close range	live close by
well	a water well	tears welled up	—	done well
back	the back of the chair	to back the proposal	the back door	step back
mean	the mean (average)	to mean something	a mean remark	—
still	a distilling still	to still one's fears	still water	still running
work	the work	to work	—	—

Disambiguation requires context. Unigram taggers — which always assign the most frequent tag for a word regardless of context — achieve approximately 90% accuracy on standard English text. Context-sensitive taggers that examine surrounding words achieve approximately 97%. The gap represents ambiguous words where the most frequent tag is wrong in a substantial number of contexts.
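The limitation of a unigram tagger can be seen in a minimal Python sketch. The three training sentences are a hypothetical hand-tagged toy corpus, not real treebank data, and the default tag for unknown words is an arbitrary choice.

```python
from collections import Counter, defaultdict

# Hypothetical hand-tagged toy corpus (UD-style tags).
TRAIN = [
    [("the", "DET"), ("light", "NOUN"), ("was", "AUX"), ("faint", "ADJ")],
    [("a", "DET"), ("light", "NOUN"), ("flickered", "VERB")],
    [("please", "INTJ"), ("light", "VERB"), ("the", "DET"), ("candle", "NOUN")],
]

def train_unigram(sentences):
    """Map each word to the tag it carries most often in the training data."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        for word, tag in sentence:
            counts[word][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

def tag_unigram(model, words, default="NOUN"):
    """Tag each word in isolation; unknown words get the default tag."""
    return [(word, model.get(word, default)) for word in words]

model = train_unigram(TRAIN)
tagged = tag_unigram(model, ["please", "light", "the", "candle"])
print(tagged[1])   # ('light', 'NOUN'), although light is a verb in this sentence
```

Because light is a noun in two of its three training occurrences, the unigram tagger labels it NOUN everywhere, including the imperative sentence it was trained on. A context-sensitive tagger can use the surrounding words to recover the verb reading.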

Japanese has a well-developed POS system with several categories that have no direct English equivalent — illustrating that even "adjective" is not a universal category with uniform properties (#49).

I-adjectives (形容詞 keiyōshi):

I-adjectives inflect for tense and polarity, much like verbs. They conjugate in a way that English adjectives do not.

おいしい → おいしかった

oishii → oishikatta

delicious → was-delicious

“delicious” → “was delicious”

Notice: the tense information is incorporated into the adjective itself — there is no separate copula (be) in the past affirmative form.

Na-adjectives (形容動詞 keiyōdōshi):

Na-adjectives do not inflect in the same way. When modifying a noun, they require the particle na.

静かな町

shizuka-na machi

quiet-NA town

“a quiet town”

The two adjective classes have different distributions and different morphological behaviour — they are not merely variants of the same category. NLP systems trained on English adjective patterns must be substantially redesigned to handle Japanese adjectival morphology.
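The contrast between the two classes can be sketched in Python using the two examples above. This is a simplified illustration, not a morphological analyser: it handles only the plain past of i-adjectives written in kana and the attributive form of na-adjectives, and the function names are hypothetical. Note that the final い alone does not identify the class, since some na-adjectives also end in い, so the caller is assumed to know which class a word belongs to.

```python
def i_adjective_past(adjective):
    """Plain past of an i-adjective: drop the final い and add かった."""
    if not adjective.endswith("い"):
        raise ValueError("expected an i-adjective ending in い")
    if adjective == "いい":
        # Irregular: いい 'good' builds its past on the older form よい.
        return "よかった"
    return adjective[:-1] + "かった"

def na_adjective_attributive(stem):
    """Attributive form of a na-adjective: the stem itself does not change."""
    return stem + "な"

print(i_adjective_past("おいしい"))                  # おいしかった
print(na_adjective_attributive("静か") + "町")       # 静かな町
```

The i-adjective changes its own ending to mark tense, while the na-adjective stays fixed and relies on a following particle, which is the distributional difference described above.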

The Universal Dependencies (UD) project provides a cross-lingual annotation framework with 17 universal POS tags, designed to cover 100+ languages.

The 17 UD tags:

Tag	Name	Example (English)
NOUN	Noun	dog, city, justice
VERB	Verb	run, know, have
ADJ	Adjective	fast, old, beautiful
ADV	Adverb	quickly, very, now
DET	Determiner	the, a, this
PRON	Pronoun	she, they, it
ADP	Adposition	in, on, at, after
AUX	Auxiliary verb	can, will, have (perfect)
CCONJ	Coordinating conjunction	and, but, or
SCONJ	Subordinating conjunction	that, if, because
PART	Particle	's, not, to (infinitive)
NUM	Numeral	three, 42, VII
INTJ	Interjection	oh, wow, yes
PROPN	Proper noun	London, Google, Maria
PUNCT	Punctuation	. , ! ?
SYM	Symbol	$, %, @, =
X	Other / foreign	unanalysable tokens

Challenges of universality (#49):

Some categories (e.g. DET, PART) are absent in some languages — annotators must decide how to handle words that perform the function without fitting the category.
Language-specific subtypes are needed for many phenomena, requiring extensions to the universal framework.
Annotation decisions within the universal framework are sometimes made for consistency across languages, not because the universal analysis is linguistically optimal for any individual language.

Despite these challenges, UD corpora cover over 100 languages and have enabled significant advances in cross-lingual NLP. But the universality of the categories is an idealisation — a practical engineering choice, not a linguistic claim that all languages have the same grammatical categories.

Check Your Understanding

Select the best answer for each question.

The word 'fast' in 'She runs fast' is categorised as an adverb; in 'a fast car' it is an adjective. Which concept explains why this distinction is made based on syntactic position and morphological behaviour rather than meaning?

#48: the functional definition of parts of speech
#47: the distributional definition of parts of speech
#49: there is no universal set of parts of speech
#50: POS extends to phrasal constituents

A multilingual NLP researcher is annotating a corpus of Mandarin Chinese with Universal Dependencies POS tags. She finds that many words can appear in positions that would be occupied by both nouns and verbs in English, without any morphological change. Which concept best explains this challenge?

#47: the distributional definition of parts of speech
#48: the functional definition of parts of speech
#49: there is no one universal set of parts of speech
#50: POS extends to phrasal constituents

AI Dimension

Parts-of-speech concepts from this unit connect to several active areas of NLP and LLM research.

Implicit category learning. LLMs learn POS-like representations implicitly during pre-training on large text corpora. Probing studies show that attention heads and internal representations encode distributional properties that closely resemble POS categories — without any explicit POS supervision. But these representations are learned from English-dominated data and may not accurately encode the distributional patterns of under-represented languages (#47, #49).
POS tagging and distributional shift. Neural POS taggers are highly accurate on standard text but degrade sharply on non-standard domains (social media, historical text, learner language). This is a form of distributional shift: the distributional cues that define a category in news text (the contexts in which a word appears, the affixes it takes, its co-occurrence partners) may be absent or altered in social media. The drop in accuracy follows directly from the distributional definition of categories (#47).
Cross-lingual category transfer. Multilingual models trained on Universal Dependencies corpora can transfer POS-tagging ability across languages with varying success. But the reliability of this transfer depends on whether the categories in the tagset are genuinely universal — an assumption that concept #49 shows is an idealisation. For languages where adjective is a verbal category, or where noun and verb are not morphosyntactically distinct, the UD tags are approximations, and the transferred model inherits the approximation's errors (#49, #50).
Ambiguous tokenisation and POS. Subword tokenisation (introduced in Unit 2) splits words into units such as run + ning or un + expect + ed. POS labels belong to words, not subword tokens: assigning POS to subword units creates systematic errors for morphologically complex words. This is a direct consequence of concept #47 — distributional test environments apply to words, and the same tests do not apply to arbitrary subword fragments (#47, #50).
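One common workaround for the mismatch between words and subword tokens is to keep POS labels at the word level and attach each label to the first subword only. The Python sketch below illustrates that alignment. The segmentations are hypothetical stand-ins for the output of a trained subword tokeniser, and the tokeniser is assumed to return at least one piece per word.

```python
# Hypothetical segmentations standing in for a trained subword tokeniser.
TOY_SEGMENTS = {"running": ["run", "ning"], "unexpected": ["un", "expect", "ed"]}

def toy_tokenize(word):
    """Return the subword pieces for a word; unknown words stay whole."""
    return TOY_SEGMENTS.get(word, [word])

def align_labels(words, tags, tokenize):
    """Give each word's tag to its first subword and mark the rest with None."""
    pieces, labels = [], []
    for word, tag in zip(words, tags):
        subwords = tokenize(word)
        pieces.extend(subwords)
        labels.extend([tag] + [None] * (len(subwords) - 1))
    return pieces, labels

pieces, labels = align_labels(
    ["she", "was", "running"],
    ["PRON", "AUX", "VERB"],
    toy_tokenize,
)
print(pieces)   # ['she', 'was', 'run', 'ning']
print(labels)   # ['PRON', 'AUX', 'VERB', None]
```

The fragment ning receives no label of its own, which reflects the point that distributional tests apply to whole words rather than to arbitrary subword pieces.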

Activities

Individual task

For each of the following sentences, identify the POS of the target word (light in sentences 1 to 3, running in sentences 4 and 5) using distributional criteria: syntactic position, morphological behaviour, and co-occurrence with determiners or auxiliaries. Explain your reasoning for each.

The light at the end of the tunnel was faint.
Please light the candle before the guests arrive.
She is wearing a light jacket today.
The team's running game improved significantly this season.
She was running late when the meeting started.

For each item: (a) state the POS label you assign; (b) identify which distributional test(s) support your decision; (c) state whether the functional definition (#48) would give the same answer or a different one.

Pair task

Use any available NLP tool or online POS tagger to tag a paragraph of 10 sentences of your choice (news article, a passage from a novel, or a set of tweets).

Examine the tagged output and identify three words that have been assigned a tag you disagree with. For each word:

State what tag was assigned by the tagger.
State what tag you would assign.
Cite the specific distributional criteria (#47) that support your choice — what syntactic position is the word in? Does it take an affix? Does it appear with a determiner or auxiliary?
Suggest whether the error is due to category ambiguity, an unusual construction, domain mismatch, or another factor.

Compare your findings with your partner. Did you identify the same errors? Where you disagreed, work out which analysis is better supported by distributional evidence.

Group task

Design a minimal POS annotation scheme for a language other than English. Choose from: Japanese, Mandarin Chinese, Turkish, or Arabic.

Your scheme should be presented as a table with the following columns:

Category label — the name you are assigning
Distributional definition — what syntactic positions and morphological properties define membership in this category in your chosen language
Example — one word or phrase in the language with gloss and translation
UD equivalent — the nearest Universal Dependencies tag, and whether it is an exact match or an approximation
Relevant concept — which of #47, #48, #49, or #50 is most relevant to the decisions made for this category

Be prepared to present: which UD categories apply straightforwardly to your language, which require adaptation, and which English categories are absent or merged in your chosen language.

Review

#47 — Parts of speech can be defined distributionally. Words belong to the same POS category if they appear in the same syntactic positions, take the same inflectional affixes, and co-occur with the same dependents. For English nouns: follow a determiner, take plural -s, function as subject or object. For English verbs: follow a modal auxiliary, take past tense -ed, appear with -ing. The distributional definition is the authoritative criterion for category assignment; it applies to any language without requiring external knowledge of what words "mean." NLP taggers implicitly apply distributional criteria learned from annotated corpora.
#48 — Parts of speech can also be defined functionally (but not metaphysically). Nouns name entities, verbs name events or states, adjectives name properties — but these are heuristics, not definitions. Gerunds are nouns that name actions; event nouns are nouns that name events; stative verbs name states, not events. The schoolroom definition of a noun as "a person, place, or thing" fails for gerunds, event nouns, and abstract nouns. Functional intuitions inform annotation guidelines but distributional criteria decide edge cases.
#49: There is no one universal set of parts of speech, even among the major categories. Even noun and verb are not universally distinct morphosyntactic classes. Mandarin Chinese and Riau Indonesian have been argued to lack a syntactic noun/verb distinction for many words. In many languages (Yoruba, Classical Chinese) what English expresses with adjectives is expressed with verbs. Japanese has two adjectival classes (i-adjectives and na-adjectives) with different distributions. Postpositions (Japanese, Turkish) and case affixes (Finnish, Russian) do the work that prepositions do in English. Universal tagsets like Universal Dependencies are practical engineering approximations, not claims of linguistic universality.
#50 — Part of speech extends to phrasal constituents. Phrases have categories projected from their head words: an NP distributes like a noun, a VP like a verb, a PP like an adjective or adverb depending on context. This projection principle is the foundation of both phrase structure grammar and dependency grammar. In NLP, it is why parsers can treat a complex NP as a single unit regardless of its internal structure, and why POS misidentification propagates into parsing errors and attachment ambiguity.

POS tagging assigns a grammatical category label to each token in a text. It is a foundational NLP preprocessing step used in parsing, named entity recognition, machine translation, information extraction, and text-to-speech.

Ambiguity is the core challenge: many words in English (and other languages) belong to multiple categories, and the correct category for any given token can only be determined from context. Context-sensitive taggers achieve ~97% accuracy on standard English text; unigram taggers achieve ~90%.

Open vs closed classes interact with tagging challenges differently. Closed class words are highly frequent but their tags are relatively stable. Open class words carry semantic content and are ambiguous; they also expand through neologism and borrowing, creating out-of-vocabulary challenges.

Cross-linguistic challenges arise from concept #49: tagsets developed for one language (typically English) are not directly portable to languages with different morphosyntactic systems. Universal Dependencies represents the most systematic attempt to address this, with 17 universal tags and a framework for language-specific subtypes, but the universality is an idealisation — annotation decisions for any given language involve approximation. Cross-lingual model transfer inherits these approximations and may introduce systematic errors for languages whose category systems differ substantially from English.

Domain effects further reduce accuracy on social media, historical text, and learner language, where the distributional cues that define categories in standard text are absent, altered, or overridden by genre-specific conventions.

Proceed to Unit 7: Heads, Arguments, and Adjuncts when ready.
