Mapping the Canon: Quantitative Approaches to Literary History · Darmstadt · June 18–19, 2026

Evolution of the Poetic Canon in the Princeton Prosody Archive

01

Canonicity through use

Designs by Mr. R. Bentley for Six Poems by Mr T. Gray (London, 1766) — Gray’s Elegy Written in a Country Churchyard opening. Library collection, Anglesey Abbey, Cambridgeshire · 12.J.23

Gray’s Elegy, in pieces

Sheridan & Henderson (1796), p. 3: Elegy lines used for elocution
Gardiner (1856), p. 51: Elegy opening set to musical notation
Breen (1857), p. 231: opening line traced to Dante (plagiarism)
Brooke (1880), p. 22: couplet silently rewritten
Skeat (1898), p. 488: single Elegy lines as metrical types (scansion)
Hyde (1906), p. 87: Elegy fragment in a grammar exercise

Rarely the whole poem—usually a line someone can teach with.

Authority made through use—lines cut, copied, and repurposed.

“le jeu du découpage et du collage” Compagnon, 1979

02

The Princeton Prosody Archive

The scattered history of poetic instruction

What it gathers
Prosody manuals, elocution handbooks, grammar books, schoolbooks on versification, etc.
Span
1532–1929 · searchable · public domain
Scale
~7,000 works · ~2 million pages · full text
Collections
Six, hand-curated — chiefly Literary & Linguistic

[Figure: pages per year in the Princeton Prosody Archive]

03

Canonicity traced in the PPA

An archive of use

Not an anthology of great literature—books that make verse legible, audible, and teachable.

Poets as instruments. Names invoked to settle a rule—which poets become usable, repeatable common ground when writers explain how verse works.

Their lines. Lines quoted to make a point concrete—cut into couplets and stanzas, made to do explanatory work, and able to travel without their poems.

3.1

Counting names

with Petr Plecháč & Artjoms Šeļa  ·  [forthcoming]

Who gets named?

A poet’s name is not a stable search object — so we resolve every mention to one identity.

  1. Surface forms

    • “Shakspeare”
    • “Schakespeyr”
    • “Shaxpere”
    • “Shakpeare”
    • “the sweet swan of Avon”
  2. Neural linking

    ReFinED: mention × context
  3. Knowledge base

    William Shakespeare (Wikidata Q692)
  4. Countable signal

    737,194 mentions → 3,164 poets

Method

From name strings to poets

Named-entity recognition only says “Shelley is a person.” Entity linking says which Shelley — Mary or Percy Bysshe — and pulls “Mrs. Shelley,” “Mary Wollstonecraft Shelley,” and the bare surname onto one identity across two centuries.

  • the linker: ReFinED, zero-shot. A neural model that detects and disambiguates each mention jointly from context, mapping it to a Wikidata QID (Dante → Q1067) with a 0–1 confidence. No corpus-specific training; ~1 day on an A100 over 1.68M pages.
  • the Wikidata trap: Linking is only as clean as the knowledge base. Wikidata files Friedrich Fröbel (the kindergarten founder) and Elizabeth Barrett Browning (d. 1861) among “soccer players” and “screenwriters”, so the poet set must be filtered, not taken raw.
  • how accurate? ~98% at the 0.90 threshold. On 50 hand-annotated pages (3 annotators), 56 of 57 high-confidence links were correct (98.2%). The system errs toward silence: its misses are hard OCR forms (“Bobert Bums” → Burns, “THEOCRITUS”), not fabricated poets.
  • the cascade: From spans to a poet universe. 23.9M mention spans → 11.6M linked → 92,823 humans → poet-flagged (Q49757) → conf. ≥0.90 → seen in ≥2 works → born ≤1900. Noise lives in singletons: the stability gate (seen in ≥2 works) drops 34% of names but only 0.4% of mentions.

Final poet universe: 3,164 poets · 737,194 mentions across 3,149 works, 1559–1929
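The last gates of the cascade can be sketched in a few lines. This is a minimal illustration, not the project's pipeline: the `Mention` record and its field names are hypothetical, the "human" gate is skipped, and dropping entities with no birth year is an assumption the slides do not state.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional

POET_QID = "Q49757"  # the poet flag named in the cascade


@dataclass(frozen=True)
class Mention:
    """One linked mention; all field names are hypothetical."""
    qid: str                   # Wikidata QID chosen by the linker
    work_id: str               # work in which the mention occurs
    confidence: float          # linker confidence in [0, 1]
    occupations: frozenset     # occupation QIDs of the linked entity
    birth_year: Optional[int]  # None when the knowledge base has no date


def poet_universe(mentions, min_conf=0.90, min_works=2, max_birth=1900):
    """Apply the gates in order: poet flag, confidence, stability, birth year."""
    # Gates 1 and 2: confident links to entities flagged as poets.
    confident = [m for m in mentions
                 if POET_QID in m.occupations and m.confidence >= min_conf]
    # Gate 3 (stability): the poet must be seen in at least `min_works` works.
    works_by_poet = defaultdict(set)
    for m in confident:
        works_by_poet[m.qid].add(m.work_id)
    stable = {qid for qid, works in works_by_poet.items()
              if len(works) >= min_works}
    # Gate 4: birth-year cutoff (missing dates dropped, an assumption).
    return [m for m in confident
            if m.qid in stable
            and m.birth_year is not None
            and m.birth_year <= max_birth]


# Toy data: only the two confident Shakespeare mentions should survive.
demo = [
    Mention("Q692", "w1", 0.99, frozenset({POET_QID}), 1564),
    Mention("Q692", "w2", 0.95, frozenset({POET_QID}), 1564),
    Mention("Q692", "w3", 0.40, frozenset({POET_QID}), 1564),   # low confidence
    Mention("Q1067", "w1", 0.97, frozenset({POET_QID}), 1265),  # one work only
    Mention("Q_PLACEHOLDER", "w1", 0.99, frozenset({"Q_OTHER"}), 1800),  # not a poet
    Mention("Q_PLACEHOLDER", "w2", 0.99, frozenset({"Q_OTHER"}), 1800),
]
survivors = poet_universe(demo)
```

Because the stability gate counts distinct works rather than mentions, a name that appears many times inside a single work is still treated as a singleton.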

A superstar economy

A handful of poets carry most of the attention.

Share of all mentions: top 1% → 58.9% · top 5% → 82.7% · top 10% → 90.0% · top 25% → 96.3%

The top 1% — just 32 poets — account for 59% of all mentions (Gini 0.92). Shakespeare, Milton, Chaucer, Homer, Virgil at the head.
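Both concentration figures are simple functions of the per-poet mention counts. The sketch below shows one way to compute them; rounding the group size to the nearest poet is an assumption (it does give 32 for 1% of 3,164), and the toy counts are illustrative, not the archive's.

```python
def top_share(counts, fraction):
    """Share of all mentions held by the top `fraction` of poets."""
    ranked = sorted(counts, reverse=True)
    k = max(1, round(len(ranked) * fraction))  # group size, rounded (assumption)
    return sum(ranked[:k]) / sum(ranked)


def gini(counts):
    """Gini coefficient of non-negative counts: 0 is equal, near 1 is concentrated."""
    xs = sorted(counts)
    n = len(xs)
    total = sum(xs)
    # Rank-weighted form: G = 2 * sum(i * x_i) / (n * total) - (n + 1) / n,
    # with ranks i = 1..n over the ascending sort.
    weighted = sum(i * x for i, x in enumerate(xs, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


toy_counts = [90, 4, 3, 1, 1, 1]         # one dominant poet, illustrative only
half_share = top_share(toy_counts, 0.5)  # top three poets hold 97 of 100 mentions
```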

A canon in motion

The leaderboard reshuffles for two centuries — yet a few anchors never leave the head.

Cumulative distinct works, by decade · top 10 · PPA Literary collection

but a name isn’t a line

Being named is not the same as having your words travel.

3.2

Counting lines

with Meredith Martin, Rebecca Sutton Koeser & Laure Thompson  ·  [under review]

What gets quoted?

A quoted poem is not a search object — so instead of searching, we let the archive collide with nearly all of English poetry, and see what aligns.

  1. Quoted fragment (source corpus)

    The first line, “The curfew tolls the knell of parting day,” has been adopted from Dante’s Purgatorio.

    a PPA host work, 1857

    Princeton Prosody Archive: 7,061 works
  2. Text reuse

    passim: n-gram alignment
  3. Reference poem (reference corpus)

    Gray’s Elegy

    Chadwyck-Healey: 246,724 reference poems
  4. Countable signal

    252,893 distinct appearances of 37,406 poems across 3,962 PPA host works

A million alignments → an analytic corpus

passim finds far too much. Most of the method is deciding what counts.

  1. Raw passim alignments

    1,263,587

  2. Poem appearances (poem × host work)

    509,030

  3. Literary + Linguistic · reprints collapsed · 1694–1920

    274,206

  4. Implausible & misattributed matches removed

    252,893

252,893 poem appearances · 37,406 poems · 3,962 host works
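The step from raw alignments to poem appearances is a deduplication on the pair (poem, host work). A minimal sketch, assuming each alignment record carries a poem identifier and a host-work identifier; the key names and the toy records are hypothetical.

```python
from collections import Counter


def poem_appearances(alignments):
    """Collapse raw alignment records to distinct (poem, host work) pairs."""
    return {(a["poem_id"], a["host_work_id"]) for a in alignments}


def host_works_per_poem(appearances):
    """Number of distinct host works in which each poem appears."""
    return Counter(poem_id for poem_id, _ in appearances)


# Toy records: one quotation split across two alignment records in work-1.
raw = [
    {"poem_id": "poem-A", "host_work_id": "work-1"},
    {"poem_id": "poem-A", "host_work_id": "work-1"},
    {"poem_id": "poem-A", "host_work_id": "work-2"},
    {"poem_id": "poem-B", "host_work_id": "work-2"},
]
appearances = poem_appearances(raw)
per_poem = host_works_per_poem(appearances)
```

Four raw records become three appearances here, the same kind of shrinkage as between the first two stages of the funnel.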

Method

Finding quotations at scale

A quotation is an asymmetric match: we align the PPA, where quotations surface, against a reference body of poetry — the Chadwyck-Healey English Poetry Database, deduplicated to 246,724 reference poems, plus Shakespeare’s plays, which Chadwyck-Healey leaves out.

  • the tool: passim. Built for the Viral Texts project to find reprinted passages across noisy newspaper OCR. It needs no prior guess about which texts align, and is robust to OCR error and ragged passage boundaries.
  • tuned for verse: Short quotes, not long reprints. passim’s defaults assume long documentary reuse, so we loosened them (n-gram 25→15, min length 50→25, gap 600→300, max DF 100→10,000), with final values chosen against a hand-annotated gold set.
  • the unit: Poem appearance, not raw record. OCR noise and paraphrase split one visible quotation into several alignment records, so we count a poem’s presence in a host work once, the level at which every question here operates.
  • then filter: Matches aren’t yet a corpus. Host-side filtering (Literary/Linguistic collections, reprints collapsed, 1694–1920) and poem-side filtering (centos & adaptations removed by hand, a biographical-plausibility cut) follow; the resulting counts are those in the analytic-corpus funnel one slide up.

1,263,587 raw alignments → 509,030 appearances → 252,893 appearances of 37,406 poems across 3,962 host works
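To make the idea of n-gram matching concrete, here is a deliberately simplified sketch of candidate detection by shared character n-grams. It is not passim's algorithm: passim goes on to align and extend passages, and the normalization, the function names, and the reading of the n-gram setting as character n-grams are all assumptions.

```python
import re


def char_ngrams(text, n=15):
    """Character n-grams over lowercased text with non-alphanumerics removed."""
    norm = re.sub(r"[^a-z0-9]", "", text.lower())
    return {norm[i:i + n] for i in range(len(norm) - n + 1)}


def shared_ngrams(host_text, reference_text, n=15):
    """N-grams found in both texts; a non-empty result marks a candidate pair."""
    return char_ngrams(host_text, n) & char_ngrams(reference_text, n)


host = ("The first line, “The curfew tolls the knell of parting day,” "
        "has been adopted from Dante’s Purgatorio.")
reference = ("The curfew tolls the knell of parting day, "
             "The lowing herd wind slowly o'er the lea,")
unrelated = "Of Man's first disobedience, and the fruit Of that forbidden tree"

hit = shared_ngrams(host, reference)
miss = shared_ngrams(host, unrelated)
```

Stripping punctuation and case before shingling is what lets the quoted line match despite the host's quotation marks; a shorter n finds briefer quotations at the cost of more chance matches, which is why the later filtering matters.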

The most-hosted poems

# | Work | Author | Appearances | In the PPA
1 | Paradise Lost | John Milton | 1,326 | 1679–1929
2 | Hamlet | William Shakespeare | 1,287 | 1702–1929
3 | Psalms | King James Bible | 1,145 | 1644–1928
4 | Julius Caesar | William Shakespeare | 909 | 1702–1929
5 | Macbeth | William Shakespeare | 877 | 1702–1929
6 | The Merchant of Venice | William Shakespeare | 844 | 1718–1928
7 | An Essay on Criticism | Alexander Pope | 833 | 1712–1929
8 | Proverbs | King James Bible | 757 | 1644–1927
9 | Elegy Written in a Country Churchyard | Thomas Gray | 735 | 1756–1929
10 | Job | King James Bible | 698 | 1644–1927
11 | Don Juan | Lord Byron | 694 | 1810–1929
12 | Christus: A Mystery | Henry Wadsworth Longfellow | 687 | 1825–1929
⋯  38,898 more  ·  42% quoted just once  ·  Gini 0.72  ⋯
38,911 | The Mohawk | John D. M’Kinnon | 1 | 1909
38,912 | Seldom “can’t” | Christina Georgina Rossetti | 1 | 1915
38,913 | Greater Memory | Arthur William Edgar O’Shaughnessy | 1 | 1916
38,914 | To a Cricket | William Cox Bennett | 1 | 1912

When a match lies

“Original Poetry”

passim finds real text reuse — but a reference “poem” isn’t always original. Chadwyck-Healey’s long tail of the mute inglorious hides centos and adaptations.

Ode to the Human Heart — Laman Blanchard, 1842

  • Blind Thamyris, and blind Mæonides, (Milton)
  • Pursue the triumph and partake the gale! (Pope)
  • Drop tears as fast as the Arabian trees, (Shakespeare)
  • To point a moral or adorn a tale. (Johnson)

Canto LXXX — Ezra Pound, 1917

‘The evil that men do lives after them’ (Shakespeare)

well, that is from Julius Caesar

unless memory trick me

who crossed the Rubicon up near Rimini

So we hand-remove centos & adaptations, then apply a biographical-plausibility filter — no work can quote a poet not yet born (it discards Pound, b. 1885, correctly).
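The biographical-plausibility rule reduces to a comparison of two years. A minimal sketch, assuming each match carries the host work's publication year and the reference poet's name; the record keys are hypothetical, the pairing of host year and poet is toy data, and keeping poets of unknown birth year is an assumption.

```python
def plausible_matches(matches, birth_years):
    """Keep matches whose host work was published no earlier than the poet's birth."""
    kept = []
    for m in matches:
        born = birth_years.get(m["poet"])
        # Unknown birth year: kept here (assumption, the slides do not say).
        if born is None or m["host_year"] >= born:
            kept.append(m)
    return kept


birth_years = {"Thomas Gray": 1716, "Ezra Pound": 1885}
matches = [
    {"poet": "Thomas Gray", "host_year": 1857},  # plausible
    {"poet": "Ezra Pound", "host_year": 1857},   # host predates the poet's birth
]
kept = plausible_matches(matches, birth_years)
```

The rule catches exactly the case shown on this slide: a later poem that itself quotes Shakespeare aligns with any earlier host quoting the same line, and the date comparison removes it.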

A widening practice

Distinct poems quoted per 100 pages of prosody writing — a rate, so it climbs only if quotation itself broadened, not merely because the archive grew.

Method

Holding the archive constant

A raw count of distinct poems can’t separate a widening practice from a thickening archive — more host pages simply mean more chances to quote. So we model a rate, with the archive’s size netted out.

\[ A(y) \sim \mathrm{NegBin}(\mu_y,\, \alpha_{\mathrm{disp}}) \qquad\quad \log \mu_y = \alpha + s(y) + \log P(y) \]
  • a rate, not a count: The log‑pages offset. Page volume enters with its coefficient fixed at 1: double the host pages, double the expected poems, by construction. The model is then free to find the departures from that proportionality.
  • a flexible trend: A natural cubic spline \( s(y) \). Free to bend where the evidence calls for it; 7 degrees of freedom, chosen by AIC. “Natural” keeps the curve from flailing at the window’s edges.
  • room to wobble: Negative Binomial, not Poisson. Yearly counts vary far more than their mean (one dense anthology spikes a year), so we relax Poisson’s variance‑equals‑mean assumption. \( \alpha_{\mathrm{disp}} = 0.5 \); post‑fit dispersion ≈ 1.01.

Modelled window 1694–1920  ·  1694 = first year with ≥5 PPA host works  ·  1920 = the PPA’s coverage horizon
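Reading the offset as a rate takes one line of algebra. The rearrangement below uses only the mean equation from this slide; the variance line assumes the common NB2 parameterization, which the slides do not state.

\[ \log \mu_y = \alpha + s(y) + \log P(y) \;\;\Longleftrightarrow\;\; \log \frac{\mu_y}{P(y)} = \alpha + s(y) \;\;\Longleftrightarrow\;\; \frac{\mu_y}{P(y)} = e^{\alpha + s(y)} \]

\[ \operatorname{Var}\big(A(y)\big) = \mu_y + \alpha_{\mathrm{disp}}\,\mu_y^{2} \qquad \text{(NB2 form, assumed)} \]

So \( e^{\alpha + s(y)} \) is the expected number of distinct poems per host page (multiply by 100 for the per-100-pages rate), and the spline traces how that rate moves across the window. Under the assumed NB2 form with \( \alpha_{\mathrm{disp}} = 0.5 \), a year with \( \mu_y = 100 \) would have variance \( 100 + 0.5 \cdot 100^{2} = 5{,}100 \), against 100 under Poisson.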

A trajectory for every poem

Which poems did prosodists reach for more over time — and which did they quietly set aside?

\[ \operatorname{logit}\,(p_{it}) = \underbrace{\alpha + s(t)}_{\text{shared time curve}} \;+\; \underbrace{b^{0}_{i} + b^{1}_{i}\,t}_{\text{poem slope}} \;+\; \underbrace{u_{a(i)}}_{\text{author shift}} \]

\( p_{it} \) — a poem’s prevalence: the share of a decade’s prosody works that quote it.

\( k_{it} \sim \text{Beta-Binomial}\,(n_t,\, p_{it}) \) — counts are overdispersed, so not a plain binomial.

  • Partial pooling. Thin-evidence poems are pulled toward the corpus trend; well-attested ones pull it.
  • Beta-binomial. Absorbs the superstar overdispersion a plain binomial can’t.
  • Eligibility window. Each poem is scored only in decades it could have been quoted (no left-truncation bias).

Hierarchical Bayesian · 896 sustained-presence poems (≥30 host works across ≥10 decades)
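The model's input, observed prevalence inside an eligibility window, can be sketched directly. This is an illustration under stated assumptions: decades are keyed by their start year, the 18-year lag is the one the deck gives for eligibility, snapping the window to the decade that contains birth + 18 is a guess, and the counts are toy values.

```python
def first_eligible_decade(birth_year, lag_years=18):
    """Start year of the decade containing birth + lag (snapping is an assumption)."""
    return (birth_year + lag_years) // 10 * 10


def prevalence(k_by_decade, n_by_decade, birth_year, lag_years=18):
    """Observed share k/n per eligible decade; ineligible decades are omitted, not zero."""
    start = first_eligible_decade(birth_year, lag_years)
    return {
        decade: k_by_decade.get(decade, 0) / n
        for decade, n in sorted(n_by_decade.items())
        if decade >= start and n > 0
    }


# Toy counts for a poet born in 1809 (Tennyson's birth year).
n_works = {1790: 40, 1820: 50, 1850: 80}  # host works per decade
k_quoting = {1850: 20}                    # host works quoting the poem
observed = prevalence(k_quoting, n_works, birth_year=1809)
```

The 1790s drop out of the result entirely, while the 1820s stay in as a genuine zero: that is the difference between impossibility and neglect.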

Risers, decliners, anchors

Medieval and Romantic verse climbs; neoclassical translation fades; the perennials hold.

Method

Reading each poem against the corpus

\[ \operatorname{logit}\,(p_{it}) = \underbrace{\alpha + s(t)}_{\text{shared trend}} \;+\; \underbrace{b^{0}_{i} + b^{1}_{i}\,t}_{\text{poem deviation}} \;+\; \underbrace{u_{a(i)}}_{\text{author shift}} \]
  • the input: Prevalence per decade. For each eligible poem, \( \hat{p}_{it} = k_{it}/n_t \), the share of that decade’s PPA host works that quote it. Fitted with a beta‑binomial rather than a binomial, because the superstar poems are overdispersed.
  • the headline number: The slope \( b^{1}_{i} \). Gaining or losing ground relative to the shared trend, not in absolute terms. A positive slope outpaces the average poem; a negative one lags even as the whole practice widens around it.
  • reading the slope: Log‑odds per decade. +0.25 ≈ ×1.28 odds (~28% higher odds of being quoted each decade); −0.20 ≈ ×0.82 (~18% lower). “Credibly” rising/declining = the 95% credible interval on \( b^{1}_{i} \) excludes zero.
  • keeping it honest: Partial pooling + author shift. Thin‑evidence poems are pulled toward the corpus trend; well‑attested ones pull it. \( u_{a(i)} \) separates a poem’s own movement from its author’s standing (Shakespeare: 37 works; Gray rests almost entirely on the Elegy).

Eligibility starts 18 yrs after the poet’s birth — a Tennyson zero in the 1700s is impossibility, not neglect  ·  population = ≥30 host works across ≥10 decades → 896 poems  ·  Bambi / PyMC (NUTS), R̂ ≤ 1.02
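The slope readings quoted on this slide follow from exponentiating a log-odds change. The short sketch below reproduces the two multipliers and shows what one decade of such a slope does to a prevalence; the 10% baseline is illustrative, not a figure from the corpus.

```python
import math


def odds_multiplier(slope):
    """Multiplicative change in odds for one decade at the given log-odds slope."""
    return math.exp(slope)


def shift_probability(p, slope):
    """Prevalence after one decade, moving from p by `slope` on the log-odds scale."""
    odds = p / (1 - p) * odds_multiplier(slope)
    return odds / (1 + odds)


rising = odds_multiplier(0.25)           # about 1.28
falling = odds_multiplier(-0.20)         # about 0.82
shifted = shift_probability(0.10, 0.25)  # a little under 0.125
```

Odds and probabilities diverge as prevalence grows: from a 10% baseline, a ×1.28 change in odds raises the prevalence to about 12.5%, a relative gain slightly below 28%.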

Decliners

# | Work | Author / Translator | Slope | Host works
1 | Pastoral V (transl. 1697) | Virgil / John Dryden | −.27 | 31
2 | The Art of Poetry (transl. 1680) | Horace / Earl of Roscommon | −.25 | 92
3 | Prince Arthur (1695) | Sir Richard Blackmore | −.25 | 58
4 | On Roscommon’s Translation of Horace (1680) | Edmund Waller | −.24 | 32
5 | The Art of Poetry (transl. 1757) | Horace / William Duncombe | −.24 | 42
6 | The Georgics (transl. 1753) | Virgil / Joseph Warton | −.24 | 50
7 | Medulla Poetarum Romanorum (transl. 1737) | Henry Baker | −.23 | 214
8 | Pastoral I (transl. 1697) | Virgil / John Dryden | −.23 | 45
9 | The First Satire of Persius (transl. 1693) | Persius / John Dryden | −.22 | 39
10 | An Essay on Translated Verse (1684) | Earl of Roscommon | −.22 | 125
11 | Pastoral III (transl. 1697) | Virgil / John Dryden | −.22 | 35
12 | Metamorphoses, Book XII (transl. 1700) | Ovid / John Dryden | −.22 | 31
13 | Metamorphoses, Book I (transl. 1693) | Ovid / John Dryden | −.21 | 48
14 | An Essay upon Poetry (1682) | Duke of Buckingham | −.21 | 79
15 | Virgil’s Æneid (transl. 1740) | Virgil / Christopher Pitt | −.20 | 140

Risers

# | Work | Author | Slope | Host works
1 | The Rime of the Ancient Mariner (1798) | Samuel Taylor Coleridge | +.35 | 287
2 | The Daffodils (1807) | William Wordsworth | +.31 | 109
3 | Ode: Intimations of Immortality (1807) | William Wordsworth | +.30 | 263
4 | Tintern Abbey (1798) | William Wordsworth | +.26 | 170
5 | The Eve of St Agnes (1820) | John Keats | +.24 | 108
6 | Prometheus Unbound (1820) | Percy Bysshe Shelley | +.23 | 128
7 | Is There for Honest Poverty (1795) | Robert Burns | +.21 | 64
8 | Christabel (1816) | Samuel Taylor Coleridge | +.21 | 157
9 | The Education of Nature (1800) | William Wordsworth | +.20 | 74
10 | On His Blindness (1673) | John Milton | +.19 | 99
11 | Peter Bell (1819) | William Wordsworth | +.19 | 63
12 | To Daffodils (1648) | Robert Herrick | +.19 | 55
13 | Sonnet 73 (1609) | William Shakespeare | +.18 | 70
14 | To a Waterfowl (1818) | William Cullen Bryant | +.18 | 71
15 | Ae Fond Kiss (1791) | Robert Burns | +.18 | 41

Anchors

# | Work | Author | Slope | Host works
1 | Psalms (1611) | King James Bible | −.011 | 1,110
2 | Proverbs (1611) | King James Bible | +.007 | 738
3 | Job (1611) | King James Bible | −.005 | 683
4 | Othello (1623) | William Shakespeare | +.017 | 661
5 | King Lear (1623) | William Shakespeare | +.021 | 579
6 | The Faerie Queene (1596) | Edmund Spenser | +.023 | 560
7 | Henry VIII (1623) | William Shakespeare | +.026 | 544
8 | Night Thoughts (1742–46) | Edward Young | −.003 | 526
9 | Henry IV, Part 1 (1598) | William Shakespeare | +.027 | 499
10 | Essay on Man, Epistle IV (1734) | Alexander Pope | −.018 | 490
11 | King John (1623) | William Shakespeare | +.026 | 467
12 | Essay on Man, Epistle I (1733) | Alexander Pope | −.019 | 462
13 | Henry IV, Part 2 (1600) | William Shakespeare | +.009 | 459
14 | Henry V (1623) | William Shakespeare | +.007 | 450
15 | Measure for Measure (1623) | William Shakespeare | +.020 | 379

Which eras gained ground

Method

From poems to periods

No new model: the same per‑poem slopes \( b^{1}_{i} \), re‑cut by the literary period each poem belongs to. Labels are Chadwyck‑Healey’s bibliographic categories (the database’s organizing logic).

  • why a period can surprise: A period is a different unit from poems or authors. It can rise even when most of its authors are flat (a few rise sharply), or fall even when most of its poems hold (the high‑volume ones decline). The ridge checks whether the stories aggregate.
  • what’s included: Only periods with ≥20 modelled poems. The catch‑all “Poems & Miscellanies, 1500–1900” (for poems that fit no named period) is excluded.
  • inside the Restoration ridge: It’s bimodal. The classical‑translation and didactic‑verse apparatus fell hard (≈ −0.15), while Dryden’s original verse declined only mildly (≈ −0.05, intervals straddling zero).
  • inside Jacobean / Caroline: Layered. Short lyric and song rose, the Shakespeare plays and Milton’s longer poems held, the heroic‑narrative tradition gave way: lyricization in miniature.

Slopes in log‑odds per decade vs the shared trend  ·  Middle English sits furthest right — Chaucer and the founders the nineteenth century canonized
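Re-cutting the per-poem slopes by period is a grouping step with two inclusion rules. A minimal sketch, assuming the slopes arrive as (period label, slope) pairs; summarizing each period by its median is an assumption for illustration (the slide shows full distributions), and the toy periods are not real labels.

```python
from collections import defaultdict
from statistics import median

CATCH_ALL = "Poems & Miscellanies, 1500–1900"  # excluded catch-all label


def period_summary(poem_slopes, min_poems=20, exclude=(CATCH_ALL,)):
    """Group per-poem slopes by period; keep periods with at least `min_poems` poems."""
    by_period = defaultdict(list)
    for period, slope in poem_slopes:
        if period not in exclude:
            by_period[period].append(slope)
    return {
        period: {"n": len(slopes), "median_slope": median(slopes)}
        for period, slopes in by_period.items()
        if len(slopes) >= min_poems
    }


# Toy input: only "Period A" has enough poems to be reported.
toy = (
    [("Period A", 0.10)] * 10
    + [("Period A", 0.20)] * 10
    + [("Period B", -0.10)] * 5
    + [(CATCH_ALL, 0.00)] * 30
)
summary = period_summary(toy)
```

A single summary number hides the kind of bimodality described for the Restoration, which is why the deck reads the whole distribution of slopes per period and not one statistic.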

Selected references

  • Ayoola, Tom, et al. “ReFinED: An Efficient Zero-Shot-Capable Approach to End-to-End Entity Linking.” arXiv:2207.04108, arXiv, 8 July 2022.
  • Benedict, Barbara M. “Choice Reading: Anthologies, Reading Practices and the Canon, 1680–1800.” The Yearbook of English Studies, vol. 45, 2015, p. 35.
  • Bonnell, Thomas F. “Bookselling and Canon-Making: The Trade Rivalry over the English Poets, 1776–1783.” Studies in Eighteenth-Century Culture, 1990.
  • Bonnell, Thomas Frank. The Most Disreputable Trade: Publishing the Classics of English Poetry, 1765–1810. Oxford University Press, 2008.
  • Capretto, Tomás, et al. “Bambi: A Simple Interface for Fitting Bayesian Linear Models in Python.” Journal of Statistical Software, 2022.
  • Compagnon, Antoine. La Seconde Main: Ou, Le Travail de la Citation. Seuil, 1979.
  • Gini, Corrado. Variabilità e Mutabilità: Contributo Allo Studio Delle Distribuzioni e Relazioni Statistiche. Cuppini, 1912.
  • Guillory, John. Cultural Capital: The Problem of Literary Canon Formation. University of Chicago Press, 1993.
  • Jackson, Virginia. Dickinson’s Misery: A Theory of Lyric Reading. Princeton University Press, 2005.
  • Karlin, Daniel. “Introduction.” English Poetry: A Bibliography of the English Poetry Full-Text Database, Chadwyck-Healey, 1995.
  • Koeser, Rebecca Sutton, et al. Princeton Prosody, from Archive to Dataset. 2026. [forthcoming]
  • Koeser, Rebecca Sutton, et al. Visualizing the Collections. 2020.
  • LeGette, Casie. Remaking Romanticism: The Radical Politics of the Excerpt. Palgrave Studies in the Enlightenment, Romanticism and the Cultures of Print, 2017.
  • Martin, Meredith. Poetry’s Data: Digital Humanities and the History of Prosody. Princeton University Press, 2025.
  • Martin, Meredith. The Rise and Fall of Meter: Poetry and English National Culture, 1860–1930. Princeton University Press, 2012.
  • McCutcheon, Mark A. “The Cento, Romanticism, and Copyright.” English Studies in Canada, 2012.
  • McElreath, Richard. Statistical Rethinking: A Bayesian Course with Examples in R and Stan. 2nd ed., Chapman & Hall/CRC, 2020.
  • Prins, Yopie. Victorian Sappho. Princeton University Press, 1999.
  • Rosen, Sherwin. “The Economics of Superstars.” The American Economic Review, 1981.
  • Smith, David A., et al. “Detecting and Modeling Local Text Reuse.” IEEE/ACM Joint Conference on Digital Libraries, 2014.
  • Thompson, Laure, et al. “Princeton Prosody Archive Found Poems Dataset.” Zenodo, 2026.

Thank you

Questions & conversation

wouter.haverals@princeton.edu