Datasets

There are some great datasets out there! In this course, we will encounter and play with some of the ones listed below. This list cannot be exhaustive. Treat it as a point of reference, but perhaps more so as a source of inspiration. Should you stumble upon a fascinating dataset or have one to recommend, please bring it to my attention so we can enrich our collection further.

One way to find datasets is to use a dedicated search engine like the Google Dataset Search or Zenodo. Other places where datasets live are GitHub, Kaggle, and the Journal of Open Humanities Data (JOHD). The website Humanities Data also offers a curated list of datasets. There’s also Archives Hub, which provides access to descriptions of archives held in UK universities and colleges.

In the spirit of John Donne’s insight that “No man is an island,” the field of Digital Humanities thrives on collaborative curation of resources. With this in mind, here are some excellent compilations of datasets put together by other scholars:

Sierra Eckert provides a curated list for an earlier iteration of this course, which is perfect for humanities-centric explorations.
Melanie Walsh offers an extensive compilation of datasets that includes everything from Nobel Laureate data to Game of Thrones character networks.
Miriam Posner curates a selection tailored for her graduate students, and it’s a goldmine for anyone in the humanities.
Alan Liu’s DH Toychest: This is a veritable digital toolbox containing a variety of data collections, from smaller text corpora to extensive archives that span documents and images.

Historical Texts and Documents

Project Gutenberg: A treasure trove of ca. 70,000 free eBooks, from classic literature to historical texts.
Gutenberg Poetry Corpus: Dataset of poetry extracted from the Project Gutenberg archives.
Shakespeare’s Works: Comprehensive access to the Bard’s plays and poems, available in multiple digital formats (doc, txt, pdf, xml).
DraCor: A growing collection of plays in (mostly) European languages, all encoded in TEI.
Early English Books Online: A rich collection of titles from the dawn of printing in England until the 17th century.
HathiTrust Digital Library: A repository of digitized works from research libraries around the world.
Internet Archive: A digital library offering books, movies, software, and more.
Google Books: Search and preview millions of books from libraries and publishers worldwide.
The Black Short Story Dataset: A collection of over 600 African American short stories found in 100 anthologies published from 1925 to 2017.
Early Novels Database: This collection provides metadata for over 2,000 novels published between 1660 and 1850, allowing for a comprehensive look at the evolution of the novel during this period. Both the complete metadata dataset and a smaller collection focusing on 25 novels, which also includes their full texts, are available.
Old Bailey Online: Detailed accounts of over 197,000 criminal trials at London’s central criminal court. The dataset includes the text of the trials, biographical information about the accused, and other details. Also take a look at the website of the Digital Panopticon, which brings together various research results on the lives of 90,000 convicts from the Old Bailey.
The Inaugural Address Corpus: 57 texts of U.S. presidential inaugural addresses.
Feeding America: The Historic American Cookbook Dataset: 76 influential American cookbooks from MSU Libraries’ special collections, representing various periods and themes in American cookbook history.
REED Online (Records of Early English Drama): A collection of the surviving records of drama, secular music, and other popular entertainment in England before 1642.
DocSouth Data: A collection of datasets and texts from the University of North Carolina’s Documenting the American South project: texts, images, and audio files related to southern history, literature, and culture. Currently DocSouth includes sixteen thematic collections of books, diaries, posters, artifacts, letters, oral history interviews, and songs.
SlaveVoyages: Collaborative digital initiative that compiles and makes publicly accessible records of the largest slave trades in history.
The Survey of Scottish Witchcraft: A database of people accused of witchcraft in Scotland between 1563 and 1736.
ToposText: An indexed compilation of ancient texts and mapped places significant to the history and mythology of the ancient Greeks, covering periods from the Neolithic era to the 2nd century CE.
The Chinese Text Project: A digital library of pre-modern Chinese texts. With over 30,000 titles and more than 5 billion characters, it is the largest database of pre-modern Chinese texts in existence.
Eighteenth-Century Poetry Archive (ECPA): Collaborative digital archive and research project devoted to the poetry of the long eighteenth century. ECPA builds on the electronic texts created by the Text Creation Partnership from Gale’s Eighteenth Century Collections Online (ECCO).
Coptic Scriptorium: A digital library of texts and tools for the study of Coptic literature. Over 1,278,500 tokens of searchable, linguistically analyzed Coptic data from dozens of ancient Coptic works.
The Digital Library of the Middle East: Large collection of digital texts, images, and other media from the Middle East, including manuscripts, maps, and photographs.
The Digital Library of the Caribbean: A cooperative digital library for resources from and about the Caribbean and circum-Caribbean.
French Revolution Digital Archive: A collection of primary source documents from the French Revolution, including the Parliamentary archives and images of the French Revolution.
Victorian Women Writers Project: Project primarily concerned with the exposure of lesser-known British women writers of the 19th century. The collection represents an array of genres - poetry, novels, children’s books, political pamphlets, religious tracts, histories, and more.
American Verse Project: Electronic archive of volumes of American poetry prior to 1920. The full text of each volume of poetry was converted into digital form and coded adhering to the TEI guidelines.
African American Literature Text Corpus: Text corpus of African American fiction and poetry from 1853 to 1923.
The Westminster Detective Library: A collection of hundreds of detective stories, all printed before 1891 (i.e. pre-Sherlock Holmes!).
The First World War Poetry Digital Archive: A repository of over 7,000 items of text, images, audio, and video for teaching, learning, and research.
Broadside Ballads Online from the Bodleian Libraries: Digital collection of English printed ballad-sheets from between the 16th and 20th centuries, linked to other resources for the study of the English ballad tradition.
European Literary Text Collection: ELTeC is a collection of corpora of literary texts that are comparable in nature, scope and quality across several European languages. Its availability is an essential condition for the creation, evaluation and use of multilingual tools and methods of analysis for literary texts.
PoeTree: PoeTree is a standardized collection of poetry corpora comprising over 330,000 poems in ten languages (Czech, English, French, German, Hungarian, Italian, Portuguese, Russian, Slovenian, Spanish).

Newspapers and Magazines

Chronicling America: Historic American newspapers.
British Newspaper Archive: Digital archive of British historical newspapers.
Trove - Australian Newspapers: Digitized version of various Australian newspapers.
Historic Mexican & Mexican American Press: A collection of historic Mexican and Mexican American publications from the mid-1800s to the 1970s.
The New York Times Annotated Corpus: A corpus of New York Times articles from 1987 to 2007, hand-annotated with linguistic and semantic tags.
National Geographic Covers: A dataset of National Geographic magazine covers from 1960 to 2018.

Art and Image Collections

The Met Collection API: Access to The Met’s art collection.
The Museum of Modern Art (MoMA) Collection: This research dataset contains 140,848 records, representing all of the works that have been accessioned into MoMA’s collection and cataloged in its database. It includes basic metadata for each work, including title, artist, date made, medium, dimensions, and date acquired by the museum.
The Rijksmuseum API: Access art from the Rijksmuseum in Amsterdam.
Europeana Collections: Digital access to European cultural heritage artifacts.
Smithsonian Open Access: Explore digitized images, texts, videos, and sound recordings from the Smithsonian’s collections.
David Rumsey Historical Map Collection: A wonderful collection of historical maps.
The Digital Cicognara Library: Digital collection of early literature on art and archaeology, replicating and expanding the original 5,000-volume library of Leopoldo Cicognara held at the Vatican. The texts are primarily in Italian, French, English, German, and Latin.
Flickr Creative Commons: Many Flickr users have chosen to offer their work under a Creative Commons license, and you can browse or search through content under each type of license.
The British Library’s Flickr Commons: A collection of over a million public domain images from the British Library.
Graphic Novel Corpus: 253 graphic narratives written in English and published in the United States, Great Britain, Canada, and India.
The New Yorker Cartoon Caption Contest Dataset: A dataset of New Yorker cartoons and the captions that were submitted to the magazine’s weekly caption contest.
USDA Pomological Watercolor Collection: A collection of 7,584 watercolors, painted between 1886 and 1942 and used by the U.S. Department of Agriculture to document existing fruit and nut cultivars.
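
Several of the collections above are exposed through web APIs. As a rough illustration, here is a minimal sketch of how requests to The Met Collection API might be prepared in Python. It assumes the public base URL, the `search` and `objects` endpoints, and the field names (`title`, `artistDisplayName`, `objectDate`, `medium`) as they appear in The Met's documentation at the time of writing, so check the current documentation before relying on them. The helper functions and the sample record are made up for illustration, and the sketch only builds URLs and parses a record; it does not contact the server.

```python
import json
from urllib.parse import urlencode

# Assumed base URL of The Met Collection API (no API key needed at the time of writing).
BASE_URL = "https://collectionapi.metmuseum.org/public/collection/v1"


def search_url(query, has_images=True):
    """Build the URL for a keyword search; the response lists matching object IDs."""
    # The documentation's examples place the q parameter last, so we do the same.
    params = {"hasImages": str(has_images).lower(), "q": query}
    return f"{BASE_URL}/search?{urlencode(params)}"


def object_url(object_id):
    """Build the URL for the full record of a single object."""
    return f"{BASE_URL}/objects/{object_id}"


def summarize(record):
    """Reduce one object record (a dict) to a few fields of interest."""
    return {
        "title": record.get("title", ""),
        "artist": record.get("artistDisplayName", ""),
        "date": record.get("objectDate", ""),
        "medium": record.get("medium", ""),
    }


# A made-up record for illustration only; real records contain many more fields.
sample = json.loads(
    '{"objectID": 1, "title": "Example Work", "artistDisplayName": "Jane Doe",'
    ' "objectDate": "1900", "medium": "Oil on canvas"}'
)

print(search_url("sunflowers"))
print(object_url(sample["objectID"]))
print(summarize(sample))
```

To retrieve real data, one would pass these URLs to an HTTP client (for example `urllib.request.urlopen` from the standard library) and decode the JSON response before handing it to `summarize`.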

Music and Audio Archives

Internet Archive’s Audio Archive: Access to music recordings, audiobooks, podcasts, and radio programs.
BBC Sound Effects: Over 16,000 sound effects from the BBC.
Billboard Hot 100 Lyrics: Kaylin Walker’s dataset captures five decades of pop music lyrics for textual analysis.
The Million Song Dataset: A repository of audio features and metadata for a million contemporary popular music tracks.
The Free Music Archive: 343 days of Creative Commons-licensed audio from 106,574 tracks from 16,341 artists.
VoxPopuli Speech Corpus: Audio and transcriptions of speeches from European Parliament sessions spanning 2009-2020. It includes data for 18 languages and features 29 hours of non-native English speech.
Elders Project: The Baldwin-Emerson Elders Project is a nationwide initiative focused on preserving the histories of Black, Latine, Asian, Indigenous, and queer elders.

Movie and Dialogue Datasets

Cornell Movie-Dialogs Corpus: Large metadata-rich collection of fictional conversations extracted from raw movie scripts.
MovieLens: A collection of movie ratings and metadata, perfect for those interested in recommender systems and collaborative filtering.
IMDb Non-Commercial Datasets: These datasets from IMDb offer detailed information on movies, TV shows, cast, crew, ratings, and more.
OpenSubtitles: Movie and TV subtitles, perfect for textual analysis, language modeling, and translation studies, offering insights into dialogue trends, language use, and cultural references.
The Movies Dataset: A collection of metadata on over 45,000 movies, including 26 million ratings from over 270,000 users.
Bechdel Test Film Dataset: An analysis by FiveThirtyEight on gender bias in films. This dataset underpins a 2014 article that investigates Hollywood’s gender disparities through the lens of the Bechdel Test.
Early African-American Film Database, 1909–1930: This dataset focuses on silent “race films” created before 1930 featuring African-Americans for primarily African-American audiences. It compiles records on films, actors, production companies, and other elements of the early race film industry.
Skate Video Dataset: A collection of metadata on skateboarding videos, including information on the skaters, and the music used in the videos. This dataset was used in an essay by The Pudding – The Good, the Rad, and the Gnarly – published in June 2018.
Kinolab: The Kinolab platform invites users into the collections via five principal entry points: Films, Series, Directors, Genres, and Tags. The terminus of each of these pathways is the individual clip page, where users can view a clip and its associated film language tags which link to other clips in the collection sharing the same tag.

Datasets curated and maintained at Princeton University 🐯

Princeton University’s Center for Digital Humanities (CDH): The CDH is a hub for digital humanities research, teaching, and learning at Princeton University. Check out their datasets and projects!
Princeton University’s Special Collections: Portal to Digital Humanities projects carried out at Princeton University, taking full advantage of some wonderful materials housed at Special Collections. Also check out this overarching search tool for Princeton University’s databases.
Princeton Prosody Archive: A collection of digitized texts for the study of prosody.
Derrida’s Library Annotations: An intimate look into the margins of Jacques Derrida’s personal library, where the philosopher’s handwritten notes offer a unique perspective on his thoughts and work.
Virgin Mary Tales: A narrative compilation that spans centuries, documenting tales of the Virgin Mary within Ethiopia, Eritrea, and Egypt from 1300 to the present.
Shakespeare & Company Lending Library Records: The borrowing records from the iconic English-language lending library in 1920s-1930s Paris.
Princeton University Art Museum: The Princeton University Art Museum collection includes over 113,000 works of art that range from ancient to contemporary art and span the globe. The museum’s API offers digital access to detailed collection data. This includes information about each artwork, like the creator, creation date, physical dimensions, medium used, and analytical insights from curators or scholars. For more information, see also this excellent blog post.
Papers of Princeton: A collection of digitized Princeton University student newspapers, including The Daily Princetonian, The Woman’s Newspaper of Princeton, and The Nassau Literary Review.

Miscellaneous

North American Comics Metadata: Dive into the metadata from the Michigan State University Library Comics Art Collection.
NYPL’s Menu Collection: A taste of history through the New York Public Library’s transcribed menus, accompanied by a glossary.
The Office (U.S.) Dialogue Dataset: Analyze the dialogue from the popular TV show “The Office”.
The Simpsons Characters Data: A dataset of ca. 40K images of 20 different Simpsons characters.
Superheroes Dataset: A dataset containing information on various superheroes and their attributes.
Doctor Who Villain List: Doctor Who has fought more than 400 villains and monsters. Find out here who they are - and which appeared most often.
Shipwreck Database: Based on A.J. Parker’s 1992 work Ancient Shipwrecks of the Mediterranean and the Roman Provinces, this dataset provides geographic and archaeological details of 1,368 documented shipwrecks.
Dog Names in New York City: A dataset of dog names registered in New York City (what did you expect?).
Western Europe 650 Year Grape Harvest Date Database: A dataset of grape harvest dates from 1354 to 2006, compiled from a variety of sources.
Museum Salary Transparency: A crowdsourced spreadsheet that collects information on museum salaries.
All Places of Worship in the United States: A dataset of all places of worship in the United States, including their names, addresses, and geographical coordinates.
The COVID Tracking Project: A volunteer organization launched from The Atlantic that collects and publishes the most complete testing data available for U.S. states and territories.
COVID-19 Open Research Dataset (CORD-19): Corpus of academic papers about COVID-19 and related coronavirus research, curated and maintained by the Semantic Scholar team at the Allen Institute for AI to support text mining and NLP research.
International Database and Gallery of Structures: Database of structures, including bridges, buildings, cathedrals, rollercoasters, towers and other large structures, with detailed information on their design, construction, and history.
Project Arctic Shift: Making Reddit data accessible to researchers, moderators and everyone else. Interact with the data through large dumps, an API or web interface.