Datasets

There are some great datasets out there! In this course, we will encounter and play with some of the ones listed below. This list cannot be exhaustive. Look at it as a point of reference, but perhaps more so as a source of inspiration. Should you stumble upon a fascinating dataset or have one to recommend, please bring it to my attention so we can enrich our collection further.

One way to find datasets is to use a dedicated search engine like Google Dataset Search, or to search a repository like Zenodo. Other places where datasets live are GitHub, Kaggle, and the Journal of Open Humanities Data (JOHD). The website Humanities Data also offers a curated list of datasets. There’s also Archives Hub, which provides access to descriptions of archives held in UK universities and colleges.

In the spirit of John Donne’s insight that “No man is an island,” the field of Digital Humanities thrives on collaborative curation of resources. With this in mind, here are some excellent compilations of datasets put together by other scholars:

Historical Texts and Documents

  • Project Gutenberg: A treasure trove of ca. 70,000 free eBooks, from classic literature to historical texts.
  • Gutenberg Poetry Corpus: Dataset of poetry extracted from the Project Gutenberg archives.
  • Shakespeare’s Works: Comprehensive access to the Bard’s plays and poems, available in multiple digital formats (doc, txt, pdf, xml).
  • DraCor: A growing collection of plays in (mostly) European languages, all encoded in TEI.
  • Early English Books Online: A rich collection of titles from the dawn of printing in England until the 17th century.
  • HathiTrust Digital Library: A repository of digitized works from research libraries around the world.
  • Internet Archive: A digital library offering books, movies, software, and more.
  • Google Books: Search and preview millions of books from libraries and publishers worldwide.
  • The Black Short Story Dataset: A collection of over 600 African American short stories found in 100 anthologies published from 1925 to 2017.
  • Early Novels Database: This collection provides metadata for over 2,000 novels published between 1660 and 1850, allowing for a comprehensive look at the evolution of the novel during this period. Access the complete metadata dataset, as well as a smaller collection focusing on 25 novels, which also includes their full texts.
  • Old Bailey Online: Detailed accounts of over 197,000 criminal trials at London’s central criminal court. The dataset includes the text of the trials, biographical information about the accused, and other details. Also take a look at the website of the Digital Panopticon, which brings together various research results on the lives of 90,000 convicts from the Old Bailey.
  • The Inaugural Address Corpus: 57 texts of U.S. presidential inaugural addresses.
  • Feeding America: The Historic American Cookbook Dataset: 76 influential American cookbooks from MSU Libraries’ special collections, representing various periods and themes in American cookbook history.
  • REED Online (Records of Early English Drama): A collection of the surviving records of drama, secular music, and other popular entertainment in England before 1642.
  • DocSouth Data: A collection of datasets and texts from the University of North Carolina’s Documenting the American South project: texts, images, and audio files related to southern history, literature, and culture. Currently DocSouth includes sixteen thematic collections of books, diaries, posters, artifacts, letters, oral history interviews, and songs.
  • SlaveVoyages: Collaborative digital initiative that compiles and makes publicly accessible records of the largest slave trades in history.
  • The Survey of Scottish Witchcraft: A database of people accused of witchcraft in Scotland between 1563 and 1736.
  • ToposText: An indexed compilation of ancient texts and mapped places significant to the history and mythology of the ancient Greeks, covering periods from the Neolithic era to the 2nd century CE.
  • The Chinese Text Project: A digital library of pre-modern Chinese texts. With over 30,000 titles and more than 5 billion characters, it is the largest database of pre-modern Chinese texts in existence.
  • Eighteenth-Century Poetry Archive (ECPA): Collaborative digital archive and research project devoted to the poetry of the long eighteenth century. ECPA builds on the electronic texts created by the Text Creation Partnership from Gale’s Eighteenth Century Collections Online (ECCO).
  • Coptic Scriptorium: A digital library of texts and tools for the study of Coptic literature. Over 1,278,500 tokens of searchable, linguistically analyzed Coptic data from dozens of ancient Coptic works.
  • The Digital Library of the Middle East: Large collection of digital texts, images, and other media from the Middle East, including manuscripts, maps, and photographs.
  • The Digital Library of the Caribbean: A cooperative digital library for resources from and about the Caribbean and circum-Caribbean.
  • French Revolution Digital Archive: A collection of primary source documents from the French Revolution, including the Parliamentary archives and images of the French Revolution.
  • Victorian Women Writers Project: Project primarily concerned with the exposure of lesser-known British women writers of the 19th century. The collection represents an array of genres: poetry, novels, children’s books, political pamphlets, religious tracts, histories, and more.
  • American Verse Project: Electronic archive of volumes of American poetry prior to 1920. The full text of each volume of poetry was converted into digital form and coded adhering to the TEI guidelines.
  • African American Literature Text Corpus: A text corpus of African American fiction and poetry from 1853 to 1923.
  • The Westminster Detective Library: A collection of hundreds of detective stories, all printed before 1891 (i.e. pre-Sherlock Holmes!).
  • The First World War Poetry Digital Archive: A repository of over 7,000 items of text, images, audio, and video for teaching, learning, and research.
  • Broadside Ballads Online from the Bodleian Libraries: Digital collection of English printed ballad-sheets from between the 16th and 20th centuries, linked to other resources for the study of the English ballad tradition.
  • European Literary Text Collection: ELTeC is a collection of corpora of literary texts that are comparable in nature, scope and quality across several European languages. Its availability is an essential condition for the creation, evaluation and use of multilingual tools and methods of analysis for literary texts.
  • PoeTree: PoeTree is a standardized collection of poetry corpora comprising over 330,000 poems in ten languages (Czech, English, French, German, Hungarian, Italian, Portuguese, Russian, Slovenian, Spanish).

Newspapers and Magazines

Art and Image Collections

  • The Met Collection API: Access to The Met’s art collection.
  • The Museum of Modern Art (MoMA) Collection: This research dataset contains 140,848 records, representing all of the works that have been accessioned into MoMA’s collection and cataloged in its database. It includes basic metadata for each work, including title, artist, date made, medium, dimensions, and date acquired by the museum.
  • The Rijksmuseum API: Access art from the Rijksmuseum in Amsterdam.
  • Europeana Collections: Digital access to European cultural heritage artifacts.
  • Smithsonian Open Access: Explore digitized images, texts, videos, and sound recordings from the Smithsonian’s collections.
  • David Rumsey Historical Map Collection: A wonderful collection of historical maps.
  • The Digital Cicognara Library: Digital collection of early literature on art and archaeology, replicating and expanding the original 5,000-volume library of Leopoldo Cicognara held at the Vatican. The texts are primarily in Italian, French, English, German, and Latin.
  • Flickr Creative Commons: Many Flickr users have chosen to offer their work under a Creative Commons license, and you can browse or search through content under each type of license.
  • The British Library’s Flickr Commons: A collection of over a million public domain images from the British Library.
  • Graphic Novel Corpus: 253 graphic narratives written in English and published in the United States, Great Britain, Canada, and India.
  • The New Yorker Cartoon Caption Contest Dataset: A dataset of New Yorker cartoons and the captions that were submitted to the magazine’s weekly caption contest.
  • USDA Pomological Watercolor Collection: A collection of 7,584 watercolors, painted between 1886 and 1942 and used by the U.S. Department of Agriculture to document existing fruit and nut cultivars.

Music and Audio Archives

  • Internet Archive’s Audio Archive: Access to music recordings, audiobooks, podcasts, and radio programs.
  • BBC Sound Effects: Over 16,000 sound effects from the BBC.
  • Billboard Hot 100 Lyrics: Kaylin Walker’s dataset captures five decades of pop music lyrics for textual analysis.
  • The Million Song Dataset: A repository of audio features and metadata for a million contemporary popular music tracks.
  • The Free Music Archive: 343 days of Creative Commons-licensed audio from 106,574 tracks from 16,341 artists.
  • VoxPopuli Speech Corpus: Audio and transcriptions of speeches from European Parliament sessions spanning 2009-2020. It includes data for 18 languages and features 29 hours of non-native English speech.
  • Elders Project: The Baldwin-Emerson Elders Project is a nationwide initiative focused on preserving the histories of Black, Latine, Asian, Indigenous, and queer elders.

Movie and Dialogue Datasets

  • Cornell Movie-Dialogs Corpus: Large metadata-rich collection of fictional conversations extracted from raw movie scripts.
  • MovieLens: A collection of movie ratings and metadata, perfect for those interested in recommender systems and collaborative filtering.
  • IMDb Non-Commercial Datasets: These datasets from IMDb offer detailed information on movies, TV shows, cast, crew, ratings, and more.
  • OpenSubtitles: Movie and TV subtitles, perfect for textual analysis, language modeling, and translation studies, offering insights into dialogue trends, language use, and cultural references.
  • The Movies Dataset: A collection of metadata on over 45,000 movies, including 26 million ratings from over 270,000 users.
  • Bechdel Test Film Dataset: An analysis by FiveThirtyEight on gender bias in films. This dataset underpins a 2014 article that investigates Hollywood’s gender disparities through the lens of the Bechdel Test.
  • Early African-American Film Database, 1909–1930: This dataset focuses on silent “race films” created before 1930 featuring African-Americans for primarily African-American audiences. It compiles records on films, actors, production companies, and other elements of the early race film industry.
  • Skate Video Dataset: A collection of metadata on skateboarding videos, including information on the skaters and the music used in the videos. This dataset was used in an essay by The Pudding – The Good, the Rad, and the Gnarly – published in June 2018.
  • Kinolab: The Kinolab platform invites users into the collections via five principal entry points: Films, Series, Directors, Genres, and Tags. The terminus of each of these pathways is the individual clip page, where users can view a clip and its associated film language tags which link to other clips in the collection sharing the same tag.

Datasets curated and maintained at Princeton University 🐯

Miscellaneous