Bite-size ideas for the “Doing DH @ LC” Create-a-thon

Digital Humanities Preconference, August 5, 2024

Just a few initial seed ideas for projects, if you don’t have your own

  • Go to data.labs.loc.gov and browse the various Exploratory Data Packages that are available, read through the documentation, take a look at the visualizations, and see if anything piques your interest

  • Run through some of the Jupyter Notebook Tutorials for Data Exploration to get a sense of what’s possible

  • Get together with others at a table and create your own bite-sized ideas based on particular research questions, collections, or technical capabilities that interest you. Share them with LC staff (we’d definitely be interested!)


  • Piping LC data packages into a database tool like Datasette for SQL-like querying and exploratory visualization
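
    A minimal sketch of that pipeline, assuming you’ve downloaded one of the data packages as a CSV (the filename below is just a placeholder): load it into SQLite with the sqlite-utils Python library, then serve it with Datasette.

      import csv
      import sqlite_utils  # pip install sqlite-utils datasette

      # Load a downloaded LC data package CSV into a local SQLite database.
      # "lc_data_package.csv" is a placeholder for whatever file you grabbed.
      db = sqlite_utils.Database("lc.db")
      with open("lc_data_package.csv", newline="", encoding="utf-8") as f:
          db["records"].insert_all(csv.DictReader(f), alter=True)

      # Then, from the command line, serve it for SQL querying and faceted browsing:
      #   datasette lc.db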

  • By the People transcription datasets – asking LLMs to work with those transcribed papers
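
    One hedged sketch of that idea, assuming you have a transcription saved as plain text and an OpenAI API key (the file path and model name are placeholders; any chat-style LLM API would work similarly):

      from openai import OpenAI  # pip install openai; expects OPENAI_API_KEY to be set

      # Placeholder path to one transcription from a By the People dataset.
      with open("transcription.txt", encoding="utf-8") as f:
          text = f.read()

      client = OpenAI()
      response = client.chat.completions.create(
          model="gpt-4o-mini",  # placeholder model name
          messages=[
              {"role": "system", "content": "You summarize historical manuscript transcriptions."},
              {"role": "user", "content": "Summarize this transcription and list the people and places it mentions:\n\n" + text},
          ],
      )
      print(response.choices[0].message.content)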

  • Explore the Children’s Literature, Philosophy, or Local History Collections

    • The General Collections Assessment is an ongoing program to assess the Library’s approximately 22 million books, bound serials and other materials classified under the General Collections. As part of this project, the Library is making available for exploration the underlying bibliographic datasets used as the primary data sources for the collection assessments. https://data.labs.loc.gov/gen-coll-assessment/
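
      A minimal first pass in a notebook, assuming you’ve downloaded one of those bibliographic datasets as a CSV (the filename is a placeholder):

        import pandas as pd  # pip install pandas

        # Placeholder filename for one of the downloaded bibliographic CSVs.
        df = pd.read_csv("gen_coll_assessment.csv", low_memory=False)

        print(df.shape)               # how many records and columns
        print(df.columns.tolist())    # what fields are available
        print(df.head())              # a peek at the first few records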

  • Mappable collections:

  • MARC Library Catalog records

    • Interested in exploring library catalog records in bulk? Here are a few resources to get started!
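
      For example, if you download a bulk file of binary MARC records, the pymarc library can iterate over them (a sketch; the filename is a placeholder):

        from pymarc import MARCReader  # pip install pymarc

        # Placeholder path to a downloaded file of binary MARC (.mrc) records.
        with open("catalog_records.mrc", "rb") as fh:
            for record in MARCReader(fh):
                if record is None:
                    continue  # skip records pymarc could not parse
                title_fields = record.get_fields("245")  # MARC 245 = title statement
                if title_fields:
                    print(title_fields[0].value())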

  • A georeferencing tool like Allmaps for the Sanborns, or other Fire Insurance maps

  • Historic American Buildings Survey/Historic American Engineering Record/Historic American Landscapes Survey

    • One of the most popular collections at the Library, it includes the records of three National Park Service surveys that document achievements in architecture, engineering, and landscape design. A common request is to view a map of the survey locations, but the titles (which have the most precise location information) are difficult to work with using traditional geocoders. Large language models (LLMs) potentially offer a new route for geocoding and mapping this collection. If you’re interested in this challenge, let us know! We have a spreadsheet of data from the collection, including the titles of each survey. We’ve also pulled coordinates from Wikidata, which has crowdsourced locations for a small proportion of the collection.
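
      As a starting point on the mapping side, here is a minimal sketch that plots the subset of surveys that already have Wikidata coordinates, assuming a CSV join of the spreadsheet and those coordinates (the filename and column names are placeholders):

        import pandas as pd
        import folium  # pip install pandas folium

        # Placeholder: the survey spreadsheet joined with the Wikidata coordinates.
        df = pd.read_csv("habs_haer_hals_with_wikidata_coords.csv")
        located = df.dropna(subset=["latitude", "longitude"])

        # Interactive map of the surveys that already have coordinates.
        m = folium.Map(location=[39.8, -98.6], zoom_start=4)  # rough center of the U.S.
        for _, row in located.iterrows():
            folium.Marker([row["latitude"], row["longitude"]], popup=str(row["title"])).add_to(m)
        m.save("survey_locations.html")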

  • Analyzing archival Finding Aids

    • The Library currently has 3,139 archival collections described in EAD-XML finding aids, at https://findingaids.loc.gov/. Try your hand at analyzing the full text of the finding aids. You can use sentence transformers for topic modeling, or try out large language models for generating summaries and topic terms (a sketch follows below).

      • Tell us if you’re interested, and we can give you the full set of XML documents, full text extracted from the Container Lists, and a CSV of various subject fields plus collection-level narrative fields like Abstract, Scope and Contents, and Biographical/Historical notes.
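
        A minimal sketch of the embedding-and-clustering route, assuming you have the finding aid texts as a list of strings (the model name and cluster count are just illustrative choices):

          from sentence_transformers import SentenceTransformer  # pip install sentence-transformers scikit-learn
          from sklearn.cluster import KMeans

          # Placeholder texts; in practice, load the abstracts or container-list text here.
          texts = [
              "Papers of a nineteenth-century naturalist, including field notebooks and letters.",
              "Records of a civil rights organization, 1955-1970.",
              "Photographs and correspondence of a WPA-era muralist.",
          ]

          # Embed each finding aid text, then group similar ones with k-means.
          model = SentenceTransformer("all-MiniLM-L6-v2")  # a small general-purpose model
          embeddings = model.encode(texts)

          kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(embeddings)
          for label, text in zip(kmeans.labels_, texts):
              print(label, text[:60])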

  • Historical federal PowerPoints:

    • Our Web Archive collections include millions of PDFs, PowerPoints, audio files, data spreadsheets, and other files attached to websites going back to the late 1990s. At data.labs.loc.gov/dot-gov/, you’ll find a data package that contains random samples of these files from .gov websites archived between 1996 and 2017. For each of seven types of files (audio, CSV, image, PDF, PowerPoint, TSV, and Excel), 1,000 files are included, plus basic metadata about the files, including embedded titles, timestamps, and more. Try analyzing the text, performing image or audio analysis, or just having some fun visualizing the metadata.
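
      For instance, to pull the text out of one of the sampled presentations, the python-pptx library reads .pptx files (a sketch; the filename is a placeholder, and older binary .ppt files would need converting first):

        from pptx import Presentation  # pip install python-pptx

        # Placeholder path to one .pptx file from the dot-gov data package.
        prs = Presentation("sample_federal_deck.pptx")

        for i, slide in enumerate(prs.slides, start=1):
            for shape in slide.shapes:
                if shape.has_text_frame and shape.text_frame.text.strip():
                    print(f"slide {i}: {shape.text_frame.text.strip()}")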

  • Work with US Copyright Office bulk data:

    • https://data.copyright.gov/index.html

    • This data probably takes some help to work with and some explanation of how it’s structured, but if you’re interested, let us know and we can definitely help!

  • Color detection in Sanborn maps to extract material types

    • The Sanborn Maps Data Package includes over 400,000 detailed map images of United States cities and towns from the late 19th century through the mid-20th century. The images and their approximate locations are available for download at https://data.labs.loc.gov/sanborn/, and the collection is browsable at https://www.loc.gov/collections/sanborn-maps/about-this-collection/. Could image analysis be used to investigate geographic and chronological trends in building materials in the U.S.? The maps color-code buildings according to their building materials: for example, the yellow buildings on this sheet of Los Angeles were built with wood (frame), while the mostly red buildings on these sheets from the Bronx and St. Louis were built with brick. Stone and concrete buildings are typically blue, adobe buildings are gray, and special hazard buildings are green. Try your hand at writing a script that calculates the average color of each sheet and classifies sheets by predominant color (a sketch follows the tips below).

      • For color analysis, you could use a common Python image analysis library like Pillow or OpenCV (cv2)

      • For cluster analysis (to group sheets by predominant color), you could use a Python library like scikit-learn

      • Maybe it’s possible to update pieces of this code from NYPL Labs, which includes color detection for material types, among other things: https://github.com/nypl-spacetime/building-inspector
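
      A minimal sketch of the average-color idea, assuming the map sheets have been downloaded as JPEGs into a local folder (the folder name is a placeholder); it computes each sheet’s mean RGB with Pillow and NumPy, then groups sheets with k-means:

        from pathlib import Path

        import numpy as np
        from PIL import Image  # pip install pillow numpy scikit-learn
        from sklearn.cluster import KMeans

        # Placeholder folder of downloaded Sanborn sheet images.
        sheet_paths = sorted(Path("sanborn_sheets").glob("*.jpg"))

        # Mean RGB per sheet; downsampling first keeps this fast on large scans.
        mean_colors = []
        for path in sheet_paths:
            img = Image.open(path).convert("RGB")
            img.thumbnail((512, 512))
            mean_colors.append(np.asarray(img).reshape(-1, 3).mean(axis=0))
        mean_colors = np.array(mean_colors)

        # Group sheets by predominant color; five clusters is just a starting guess
        # (roughly yellow/frame, red/brick, blue/stone or concrete, gray/adobe, green/special).
        kmeans = KMeans(n_clusters=5, n_init=10, random_state=0).fit(mean_colors)
        for path, label, color in zip(sheet_paths, kmeans.labels_, mean_colors):
            print(path.name, label, color.round(1))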

Not-yet-developed bites:

  • Looking at pre-prepped OCR and detecting bad pages (a rough sketch follows below)

    • Detect unusable OCR
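
      One very rough heuristic sketch: score each page by the share of tokens that look like plausible words and flag pages below a threshold (the token pattern, threshold, and sample pages are all assumptions):

        import re

        def ocr_quality_score(text: str) -> float:
            """Fraction of tokens that look like plausible English words (very rough)."""
            tokens = re.findall(r"\S+", text)
            if not tokens:
                return 0.0
            plausible = sum(1 for t in tokens if re.fullmatch(r"[A-Za-z][a-z'\-]{1,20}[.,;:!?]?", t))
            return plausible / len(tokens)

        # Placeholder pages; in practice, loop over the pre-prepped OCR text files.
        pages = {
            "page_001.txt": "The quick brown fox jumped over the lazy dog.",
            "page_002.txt": "~~!! xj#% qq0 lllll ___ 8r@ t5t5t5",
        }
        for name, text in pages.items():
            score = ocr_quality_score(text)
            flag = "LIKELY UNUSABLE" if score < 0.5 else "ok"  # threshold is a guess
            print(f"{name}: {score:.2f} {flag}")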

  • Prints and Photographs materials and object detection
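
    A hedged starting point, assuming you just want to see what an off-the-shelf, COCO-pretrained detector finds in a downloaded image (the image path is a placeholder, and COCO’s everyday-object categories will only partly fit historical prints and photographs):

      import torch
      from PIL import Image
      from torchvision.models.detection import FasterRCNN_ResNet50_FPN_Weights, fasterrcnn_resnet50_fpn
      from torchvision.transforms.functional import to_tensor  # pip install torch torchvision pillow

      # Load a COCO-pretrained detector and its category names.
      weights = FasterRCNN_ResNet50_FPN_Weights.DEFAULT
      model = fasterrcnn_resnet50_fpn(weights=weights).eval()
      categories = weights.meta["categories"]

      # Placeholder path to a downloaded Prints & Photographs image.
      img = Image.open("photo.jpg").convert("RGB")
      with torch.no_grad():
          prediction = model([to_tensor(img)])[0]

      for label_idx, score in zip(prediction["labels"], prediction["scores"]):
          if score > 0.6:  # arbitrary confidence cutoff
              print(categories[int(label_idx)], round(float(score), 2))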