Introduction

This tutorial provides an example of data processing and visualization using transcription data from By the People, the Library of Congress’s crowdsourced transcription program. The code, shared in Jupyter notebooks, first cleans and processes transcriptions from four By the People datasets. It then creates two visualizations from the data. By the People datasets all have the same structure, so the code in this tutorial can be modified to process and visualize additional By the People datasets.

About the data

By the People invites anyone to contribute to the Library of Congress as a virtual volunteer by transcribing and reviewing digitized historical documents to enhance Library of Congress digital collections. The Library of Congress thanks all By the People volunteers for sharing their time and knowledge with us to make this tutorial possible.

By the People transcriptions and tags are created by anonymous and registered volunteers. Once a transcription is finished, it must be reviewed by a registered volunteer. A transcription may undergo multiple rounds of edits before being completed. Finally, transcriptions are spot-checked by Library of Congress subject matter experts before they are incorporated into the digital collections on the Library’s website (loc.gov) to enhance search and accessibility. Transcriptions are also packaged into .csv files and made available as datasets as part of the Selected Datasets Collection.

In this tutorial we will work with four datasets related to the movement for women’s suffrage in the United States.

Each dataset’s README includes additional information about its content and creation.

Transcription volunteers are instructed to transcribe the text as written, including misspellings and abbreviations. Formatting is generally not preserved, with the exception of line breaks. Minimal markup includes “?” for illegible or unclear text, square brackets around deleted text, and square brackets with asterisks around marginalia ([*example*]). Pages without text are marked “nothing to transcribe” and do not have transcriptions on loc.gov.
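As an illustration only (this clean-up is not part of the notebooks), the sketch below uses Python’s re module to strip deleted text and marginalia from a hypothetical transcription string; marginalia is removed first so its brackets are not caught by the plain-bracket pattern.

```python
import re

# Hypothetical transcription snippet using the markup conventions above.
sample = "We [*note in margin*] demand [struck out] the right to vote"

# Remove marginalia ([*...*]) first, then deleted text ([...]).
text = re.sub(r"\[\*.*?\*\]", "", sample)
text = re.sub(r"\[.*?\]", "", text)

print(" ".join(text.split()))
# We demand the right to vote
```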

By the People datasets contain the following fields (a loading sketch follows the list):

  • Campaign – this is the highest hierarchical level in the arrangement of collections on By the People (example: Susan B. Anthony Papers). This field displays the campaign’s title.

  • Project – this is the second-highest hierarchical level of collections on By the People. Projects may map to an existing subset of a digital collection, such as an archival series, or may be a grouping of related items uniquely organized for By the People. This field displays the project’s title.

  • Item – this is the third-highest hierarchical level of collections on By the People, typically representing a folder, letter, document, or diary. This field displays the item title.

  • ItemId – this is the identifier for the item (see above for definition). This numerical identifier is consistent across the By the People website and loc.gov. The item and its metadata are usually located on the Library’s website at https://www.loc.gov/item/[ItemId]/

  • Asset – this is the identifier for the individual asset image. It is also referred to colloquially as the “page” by By the People volunteers and Community Managers. This identifier is used in the By the People site and on loc.gov.

  • AssetStatus – this indicates the status of the asset in the peer review workflow: Not Started, In Progress, Needs Review, or Completed. Dataset assets will always be marked as “Completed.”

  • DownloadURL – this link provides access to the image file for the Asset from which the transcription was created.

  • Transcription – this is the text created by the By the People volunteers, representing the written content of the DownloadURL image and corresponding to the Asset. This field will be blank for assets that volunteers marked “Nothing to transcribe”.

  • Tags – these are all the tags that have been applied to the asset. If there is more than one tag, the tags are delimited by a semicolon and space.
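To see these fields in practice, here is a minimal sketch of loading a dataset with pandas; the file name is hypothetical and should be replaced with any By the People CSV from the data directory.

```python
import pandas as pd

# Hypothetical file name; substitute a By the People dataset CSV
# from the data directory.
df = pd.read_csv("data/example-btp-dataset.csv")

# The columns correspond to the fields described above: Campaign,
# Project, Item, ItemId, Asset, AssetStatus, DownloadURL,
# Transcription, Tags.
print(df.columns.tolist())

# Transcription is blank for assets marked "Nothing to transcribe".
print(df["Transcription"].isna().sum(), "assets without transcriptions")
```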

About the notebooks

The Jupyter notebooks can be downloaded through a Library of Congress GitHub repository. This tutorial first cleans and processes transcriptions from the four datasets using Pandas and the spaCy Natural Language Processing library. This code tokenizes the transcriptions, breaking the strings of text into tokens (words) that will be further analyzed. It then identifies the lemma, or root, for each word. For example, the lemma of “voted” is “vote”, and the lemma of “women” is “woman”. The code next iterates over each token to produce a list of lemmas from the original transcriptions that excludes stop words, punctuation, numbers, and words that volunteers were unable to fully transcribe, which are designated with “?”. Stop words are commonly used words, such as “the”, “a”, or “is”.
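The following is a minimal sketch of that cleaning step, assuming the en_core_web_sm model is installed; the notebooks themselves may use a different model or filtering logic.

```python
import spacy

# Assumes the small English model is installed:
#   python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

def clean_lemmas(text):
    """Tokenize and lemmatize, dropping stop words, punctuation,
    numbers, and tokens containing "?" (partially illegible text)."""
    return [
        token.lemma_.lower()
        for token in nlp(text)
        if not token.is_stop
        and not token.is_punct
        and not token.like_num
        and "?" not in token.text
    ]

print(clean_lemmas("The women voted for suffrage."))
# ['woman', 'vote', 'suffrage']
```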

The tutorial then creates two visualizations from the cleaned data using the Matplotlib and NumPy Python libraries. The first is a combined bar graph showing the five most used words in each of the four datasets. The second is a focused look at the “Speeches” series from the Susan B. Anthony Papers. Using data from a typed inventory of speeches found in the collection, this code groups Anthony’s speeches by year, then plots the yearly usage of the top five words in her speeches.
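As a sketch of the first visualization, the snippet below counts lemmas with collections.Counter and plots the five most frequent with Matplotlib; the lemma list is a stand-in for the cleaned transcription output of one dataset.

```python
from collections import Counter
import matplotlib.pyplot as plt

# Stand-in lemma list; in the notebooks this comes from the cleaned
# transcriptions of a dataset.
lemmas = ["woman", "vote", "suffrage", "woman", "right",
          "vote", "woman", "suffrage", "state", "right"]

# Count lemma frequencies and keep the five most common.
words, counts = zip(*Counter(lemmas).most_common(5))

plt.bar(words, counts)
plt.title("Five most used words")
plt.ylabel("Count")
plt.tight_layout()
plt.show()
```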

Running the notebooks

To run a Jupyter notebook, navigate to the directory that contains the notebook files with cd /path/to/dcm-btp-notebooks, then run the command jupyter notebook. This will launch the Notebook Dashboard in a web browser.

To run these notebooks properly, make sure that the appropriate Python libraries are installed; further information can be found in the README file. The dataset files are already included in this tutorial in the data directory (along with each dataset’s README), which can be seen in the Notebook Dashboard.

An entire notebook can be run by clicking Run in the menu bar. Individual cells can be run by clicking into the cell, then hitting Shift + Enter. The notebooks in this tutorial must be run in order, from 2 to 5.

These notebooks contain optional code that can be run to print results to the notebook. This helps show what the code is doing at each step. These cells have “Optional:” in the title. Add # to the beginning of a line of code to comment out the lines you do not want to run.
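For example, prefixing a line with # in a cell like this hypothetical one keeps that line from running:

```python
# print(lemmas[:25])  # Optional preview, commented out: will not run
print(len(lemmas))    # This line still runs
```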

The outputs from the tutorial will be saved to the outputs directory, which can be seen in the Notebook Dashboard.

Data processing

These notebooks contain the code necessary to load and transform transcription data; they are divided into the following parts:

  1. Import statements, function definitions, and constants,

  2. Text tokenization for all datasets,

  3. Data processing for the transcription data, and

  4. Visualization of the transcription data.

In GitHub Pages, these parts are accessible in the side navigation panel.

Instructions for running the code

  1. Run the notebooks and cells in order.

  2. Optional code is available for previewing the data during each step of processing; these lines are not necessary for processing the data, but may be useful for those who wish to see how the data changes.

Authorship and use

These notebooks were created by Dave Durden and Madeline Goebel, Digital Collection Specialists at the Library of Congress. They are marked with a CC0 1.0 Universal public domain dedication.

All contributions to the By the People application are released into the public domain as they are created. Anyone can use and re-use the datasets in any way they want.