Extracting location data from the loc.gov API for geovisualization#
Digital mapping has become an increasingly accessible and valuable complement to traditional interpretive narratives. Working with spatially referenced data offers exciting possibilities for place-based scholarship, outreach, and teaching. It’s also a perfect avenue for interdisciplinary collaboration — between, say, humanities researchers new to GIS and spatial scientists who’ve been using it for decades.
Embedded within digital collections available from the Library of Congress website are geographic data, including the locations of items and their local contexts. We can gather those data programmatically (using Python in this case) and plot them on a map, like so:
The story of these data would be incomplete, however, without a critical understanding of the history behind their collection and stewardship. In this tutorial, we demonstrate how loc.gov JSON API users can find and store spatial information from Library content with an awareness of data quality and provenance, and of why this broadened scope is important for informing research projects at the Library.
Rights and access
Rights and restrictions, including copyright, affect how you can use images, particularly if you want to publish, display, or otherwise distribute them. You can read more about copyright and other restrictions that apply to the publication/distribution of images from the Prints & Photographs Division (P&P) at this link: https://www.loc.gov/rr/print/195_copr.html
The records in the case study that follows were created for the U.S. Government and are considered to be in the public domain. It is understood that access to this material rests on the condition that should any of it be used in any form or by any means, the author of such material and the Historic American Engineering Record of the Heritage Conservation and Recreation Service at all times be given proper credit.
Data quality
Consistency and accuracy of geographic information stored with digital content on the Library of Congress website varies across and within collections. This tutorial was designed as a method of exploring existing spatial references for items. Finding and analyzing meaningful patterns from secondary data typically requires additional data corrections, contexts, and geocoding frameworks to ensure optimal coverage and accuracy. The same applies to spatial data analysis involving digital collections at the Library of Congress; in practice, much of the time spent analyzing and interpreting data goes into cleaning and grappling with it. Rather than gloss over that critical stage of research, here we elect to revel in it, both to demonstrate a particular subset of techniques and to appreciate its nuances and implications.
Demo Data#
As a graduate student of applied urban science, I was inspired at the outset of my internship with LC Labs to discover content about the built environment across US cities on the Library website. What I found was an expansive dataset of digitized photographs, drawings and reports recognized collectively as the HHH, which includes material from three programs:
Historic American Buildings Survey (HABS)
Historic American Engineering Record (HAER)
Historic American Landscapes Survey (HALS)
More about HHH: https://www.loc.gov/collections/historic-american-buildings-landscapes-and-engineering-records/about-this-collection/
Browsing the collection:
Example of material found in the collection:
View of the uptown platform at 79th Street. Photo by David Sagarin for the Historical American Engineering Record, Library of Congress, Prints and Photographs Division, August 1978.
I decided to dig a bit deeper into the engineering record because it appeared to have the best coverage of the three for spatial references. The Historic American Engineering Record, or HAER, was established in partnership by the National Park Service (NPS), the American Society of Civil Engineers and the Library of Congress in 1969. There are more than 10,000 HAER surveys of historic sites and structures related to engineering and industry. The collection is an ongoing effort with established guidelines for documentation – HAER was created to preserve these structures through rule-based documentation, and those documents have in turn been preserved through time.
Read more about HAER Guidelines here: https://www.nps.gov/hdp/standards/haerguidelines.htm
Context#
As happens often with classifications, there are records that could rightly be placed under more than one program. Many sites documented under the Historic American Buildings Survey (HABS) would, if recorded today, be assigned to HAER; bridges are one key example. Before HAER was established, engineering-related structures, manufacturing and industrial sites, processes, watercraft, bridges, vehicles, and the like were documented under HABS. What I’ve aimed to develop here, as a result of both my own investigation and conversations with Library staff, is a flexible and reproducible approach to looking at the implications for scholarship of changes to preservation practice over time.
Geography was actually the motivating factor behind how these materials were originally organized by NPS and the Library of Congress. In the 1930s, HABS surveyors were organized into district offices that documented one or more states. The materials that they created usually (not always, which is a different problem) included state/county/city. While the Park Service played a leading role in documenting geographic data, the Library of Congress also shaped the material in the process of archiving the surveys.
The Library of Congress uses two systems to organize HABS/HAER/HALS documentation. The newer system uses the survey number as the call number; the older system assigned each survey a call number based on its location (state/county/city). For example, HABS AL-654 has the call number HABS ALA,1-PRAVI.V,1-, whose parts break down as follows (see also the short parsing sketch after this list):
1 = Autauga County. (Each state’s counties are assigned numbers in alphabetical order.)
PRAVI = Prattville.
.v = in the vicinity of a given city/town.
1- = first place in the vicinity of Prattville surveyed.
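As a purely illustrative aside (not an official Library parsing scheme), the location-based call number can be pulled apart with ordinary string operations; the variable names below are my own:
# purely illustrative: split the older, location-based call number into its parts
call_number = "HABS ALA,1-PRAVI.V,1-"
survey_program, location_code = call_number.split(" ", 1)   # 'HABS', 'ALA,1-PRAVI.V,1-'
state_abbrev, rest = location_code.split(",", 1)            # 'ALA', '1-PRAVI.V,1-'
county_number, place_code = rest.split("-", 1)              # '1' (Autauga County), 'PRAVI.V,1-'
print(survey_program, state_abbrev, county_number, place_code)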
Some places documented in rural and unincorporated areas (and even some in urban areas) have good site maps, UTM coordinates, or decimal degree data; some don’t. When a place can’t be located, even to the vicinity, or in the rare cases when an address is restricted, city centroid points are used. The National Park Service’s Cultural Resources GIS Program is currently working on a project to create an enterprise dataset that includes all HABS/HAER/HALS surveys, as it has done for the National Register of Historic Places.
The NPS guidelines for surveys didn’t initially include spatial data in the way that it exists today. Good site maps are the best data available for surveys from the 1930s. HABS (and HAER) guidelines were later updated to request Universal Transverse Mercator (UTM) coordinates, a global system of grid-based mapping references. All three programs now ask for decimal degree data (in order to comply with the NPS Cultural Resource Spatial Data Transfer Standards), though they still receive data in UTM (and in some cases no geographic reference at all).
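To get a concrete feel for the relationship between the two reference systems, here is a minimal sketch (not part of the tutorial’s workflow) that converts a hypothetical UTM coordinate into decimal degrees. It assumes the third-party pyproj package is installed; the zone and easting/northing values are illustrative only.
# illustrative only: convert a hypothetical UTM easting/northing (zone 16 north, WGS84)
# into the decimal-degree latitude/longitude the NPS standards now request
from pyproj import Transformer
utm_to_lonlat = Transformer.from_crs("EPSG:32616", "EPSG:4326", always_xy=True)
easting, northing = 549000, 3592000   # hypothetical survey point
lon, lat = utm_to_lonlat.transform(easting, northing)
print(round(lat, 5), round(lon, 5))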
Data transmitted to the Library of Congress by contributor Justine Christianson, a HAER Historian with the National Park Service, is particularly rich for our purpose of visualizing the spatial distribution of collection items. Over the past few years, she has reviewed all of the HAER records to index them and assign decimal degree coordinates. Because she has often been involved in finalizing HAER documentation before it goes to the Library, she is often listed as a contributor in the record metadata. Justine has created spatial data (not to mention other data improvements) for many more HAER surveys than her name is attached to. The subset chosen for this tutorial, then, reflects a certain signature of her involvement in developing standards of documentation, verifying historical reports, and performing scholarly research on material in the collection - nearly 1,500 items with accurate latitude and longitude attributes polished and preserved.
Tutorial#
The following guide will demonstrate how to plot a selection of items from the Historic American Engineering Record (HAER) on a map. With minor changes, the same process of spatial data extraction and visualization could be applied to other digital collections containing explicit geographic information at the Library of Congress.
Loading in packages
The recommended convention in Python’s own documentation is to import everything at the top, and on separate lines. For this tutorial, we’ll be importing three packages into the notebook:
To get our data from the digitized HAER collection, we’ll use the requests Python module to access the loc.gov JSON API.
Reading in coordinates means our data needs to be re-organized - a task for the popular analysis package, pandas.
Finally, we’ll do our visualization with folium to plot the locations on an interactive Leaflet map.
Folium is a Python wrapper for a tool called Leaflet.js. With minimal instructions, it does a bunch of open-source JavaScript work in the background, and the result is a mobile-friendly, interactive Leaflet map containing the data of interest.
import requests
import pandas as pd
import folium
Gathering item geography
Getting up to speed with the loc.gov JSON API and Python to access the collection was a breeze, thanks to the existing data exploration resources on the LC for Robots page: https://labs.loc.gov/lc-for-robots/
In particular, you can find tips on using the loc.gov JSON API in the ‘Accessing images for analysis’ notebook created by Laura Wrubel, which we’ll build on in the next steps: https://github.com/LibraryOfCongress/data-exploration/blob/master/loc.gov JSON API/Accessing images for analysis.ipynb
# Many of the prints & photographs in HAER are tagged with geographic coordinates ('latlong')
# Using the requests package we imported, we can easily 'get' data for an item as JSON and parse it for our latlong:
get_any_item = requests.get("https://www.loc.gov/item/al0006/?fo=json")
print('latlong: {}'.format(get_any_item.json()['item']['latlong']))
latlong: 32.45977,-86.47767
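Not every item in the collection carries a 'latlong' value, so before scaling up it can help to probe a single record defensively. Here is a minimal sketch of the same lookup, using .get() so a missing field returns None instead of raising an error:
# a defensive version of the same lookup: .get() returns None if 'latlong' is absent
item_data = requests.get("https://www.loc.gov/item/al0006/?fo=json").json()
latlong = item_data['item'].get('latlong')
if latlong:
    print('latlong: {}'.format(latlong))
else:
    print('no latlong recorded for this item')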
# To retrieve this sort of data point for a set of search results, we'll first use Laura's get_image_urls function.
# This will allow us to store the web address for each item in a list, working through the search page by page.
def get_image_urls(url, items=[]):
    '''
    Retrieves the image_urls for items that have public URLs available.
    Skips over items that are for the collection as a whole or web pages about the collection.
    Handles pagination.
    '''
    # request pages of 100 results at a time
    params = {"fo": "json", "c": 100, "at": "results,pagination"}
    call = requests.get(url, params=params)
    data = call.json()
    results = data['results']
    for result in results:
        # don't try to get images from the collection-level result or collection web pages
        original_format = result.get("original_format", [])
        if "collection" not in original_format and "web page" not in original_format:
            # store the item's URL (its 'id' field) in our running list
            item = result.get("id")
            items.append(item)
    if data["pagination"]["next"] is not None: # make sure we haven't hit the end of the pages
        next_url = data["pagination"]["next"]
        #print("getting next page: {0}".format(next_url))
        get_image_urls(next_url, items)
    return items
To demonstrate with our subset of HAER items listed under ‘Justine Christianson’, I’ll use a search that targets items from HAER with her name listed as the contributor.
url = "https://www.loc.gov/search/?fa=contributor:christianson,+justine&fo=json"
# This is the base URL we will use for the API requests we'll be making as we run the function.
Now we can apply Laura’s get_image_urls function to our search results URL, formatted in JSON, to get a list of image URLs:
# retrieve all image URLs from the search results and store in a variable called 'image_urls'
image_urls = get_image_urls(url, items=[])
# how many URLs did we get?
len(image_urls)
1533
# to save a little time, let's work with a slice of 100 of the URLs (items 200 through 299)
img100 = image_urls[200:300]
len(img100)
100
# create an empty set to store our latlongs
# storing in a set rather than a list eliminates any potential duplicates
spatial_set = set()
# the parameters for our API calls, the same as in the first function
p1 = {"fo" : "json"}
# loop through the item URLs
for img in img100:
    # make an HTTP request to the loc.gov API for each item
    r = requests.get(img, params=p1)
    # extract only from items with a latlong attribute
    try:
        # expose in JSON format
        data = r.json()
        # parse for location
        results = data['item']['latlong']
        # add it to our running set
        spatial_set.add(results)
    # skip anything with missing 'latlong' data (or a response that isn't valid JSON)
    except (KeyError, ValueError):
        # on to the next item until we're through
        pass
# show us the data!
spatial_set
{'20.9175,-156.3258333',
'30.598468,-103.892334',
'32.7253908733282,-114.616614456155',
'32.776453,-79.932086',
'32.946608,-85.984089',
'34.661847,-86.670258',
'34.7694444444,-92.2672222222',
'36.862928,-112.740538',
'37.7620889746898,-119.860731072203',
'37.80931,-122.42131',
'38.267897,-78.866949',
'38.290343,-85.821589',
'38.324232,-76.460605',
'38.578506,-77.17950472741472',
'38.578741,-77.17850169457892',
'38.578893,-77.17804715345169',
'38.590956,-77.17139696665106',
'38.684068,-77.124186',
'38.684596,-77.131794',
'38.9001118,-77.0163653',
'38.936015,-74.898445',
'39.477233,-77.7407',
'39.801913,-77.249739',
'40.13333,-76.05834',
'40.193095,-74.743825',
'40.391357,-79.85933180595799',
'40.4956,-77.47261',
'40.5766979029568,-74.4126481359476',
'40.7018369,-89.5636558',
'40.782493,-74.235895',
'41.292843,-72.92644479191337',
'41.379049,-72.097085',
'41.63922,-80.15737',
'41.650368,-93.73616221287342',
'41.885388,-87.800535',
'42.35921,-78.12151',
'42.371952,-71.059094',
'42.374807,-83.05761',
'42.380671,-83.057264',
'42.5760314,-111.7305061',
'42.90639,-78.90194',
'42.947758,-71.62145',
'43.0222765,-78.8742332',
'43.0600164,-76.1641668',
'43.08594,-70.76081',
'43.1754698,-78.6878486',
'43.1756188,-78.687883',
'43.2007619,-78.5681136',
'43.21559,-77.935409',
'43.22087,-70.85535',
'43.253566,-78.246907',
'43.2564305,-73.584969',
'43.2640534,-73.57658',
'43.548815,-73.402037',
'44.04996,-71.68708',
'44.559532,-68.799175',
'45.3180555556,-85.2583333333',
'46.687148,-89.229613',
'47.499066,-101.374345',
'47.501917,-101.429559',
'47.502163,-101.431656',
'58.0178,-152.7655',
'61.22948,-149.87333',
'62.419558,-150.1220661848771',
'63.973195,-145.72642144558296',
'65.30522,-143.15401'}
# how many unique data points were we able to gather?
len(spatial_set)
66
Pausing for reflection
So out of the sample of 100 HAER item URLs that we looped through to pull out spatial references, we ended up with a set of 66 unique latitude and longitude pairs. Not bad! This is certainly not perfect as far as data coverage is concerned, but given what we learned earlier about the lineage of preservation in this collection and the dynamics of its stewardship, we have enough information for a meaningful demonstration, and reasonable confidence in the quality of that data, to proceed.
Something to notice however is how these data are currently formatted. Each latitude and longitude pair is glued together as a single string. This isn’t how Folium will want to read in coordinates, so as a next step we’ll need to rework them a bit before we get to mapping.
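Before reaching for pandas, here is the transformation in miniature, sketched on the single pair we retrieved for item al0006 earlier; it shows what needs to happen to every entry in the set:
# split one 'lat,long' string on the comma and convert each half to a number
example = '32.45977,-86.47767'
lat, lon = (float(part) for part in example.split(','))
print(lat, lon)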
Data manipulation
We’ve mined out the locations of a digital subset from the HAER collection. Now we’ll restructure it with the popular pandas package.
# convert latlong set to list
latlong_list = list(spatial_set)
# convert list to pandas dataframe
df = pd.DataFrame(latlong_list)
# split coordinates into two columns
df = df[0].str.split(',', expand=True)
# rename columns with latitude and longitude
df = df.rename(columns={0:'latitude', 1:'longitude'})
# what's the dataframe look like?
df
latitude | longitude | |
---|---|---|
0 | 43.2564305 | -73.584969 |
1 | 44.559532 | -68.799175 |
2 | 43.22087 | -70.85535 |
3 | 39.477233 | -77.7407 |
4 | 38.936015 | -74.898445 |
5 | 41.63922 | -80.15737 |
6 | 65.30522 | -143.15401 |
7 | 20.9175 | -156.3258333 |
8 | 41.379049 | -72.097085 |
9 | 47.499066 | -101.374345 |
10 | 30.598468 | -103.892334 |
11 | 38.684596 | -77.131794 |
12 | 43.08594 | -70.76081 |
13 | 38.267897 | -78.866949 |
14 | 42.380671 | -83.057264 |
15 | 41.292843 | -72.92644479191337 |
16 | 47.501917 | -101.429559 |
17 | 36.862928 | -112.740538 |
18 | 37.80931 | -122.42131 |
19 | 40.13333 | -76.05834 |
20 | 42.371952 | -71.059094 |
21 | 32.7253908733282 | -114.616614456155 |
22 | 34.661847 | -86.670258 |
23 | 62.419558 | -150.1220661848771 |
24 | 40.5766979029568 | -74.4126481359476 |
25 | 40.391357 | -79.85933180595799 |
26 | 38.590956 | -77.17139696665106 |
27 | 42.5760314 | -111.7305061 |
28 | 43.548815 | -73.402037 |
29 | 43.2640534 | -73.57658 |
... | ... | ... |
36 | 43.2007619 | -78.5681136 |
37 | 38.290343 | -85.821589 |
38 | 38.9001118 | -77.0163653 |
39 | 61.22948 | -149.87333 |
40 | 58.0178 | -152.7655 |
41 | 42.374807 | -83.05761 |
42 | 37.7620889746898 | -119.860731072203 |
43 | 45.3180555556 | -85.2583333333 |
44 | 41.650368 | -93.73616221287342 |
45 | 63.973195 | -145.72642144558296 |
46 | 47.502163 | -101.431656 |
47 | 46.687148 | -89.229613 |
48 | 40.782493 | -74.235895 |
49 | 40.193095 | -74.743825 |
50 | 32.946608 | -85.984089 |
51 | 38.578506 | -77.17950472741472 |
52 | 42.35921 | -78.12151 |
53 | 43.0600164 | -76.1641668 |
54 | 41.885388 | -87.800535 |
55 | 44.04996 | -71.68708 |
56 | 42.90639 | -78.90194 |
57 | 43.21559 | -77.935409 |
58 | 40.4956 | -77.47261 |
59 | 39.801913 | -77.249739 |
60 | 38.684068 | -77.124186 |
61 | 43.253566 | -78.246907 |
62 | 38.578893 | -77.17804715345169 |
63 | 32.776453 | -79.932086 |
64 | 34.7694444444 | -92.2672222222 |
65 | 43.1756188 | -78.687883 |
66 rows × 2 columns
Pausing for reflection
An interesting thing to note from the dataframe is the number of decimal places across our data points: some of the coordinates are far more precise than others! This provides another glimpse of how changes in technology and methodology over the years can leave their trace in the digital footprint of a collection. I don’t expect this discrepancy to affect our ability to examine the spatial distribution of the material, however.
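If you’re curious, one rough way to quantify that discrepancy (assuming the dataframe above, where the coordinates are still stored as strings) is to count the digits after the decimal point:
# count the digits after the decimal point in each longitude string
decimal_places = df['longitude'].str.split('.').str[1].str.len()
print(decimal_places.describe())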
At this stage you could export your tables of coordinates to combine with existing projects, visualize with other software, etc. I’ve “commented out” the following line since I already have a copy of “haer_sample.csv” on my machine, but if you’ve downloaded this notebook and are following along cell by cell, just remove the pound sign and run it to export the file for yourself.
# df.to_csv('haer_sample.csv')
Since we’re working in a Jupyter notebook, we can just read back in the .CSV file and make a map without leaving the page. Once we have our data back into the format we want for mapping, we’ll be ready to make our spatial visualization.
# read the spreadsheet back into a pandas dataframe, keeping only the latitude and longitude columns (column 0 is the exported index)
latlong_df = pd.read_csv('files/haer_sample.csv', usecols=[1,2])
Geovisualization#
As was called out at the start of the tutorial, the open-source tool folium builds on our earlier data wrangling with pandas and the mapping strengths of the Leaflet.js library to create an interactive experience.
# convert pandas dataframe back to a list for folium
latlong_list = latlong_df.values.tolist()
# picking a spot in the midwest to center our map around
COORD = [35.481918, -97.508469]
# uses lat then lon - the bigger the zoom number, the closer in you get
map_haer = folium.Map(location=COORD, zoom_start=3)
# add a marker to the base leaflet map for every latlong pair in our list
for latlong in latlong_list:
    folium.CircleMarker(latlong, radius=1, color='#0080bb', fill_color='#0080bb').add_to(map_haer)
# calls the map into display
map_haer
Learn more about tailoring your own interactive map experience using the Folium documentation: http://folium.readthedocs.io/en/latest/
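As one optional variation (not part of the tutorial proper), the sketch below rebuilds the map with a popup label on each marker and saves it as a standalone HTML file you could open in any browser or embed elsewhere; the file name is my own choice:
# optional variation: same markers, plus a popup showing each coordinate pair,
# exported as a standalone HTML file
map_haer_popups = folium.Map(location=COORD, zoom_start=3)
for lat, lon in latlong_list:
    folium.CircleMarker(
        [lat, lon],
        radius=2,
        color='#0080bb',
        fill=True,
        fill_color='#0080bb',
        popup='{}, {}'.format(lat, lon)   # shows the coordinate pair on click
    ).add_to(map_haer_popups)
map_haer_popups.save('haer_map.html')   # open in any browser; no notebook required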
In conclusion...#
As tools for digital scholarship improve, proliferate, and take hold, we will continue to see interesting questions emerge that make use of new and existing spatial data. GIS has proven its capability to expand humanities research; however, many humanists have yet to incorporate this “spatial turn” into their research. Answering questions about why and how researchers can actually use such data requires a critical understanding of these tools and their outputs.
This tutorial was developed to help beginners get things done when integrating their disciplinary information into a geospatial format. Using digital collections from the Library of Congress API, we touched on foundational data skills: how to collect and organize historic GIS data, how to deal with data in different formats, how to clean up data, and how to visualize disciplinary data with an interactive digital map. What ties this all together, though, is the ability to evaluate the quality of external data with respect to the motivation, context, and change over time of its stewardship. This type of awareness, while familiar to those engaged in more traditional research, is a critical piece of any project looking to leverage ever-growing and diversifying digital resources.
Credits
I’d like to thank Mary McPartland from NPS and Kit Arrington from the Library of Congress for their guidance on the HABS/HAER/HALS collections and the nuanced history (and future!) of their evolution. I’d also like to acknowledge Laura Wrubel, whose LC Labs resources were instrumental in setting up my own projects, and Meghan Ferriter, my coach and mentor throughout the LC Labs internship.