LoC Data Package Tutorial: Sanborn Maps collection#
This notebook will demonstrate basic usage of using Python for interacting with data packages from the Library of Congress via the Sanborn Maps data package which is derived from the Library’s Sanborn Maps collection. We will:
Prerequisites#
In order to run this notebook, please follow the instructions listed in this directory’s README.
Output data package summary#
First, we will select Sanborn Maps data package and output a summary of it’s contents
import io
import pandas as pd # for reading, manipulating, and displaying data
import requests
from helpers import get_file_stats
DATA_URL = 'https://data.labs.loc.gov/sanborn/' # Base URL of this data package
# Download the file manifest
file_manifest_url = f'{DATA_URL}manifest.json'
response = requests.get(file_manifest_url, timeout=300)
response_json = response.json()
files = [dict(zip(response_json["cols"], row)) for row in response_json["rows"]] # zip columns and rows
# Convert to Pandas DataFrame and show stats table
stats = get_file_stats(files)
pd.DataFrame(stats)
FileType | Count | Size | |
---|---|---|---|
0 | .jpg | 607,807 | 205.28GB |
1 | .txt | 85 | 107MB |
Query the metadata in a data package#
Next we will download a data package’s metadata, print a summary of the items’ State values, then filter by a particular State.
All data packages have a metadata file in .json and .csv formats. Let’s load the data package’s metadata.json
file:
metadata_url = f'{DATA_URL}metadata.json'
response = requests.get(metadata_url, timeout=60)
data = response.json()
print(f'Loaded metadata file with {len(data):,} entries.')
Loaded metadata file with 50,600 entries.
Next let’s convert to pandas DataFrame and print the available properties
df = pd.DataFrame(data)
print(', '.join(df.columns.to_list()))
Date, Digitized, Id, IIIF_manifest, Language, Last_updated_in_api, Location_text, City_text, County_text, State_text, Country_text, Location_secondary_text, Mime_type, Notes, Number_of_files, Online_format, Original_format, Part_of, Preview_url, Repository, Source_collection, Subject_headings, Title, Type_of_resource, Location
Next print the top 10 most frequent States in this dataset
# Since "State_text" is a list, we must "explode" it so there's just one state per row
# We convert to DataFrame so it displays as a table
df['State_text'].explode().value_counts().iloc[:10].to_frame()
State_text | |
---|---|
New York | 3693 |
Pennsylvania | 3056 |
Illinois | 2878 |
California | 2641 |
Ohio | 2412 |
Texas | 2238 |
Michigan | 1962 |
Indiana | 1709 |
Oklahoma | 1707 |
Missouri | 1700 |
Now we filter the results to only those items with State “Oklahoma”
df_by_state = df.explode('State_text')
subset = df_by_state[df_by_state.State_text == 'Oklahoma']
print(f'Found {subset.shape[0]:,} items with state "Oklahoma"')
Found 1,707 items with state "Oklahoma"
Download and display images#
Finally we will download and display the images associated with the first item in the Oklahoma subset.
First we will merge the metadata with the file manifest to link the file URL to the respective item.
df_files = pd.DataFrame(files)
subset_with_images = pd.merge(subset, df_files, left_on='Id', right_on='item_id', how='left')
print(f'Found {subset_with_images.shape[0]:,} Oklahoma image files')
Found 7,788 Oklahoma image files
Finally, let’s output the first four images as a grid
import math
from IPython.display import display # for displaying images
from PIL import Image # for creating, reading, and manipulating images
count = 4
images = subset_with_images.head(count).reset_index()
# Define image dimensions
image_w = 600
image_h = 600
cols = math.ceil(count / 2.0)
rows = math.ceil(count / 2.0)
cell_w = image_w / cols
cell_h = image_h / rows
# Create base image
base_image = Image.new("RGB", (image_w, image_h))
# Loop through image URLs
for i, row in images.iterrows():
file_url = f'https://{row["object_key"]}'
# Downoad the image to memory
response = requests.get(file_url, timeout=60)
image_filestream = io.BytesIO(response.content)
# And read the image data
im = Image.open(image_filestream)
# Resize it as a thumbnail
im.thumbnail((cell_w, cell_h))
tw, th = im.size
# Position it
col = i % cols
row = int(i / cols)
offset_x = int((cell_w - tw) * 0.5) if tw < cell_w else 0
offset_y = int((cell_h - th) * 0.5) if th < cell_h else 0
x = int(col * cell_w + offset_x)
y = int(row * cell_h + offset_y)
# Paste it
base_image.paste(im, (x, y))
# Display the result
display(base_image)