LoC Data Package Tutorial: Austro-Hungarian maps set#

This notebook demonstrates basic usage of Python for interacting with data packages from the Library of Congress, using the Austro-Hungarian map set, which is derived from the Library’s Austria-Hungary Topographic Set Maps. We will:

  1. Output a summary of the contents of this data package

  2. Read and query metadata from a data package

  3. Download and display images from a data package

Prerequisites#

In order to run this notebook, please follow the instructions listed in this directory’s README.

Output data package summary#

First, we will select the data package and output a summary of its contents:

import io

import pandas as pd                     # for reading, manipulating, and displaying data
import requests

from helpers import get_file_stats

DATA_URL = 'https://data.labs.loc.gov/austro-hungarian-maps/' # Base URL of this data package

# Download the file manifest
file_manifest_url = f'{DATA_URL}manifest.json'
response = requests.get(file_manifest_url, timeout=60)
response.raise_for_status()  # stop early if the request failed
response_json = response.json()
files = [dict(zip(response_json["cols"], row)) for row in response_json["rows"]] # pair each row's values with the column names

# Convert to Pandas DataFrame and show stats table
stats = get_file_stats(files)
pd.DataFrame(stats)
  FileType  Count        Size
0     .tif  9,881  1,142.47GB
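
The manifest is delivered as a list of column names plus a list of rows, and the `zip` step above turns each row into a dictionary. The sketch below illustrates that conversion on a small made-up manifest, then groups the files by extension. The column names `object_key` and `size`, the sample rows, and the function `summarize_by_extension` are all assumptions made for illustration; this is not the implementation of `get_file_stats`, and the real manifest may use different columns.

```python
from collections import defaultdict
from pathlib import PurePosixPath

# Made-up manifest in the same "cols"/"rows" shape (column names are assumed)
manifest = {
    "cols": ["object_key", "size"],
    "rows": [
        ["data/a/one.tif", 100],
        ["data/a/two.tif", 250],
        ["metadata.csv", 50],
    ],
}

# Pair each row's values with the column names, as in the cell above
files = [dict(zip(manifest["cols"], row)) for row in manifest["rows"]]


def summarize_by_extension(files):
    """Count files and sum sizes (in bytes) per file extension."""
    totals = defaultdict(lambda: {"Count": 0, "Size": 0})
    for f in files:
        ext = PurePosixPath(f["object_key"]).suffix or "(none)"
        totals[ext]["Count"] += 1
        totals[ext]["Size"] += f["size"]
    return dict(totals)


summary = summarize_by_extension(files)
print(summary)
```

Grouping on the extension keeps the summary independent of the directory layout, which is one plausible way to arrive at a per-file-type table like the one shown above.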

Read the metadata in a data package#

First, let’s load the data package’s metadata.csv file:

metadata_url = f'{DATA_URL}metadata.csv'
response = requests.get(metadata_url, timeout=60)
response.raise_for_status()  # stop early if the request failed
metadata_string = response.text

# Read contents as a csv file
df = pd.read_csv(io.StringIO(metadata_string), dtype=str, keep_default_na=False)
print(f'Loaded metadata file with {df.shape[0]:,} entries.')
Loaded metadata file with 9,885 entries.

Next, let’s print the available properties:

print(', '.join(df.columns.to_list()))
filepath, filename, parent_dir, file_format, object_type, tile_id, edition
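
Once the column names are known, the metadata can be queried with ordinary pandas operations. The following is a minimal sketch on a toy DataFrame that reuses some of this package's column names; every value in it (including the `TIFF` format label and the edition numbers) is invented for illustration and does not come from the real `metadata.csv`.

```python
import pandas as pd

# Toy rows: column names match the package, values are invented
sample = pd.DataFrame(
    [
        {"filename": "a.tif", "file_format": "GeoTIFF", "edition": "1"},
        {"filename": "b.tif", "file_format": "GeoTIFF", "edition": "2"},
        {"filename": "c.tif", "file_format": "TIFF", "edition": "1"},
    ],
    dtype=str,
)

# How many entries exist for each file format?
format_counts = sample["file_format"].value_counts()
print(format_counts)

# Combine conditions with & to narrow the selection
first_edition_geotiffs = sample[
    (sample["file_format"] == "GeoTIFF") & (sample["edition"] == "1")
]
print(first_edition_geotiffs)
```

The same two calls should work on the full `df` loaded above, because every column was read as a string.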

Download and visualize GeoTIFF#

Next we will download a GeoTIFF file and visualize it.

First, let’s filter to only the GeoTIFF files and take the first one:

geotiffs = df[df.file_format == 'GeoTIFF']
print(f'Found {geotiffs.shape[0]:,} GeoTIFF files.')
item = geotiffs.iloc[0]
Found 4,877 GeoTIFF files.
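
Each metadata row carries enough information to locate its file. The sketch below assumes files live under `data/<parent_dir>/<filename>` relative to the package base URL, which is the layout this notebook uses when downloading. The helper name `object_url` and the example row are hypothetical, and percent-encoding the path with `quote` is an added precaution rather than something the package is known to require.

```python
from urllib.parse import quote

DATA_URL = "https://data.labs.loc.gov/austro-hungarian-maps/"


def object_url(row, base_url=DATA_URL):
    """Build a download URL from a row's parent_dir and filename fields."""
    path = f"data/{row['parent_dir']}/{row['filename']}"
    # quote() keeps "/" intact and escapes characters such as spaces
    return base_url + quote(path)


# Hypothetical row; real values come from metadata.csv
example_row = {"parent_dir": "example_dir", "filename": "example tile.tif"}
url = object_url(example_row)
print(url)
```

Because a pandas row supports the same key lookup as a dictionary, `object_url(item)` would work for the row selected above.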

Next, download the GeoTIFF file, display it, and visualize a specific band:

from rasterio.io import MemoryFile
from rasterio.plot import show

# Build the file URL from the metadata row and download the file into memory
filename = f"data/{item['parent_dir']}/{item['filename']}"
obj_url = f"{DATA_URL}{filename}"
response = requests.get(obj_url, timeout=60)
response.raise_for_status()  # stop early if the download failed
filestream = io.BytesIO(response.content)

with MemoryFile(filestream) as memfile:
    with memfile.open() as dataset:
        print(f"Opened GeoTIFF file with {dataset.count} bands and dimensions: {dataset.width:,} x {dataset.height:,}")
        show(dataset)  # display the full image
        visualize_band = 2  # rasterio band indexes start at 1
        show(dataset.read(visualize_band), transform=dataset.transform, cmap='viridis')
Opened GeoTIFF file with 3 bands and dimensions: 8,767 x 4,384
[Figure: the full GeoTIFF as displayed by show(dataset)] [Figure: band 2 of the GeoTIFF rendered with the viridis colormap]
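
A file of this size (8,767 x 4,384 pixels per band) can be slow to plot at full resolution. One option is to ask rasterio for a downsampled copy of a band through the `out_shape` argument of `dataset.read`. The sketch below is a suggestion rather than part of the original notebook; the helper names `preview_shape` and `read_preview` and the default `max_dim` of 1024 are assumptions.

```python
def preview_shape(height, width, max_dim=1024):
    """Return (height, width) scaled so the longer side is at most max_dim."""
    scale = max(height, width) / max_dim
    if scale <= 1:
        return height, width  # already small enough
    return max(1, round(height / scale)), max(1, round(width / scale))


def read_preview(dataset, band=1, max_dim=1024):
    """Read one band of an open rasterio dataset at reduced resolution."""
    out_shape = preview_shape(dataset.height, dataset.width, max_dim)
    # rasterio resamples the band to out_shape while reading
    return dataset.read(band, out_shape=out_shape)


print(preview_shape(4384, 8767))
```

Inside the `with memfile.open() as dataset:` block, `show(read_preview(dataset, band=2), cmap='viridis')` would plot the smaller array. The `transform` argument is left out here because the original transform no longer matches the resampled pixel grid.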