Downloading an item with multiple pages to PDF#
The loc.gov API provides structured data about Library of Congress collections in JSON and YAML formats. This notebook shows how you can use the API to access the image resources belonging to an LoC item and aggregate them into a single PDF file.
Understanding API Responses#
Review: JSON Response Objects. Each of the endpoint types has a distinct response format, but they can be broadly grouped into two categories:
responses to queries for a list of items, or Search Results Responses
responses to queries for a single item, or Item and Resource Responses
This notebook focuses on the JSON response object for a single item and on compiling its corresponding resources (the files that make up an item, e.g. images of a book's pages) into a single PDF file.
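As a rough illustration of the two categories, the sketch below guesses which kind of response a parsed JSON object is by checking its top-level keys. The helper name `response_kind` is hypothetical, and the assumption that search responses carry a top-level `results` key is ours; the sample dictionaries are hand-written stand-ins, not real API output.

```python
def response_kind(data):
    """Guess which broad category a parsed loc.gov JSON response belongs to.

    Assumption: search responses expose a top-level "results" list and
    item responses expose a top-level "item" object.
    """
    if "results" in data:
        return "search results"
    if "item" in data:
        return "item"
    return "unknown"


# Hand-written stand-ins for real responses (not actual API output)
search_like = {"results": [{"title": "Example"}], "pagination": {"total": 1}}
item_like = {"item": {"title": "Example"}, "resources": []}

print(response_kind(search_like))
print(response_kind(item_like))
```

A check like this can be handy before indexing into a response, since the two categories are navigated differently.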
Prerequisites#
There are no prerequisites for running this notebook beyond installing the libraries listed in the imports section.
I. Imports#
from PIL import Image
import os
from io import BytesIO
import requests
II. Create a request URL#
First, we will start by ensuring we have a link to an item of interest. In this instance we will look at the Benjamin Harrison Papers: Series 13, Venezuela Boundary Dispute, 1895-1899; Part 2, 1895-1899 as an example.
Notice the format of the link to this item: https://www.loc.gov/item/mss250640164/
item_link="https://www.loc.gov/item/mss250640164/"
request_url = item_link + "?fo=json"
# Note: The addition of the "fo=json" string ensures that the item request is in JSON format
print(f'Item API Request URL: {request_url}')
Item API Request URL: https://www.loc.gov/item/mss250640164/?fo=json
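Appending "?fo=json" by string concatenation works for a bare item link, but it would produce a malformed URL if the link already had a query string. The sketch below shows one possible alternative using the standard library's urllib.parse. The helper name `to_json_url` and the `sp=3` parameter in the second example are illustrative assumptions, not part of the original notebook.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def to_json_url(link):
    """Return `link` with fo=json set, preserving any existing query parameters.

    Note: repeated parameter names are collapsed to their last value.
    """
    parts = urlsplit(link)
    query = dict(parse_qsl(parts.query, keep_blank_values=True))
    query["fo"] = "json"
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(query), parts.fragment)
    )


# A bare item link, as used in this notebook
print(to_json_url("https://www.loc.gov/item/mss250640164/"))
# A link that already has a (made-up, illustrative) query parameter
print(to_json_url("https://www.loc.gov/item/mss250640164/?sp=3"))
```

For the single link used in this notebook, the result matches the concatenated URL shown above.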
We will also set a start and end page that we want to download and compile.
start_page = 1 # page numbers start at 1
end_page = 10 # up to and including this page; set this to -1 to retrieve all pages
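Because the page numbers are 1-based and inclusive while Python slices are 0-based and exclusive, the conversion deserves some care. In particular, a plain slice treats -1 as "stop before the last element", so the "-1 means all pages" convention needs explicit handling. The helper `page_slice` below is a hypothetical sketch of one way to do that.

```python
def page_slice(start_page, end_page):
    """Convert a 1-based, inclusive page range into a Python slice.

    An end_page of -1 is interpreted as "through the last page".
    """
    if start_page < 1:
        raise ValueError("start_page is 1-based and must be at least 1")
    stop = None if end_page == -1 else end_page
    return slice(start_page - 1, stop)


# Illustrative data standing in for a list of page files
pages = [f"page {n}" for n in range(1, 6)]

print(pages[page_slice(1, 3)])   # pages 1 through 3
print(pages[page_slice(2, -1)])  # page 2 through the last page
```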
III. Request Data#
# Requests the item from the LOC API and parses the JSON response
r = requests.get(request_url)
r.raise_for_status() # stop early if the request failed
data = r.json()
# print(data)
# Here is a quick way of looking at the structure of the data
print("Top-level data structure:\n" + ", ".join(data.keys()))
Top-level data structure:
articles_and_essays, cite_this, item, more_like_this, options, related_items, resources, timestamp, type
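Listing the keys shows what is there but not what shape each value has. As a small, hedged extension, the sketch below reports the type and size of each top-level value. The function name `summarize` is hypothetical and the `sample` dictionary is a hand-written stand-in for a real response.

```python
def summarize(data):
    """Describe the type (and size, for containers) of each top-level value."""
    lines = []
    for key, value in data.items():
        kind = type(value).__name__
        if isinstance(value, (list, dict)):
            lines.append(f"{key}: {kind} with {len(value)} entries")
        else:
            lines.append(f"{key}: {kind}")
    return lines


# Hand-written stand-in for a parsed item response (not actual API output)
sample = {
    "item": {"title": "Example", "id": "x"},
    "resources": [{"files": []}],
    "timestamp": 0,
}

print("\n".join(summarize(sample)))
```

Calling `summarize(data)` on the real response would show, for instance, whether `resources` is a list and how many entries it holds.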
IV. Resource Data and Extracting Resource URLs#
In the previous code cell, you can see that the response contains a lot of metadata that can be explored. However, in this notebook we will focus on accessing information about the resources.
Instead of looking at the item through data['item'], we will look at the resources through data['resources']. We will then build a list of image URLs, one per resource file, choosing the version with the best resolution (based on the largest height).
First, let’s just retrieve a list of resources/files
resources = data['resources'][0]
files = resources['files']
num_resources = len(files)
print(f'Total # of Resources: {num_resources:,}')
print('Resource Data: ' + ", ".join(key for key in resources.keys()))
# And select a subset of these files (an end_page of -1 means "through the last page")
files = files[(start_page-1):(None if end_page == -1 else end_page)]
print(f'Selected {len(files)} files')
Total # of Resources: 1,143
Resource Data: caption, files, image, url
Selected 10 files
Next we will select the highest resolution .jpg image for each file
urls = []
for i, file_sizes in enumerate(files):
    # only select files that are .jpg and have a height
    jpgs = [f for f in file_sizes if 'url' in f and f['url'].endswith('.jpg') and 'height' in f]
    # Check to see if we have at least one .jpg image
    if len(jpgs) < 1:
        print(f'No .jpgs found in file #{i+1}. Skipping.')
        continue
    # sort the jpgs by height, descending
    jpgs = sorted(jpgs, key=lambda f: -f['height'])
    # choose the largest one
    urls.append(jpgs[0]['url'])
print(f"Found {len(urls)} .jpg file URLs")
print("\n".join(urls))
Found 10 .jpg file URLs
https://tile.loc.gov/image-services/iiif/service:mss:mss25064:mss25064-141:0018/full/pct:100/0/default.jpg
https://tile.loc.gov/image-services/iiif/service:mss:mss25064:mss25064-141:0019/full/pct:100/0/default.jpg
https://tile.loc.gov/image-services/iiif/service:mss:mss25064:mss25064-141:0020/full/pct:100/0/default.jpg
https://tile.loc.gov/image-services/iiif/service:mss:mss25064:mss25064-141:0021/full/pct:100/0/default.jpg
https://tile.loc.gov/image-services/iiif/service:mss:mss25064:mss25064-141:0022/full/pct:100/0/default.jpg
https://tile.loc.gov/image-services/iiif/service:mss:mss25064:mss25064-141:0023/full/pct:100/0/default.jpg
https://tile.loc.gov/image-services/iiif/service:mss:mss25064:mss25064-141:0024/full/pct:100/0/default.jpg
https://tile.loc.gov/image-services/iiif/service:mss:mss25064:mss25064-141:0025/full/pct:100/0/default.jpg
https://tile.loc.gov/image-services/iiif/service:mss:mss25064:mss25064-141:0026/full/pct:100/0/default.jpg
https://tile.loc.gov/image-services/iiif/service:mss:mss25064:mss25064-141:0027/full/pct:100/0/default.jpg
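The URLs listed above appear to follow the IIIF Image API layout, in which the last four path segments are region, size, rotation, and quality.format (here full, pct:100, 0, and default.jpg). Assuming that layout holds, the sketch below swaps the size segment to request a smaller image. The helper name `with_iiif_size` is hypothetical, and whether the image server honors a given size value has not been verified here.

```python
def with_iiif_size(url, size):
    """Return `url` with its IIIF size segment replaced by `size`.

    Assumption: the URL ends in .../{region}/{size}/{rotation}/{quality}.{format}
    """
    parts = url.rsplit("/", 4)
    if len(parts) != 5:
        raise ValueError("URL does not look like an IIIF image request")
    base, region, _old_size, rotation, quality = parts
    return "/".join([base, region, size, rotation, quality])


url = (
    "https://tile.loc.gov/image-services/iiif/"
    "service:mss:mss25064:mss25064-141:0018/full/pct:100/0/default.jpg"
)

# Ask for the image at half of its full dimensions
print(with_iiif_size(url, "pct:50"))
```

Requesting smaller images can reduce download time and the size of the compiled PDF, at the cost of legibility.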
V. Downloading images into a PDF file#
Finally, we will download each image into memory, open it as a PIL image, then compile all of the images into a single PDF file. You can change the resolution and file name in the code below.
# Downloads the image at the given URL and opens it as a PIL image
def download_image(url):
    response = requests.get(url)
    response.raise_for_status() # fail loudly rather than parse an error page as an image
    return Image.open(BytesIO(response.content))

# Using the list of URLs, this function creates the final PDF file from the aggregated resources.
def create_pdf(image_urls, pdf_name):
    images = []
    for url in image_urls:
        image = download_image(url)
        images.append(image)
    # Save the first image as a PDF and append the remaining images as extra pages
    images[0].save(
        pdf_name, "PDF", resolution=100.0, save_all=True, append_images=images[1:]
    )
    print("LOC Item Resources have been saved as pdf: " + pdf_name)
# creating the PDF (make sure the output folder exists first)
pdf_name = 'output/sample.pdf'
os.makedirs(os.path.dirname(pdf_name), exist_ok=True)
create_pdf(urls, pdf_name)
LOC Item Resources have been saved as pdf: output/sample.pdf
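When downloading many pages, a single failed request would abort the whole run. The sketch below shows one possible way to retry a download a few times before giving up. It is not part of the original notebook: the name `fetch_with_retry` and its parameters are hypothetical, and the download function is passed in as an argument so that the example can be run without network access using a simulated failure.

```python
import time


def fetch_with_retry(fetch, url, attempts=3, delay=1.0):
    """Call fetch(url), retrying on any exception up to `attempts` times.

    The wait grows linearly (delay, 2 * delay, ...) between attempts.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return fetch(url)
        except Exception as error:  # broad on purpose for this sketch
            last_error = error
            if attempt < attempts:
                time.sleep(delay * attempt)
    raise last_error


# Simulated download that fails twice and then succeeds
calls = []

def flaky_fetch(url):
    calls.append(url)
    if len(calls) < 3:
        raise ConnectionError("simulated failure")
    return b"image bytes"


result = fetch_with_retry(flaky_fetch, "https://example.org/page.jpg", delay=0)
print(result, len(calls))
```

In this notebook, `fetch` could be any callable that raises on failure, for example a wrapper around `requests.get` that calls `raise_for_status()` on the response.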