Color clusters in images in Library of Congress collections#
Color analysis is a fascinating way to explore and identify patterns in a group of images. The Library of Congress has thousands of digitized images online for the public. Let’s take a look at the colors in JPEG images.
We’re going to use the loc.gov JSON API to get metadata about each collection’s items. We will:
- get the image URLs for JPEGs in a collection
- analyze each image for color clusters
- draw linked color swatches
- save the results as JSON
Version#
Version: 2
Last Run: July 11, 2025 (Python 3.12)
Author Information:
Written by Laura Wrubel, Visiting Scholar 2018
Edited by Sabrina Templeton, Junior Fellow 2025
Prerequisites#
Note that running this notebook will download files onto your machine.
Python packages that may require installation if not already present in your environment. They can be installed with pip install [package]:
numpy
pillow
requests
scipy
Get image URLs for JPEGs in a specific collection#
First, we need to get a list of URLs for items in a specific collection and fetch the first JPEG listed for each item. We’ll pull from the image file list on the search result record, which contains a partial list of images (JPEGs and GIFs) for each search result. To get the JPEG image, we take the first URL with “.jpg” in the “image_url” field. Sometimes multiple image sizes are available, but the first one is generally the smallest.
import requests
import time
The function below takes as input the URL for a collection. Be sure to include ?fo=json at the end so that you’re getting JSON, not the HTML for the web page. For example: https://www.loc.gov/collections/sanborn-maps/?fo=json
def get_image_urls(url, images=None, item_urls=None):
    '''
    Retrieves the image URLs for items that have public URLs available.
    Skips over items that are for the collection as a whole or web pages about the collection.
    Handles pagination.

    Args:
        url (str): The URL to request a collection.
        images (list, optional): The list that fetched images will get added to.
        item_urls (list, optional): The list that fetched item URLs will get added to.

    Returns:
        list: The images from the collection.
        list: The item URLs from the collection.
    '''
    # start fresh lists here rather than using mutable default arguments,
    # which would persist between calls
    if images is None:
        images = []
    if item_urls is None:
        item_urls = []
    call = requests.get(url)
    data = call.json()
    results = data['results']
    for result in results:
        # don't get the image from the collection-level result
        original_format = result.get("original_format", [])
        if "collection" not in original_format and "web page" not in original_format:
            if result.get("image_url"):
                for image_url in result.get("image_url"):
                    # get the first image URL that contains ".jpg"
                    if ".jpg" in image_url:
                        images.append(image_url)
                        item_urls.append(result.get("id"))
                        break
            else:
                # some items don't have images available
                print("problem result, has no image_url: {0}.\n {1}".format(result.get("id"), result))
    if data["pagination"]["next"] is not None:  # make sure we haven't hit the end of the pages
        next_url = data["pagination"]["next"]
        time.sleep(3)  # pause between requests to avoid API rate limits
        get_image_urls(next_url, images, item_urls)
    return images, item_urls
Let’s get the URLs for the first JPEG of each item in the Baseball Cards collection. The collection has over 2,000 items total, so to make running this notebook more manageable we’ll grab a smaller subset by filtering to cards from the year 1888. If you want to proceed with the entire collection, simply remove the dates= filter at the end of the URL.
image_list, item_urls = get_image_urls("https://www.loc.gov/collections/baseball-cards/?fo=json&dates=1888")
How many JPEG images did we identify in the Baseball Cards collection?
len(image_list)
58
And let’s confirm that we have the same number of item URLs:
len(item_urls)
58
Analyzing the images for color clusters#
I’m adapting code from colorz, a command-line k-means color scheme generator by metakirby5. It runs an analysis on an image and presents an HTML page with swatches.
It uses scipy’s kmeans clustering algorithm to determine clusters of colors. Because of the way k-means clustering works, you might get different values each time you run it.
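Before walking through the adapted script, here is a minimal, self-contained illustration (not from the original script; the pixel values are made up) of how scipy’s kmeans groups RGB triples into cluster centers:

```python
from numpy import array
from scipy.cluster.vq import kmeans

# made-up observations: four dark pixels and four light pixels
pixels = array([
    [10, 12, 8], [14, 9, 11], [12, 15, 10], [9, 11, 13],
    [240, 238, 244], [250, 245, 239], [246, 241, 250], [238, 249, 243],
], dtype=float)

# ask for two clusters; kmeans returns the cluster centers and the mean distortion
centers, distortion = kmeans(pixels, 2)
print(centers)  # two centers, one near the dark pixels and one near the light ones
```

The real script below does the same thing, only with every pixel of a thumbnail as the observations and six clusters instead of two.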
Since I wanted to display colors close to the actual colors in the images, I removed the script’s option to “brighten” colors. The script also has an option to “clamp” the value (lightness) of the colors into a narrower range, which reduces differences in lightness so that differences in hue stand out more. I’ve chosen not to clamp the values for this data.
Colorz requires the scipy and Pillow packages in your Python environment; see the Prerequisites section for more details.
from PIL import Image
from sys import exit
from io import BytesIO
from colorsys import rgb_to_hsv, hsv_to_rgb
from scipy.cluster.vq import kmeans
from numpy import array
DEFAULT_NUM_COLORS = 6
# default minimum and maximum values are used to clamp the color values to a specific range
# originally this was set to 170 and 200, but I'm running with 0 and 256 in order to
# not clamp the values. This can also be set as a parameter.
DEFAULT_MINV = 0
DEFAULT_MAXV = 256
THUMB_SIZE = (200, 200)
SCALE = 256.0
def down_scale(x):
    return x / SCALE

def up_scale(x):
    return int(x * SCALE)
def clamp(color, min_v, max_v):
    """
    Clamps a color such that the value (lightness) is between min_v and max_v.
    """
    # use down_scale to convert the color to values between 0-1 as expected by rgb_to_hsv
    h, s, v = rgb_to_hsv(*map(down_scale, color))
    # also convert min_v and max_v to values between 0-1
    min_v, max_v = map(down_scale, (min_v, max_v))
    # take the maximum of min_v and the color's value (bumping it up if needed),
    # then the minimum of that number and max_v (bumping it down if needed)
    v = min(max(min_v, v), max_v)
    # convert the clamped h, s, v back to RGB, scaled up to 0-255, as an (R, G, B) tuple
    return tuple(map(up_scale, hsv_to_rgb(h, s, v)))
def order_by_hue(colors):
    """
    Orders colors by hue.
    """
    hsvs = [rgb_to_hsv(*map(down_scale, color)) for color in colors]
    hsvs.sort(key=lambda t: t[0])
    return [tuple(map(up_scale, hsv_to_rgb(*hsv))) for hsv in hsvs]
def get_colors(img):
    """
    Returns a list of all the image's colors.
    """
    w, h = img.size
    # convert('RGB') converts the image's pixel info to RGB
    # getcolors() returns an unsorted list of (count, color) tuples;
    # maxcolors=w * h ensures the call succeeds even if every pixel is unique
    return [color for count, color in img.convert('RGB').getcolors(w * h)]
def hexify(rgb):
    return "#{0:02x}{1:02x}{2:02x}".format(*rgb)
def colorz(image_url, n=DEFAULT_NUM_COLORS, min_v=DEFAULT_MINV, max_v=DEFAULT_MAXV,
           order_colors=True):
    """
    Get the n most dominant colors of an image.
    Clamps value to between min_v and max_v.
    Total number of colors returned is n, optionally ordered by hue.
    Returns as a list of RGB triples.
    """
    try:
        r = requests.get(image_url)
    except ValueError:
        print("{0} was not a valid URL.".format(image_url))
        exit(1)
    img = Image.open(BytesIO(r.content))
    img.thumbnail(THUMB_SIZE)  # replace with a thumbnail with the same aspect ratio, no larger than THUMB_SIZE
    obs = get_colors(img)  # gets a list of RGB colors (e.g. (213, 191, 152)) for each pixel
    # adjust the value of each color, if you've chosen to change the minimum and maximum values
    clamped = [clamp(color, min_v, max_v) for color in obs]
    # turn the list of colors into a numpy array of floats, then apply scipy's k-means function
    clusters, _ = kmeans(array(clamped).astype(float), n)
    colors = order_by_hue(clusters) if order_colors else clusters
    hex_colors = list(map(hexify, colors))  # turn RGB into hex colors for the web
    return hex_colors
As an example, here’s what you get back for a single image:
single_image = colorz("https://tile.loc.gov/storage-services/service/pnp/bbc/0000/0000/0001f_150px.jpg")
single_image
['#a3876b', '#beaf93', '#dad1b7', '#f1edde', '#6f6a54', '#434333']
Those are six hexadecimal colors, representing the clusters formed by the colors of each pixel. Because of the way the k-means clustering works, you may get slightly different colors each time you run the analysis.
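If you want repeatable results while experimenting, one option (not used in this notebook’s colorz function) is to pass a fixed seed to scipy’s kmeans, which controls the random initialization. A small sketch with made-up pixel values:

```python
from numpy import array
from scipy.cluster.vq import kmeans

# made-up pixel values: two darkish and two lightish colors
pixels = array([[30, 40, 50], [32, 41, 48], [200, 180, 160], [205, 175, 158]], dtype=float)

# the same seed produces the same initial centroids, and therefore the same clusters
centers_a, _ = kmeans(pixels, 2, seed=42)
centers_b, _ = kmeans(pixels, 2, seed=42)
print((centers_a == centers_b).all())  # True
```

You could expose this in colorz by adding a seed argument and passing it through to the kmeans call.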
Now let’s get the colors for the JPEGs.
all_images = list(map(colorz, image_list))
all_images[0:10]
[['#a69672', '#706242', '#8a7d5b', '#433c22', '#b4ad90', '#e1e0cc'],
['#b5a181', '#a39071', '#8f7f61', '#7a6b4f', '#61553d', '#c5b99d'],
['#514633', '#9f8f6f', '#72654b', '#8a7c5e', '#b2a383', '#cccab4'],
['#53412c', '#887054', '#c9ba9c', '#b1a284', '#e0d4b9', '#9a8e71'],
['#d7c4a7', '#b7a788', '#ebdfc4', '#96886a', '#70684f', '#454230'],
['#5f4930', '#8e7859', '#bcaa8a', '#d1c0a1', '#a39576', '#e7d9bc'],
['#816c55', '#c0a689', '#9e8d74', '#d5c1a3', '#e8dbc3', '#4b4738'],
['#9c764c', '#d0ae7d', '#efe8d4', '#ddcfa9', '#4b5a44', '#8ba886'],
['#eee8db', '#ddd3ba', '#929876', '#616a46', '#b7bea1', '#b9d4ca'],
['#e9dcd4', '#987659', '#c0aa8b', '#cac7ba', '#75a087', '#445f51']]
Great! Now let’s try another collection. To get URLs for item pages and JPEG images from the Works Progress Administration Posters collection, we can apply the same set of steps. Here we again filter by a date range to work with a more manageable number of images.
posters, poster_item_urls = get_image_urls("https://www.loc.gov/collections/works-progress-administration-posters/?fo=json&dates=1940/1949", [], [])
len(posters)
121
posters_colors = list(map(colorz, posters))
Temporarily storing the color analysis#
It can take a little while to run the colorz function on a set of images, so while you’re working with this notebook and experimenting, you might want to temporarily save those results into a pickle file (i.e., serialize them). That allows you to reload the list of colors as a Python object later, without re-requesting and processing the images.
import pickle
with open("baseball-colors-list.txt", "wb") as f:
    pickle.dump(all_images, f, pickle.HIGHEST_PROTOCOL)
To reload the pickled object:
with open("baseball-colors-list.txt", "rb") as infile:
    baseball_colors = pickle.load(infile)
Drawing color swatches#
It’s great to have those color values, but really, we want to view them and be able to link out to the image on the Library of Congress website for more information.
The function below will draw square swatches of color for each of the six color clusters in an image. Clicking on the swatches will take you to that image’s web page. The function takes as input a tuple with the item page URL and color list. For example:
("https://tile.loc.gov/storage-services/service/pnp/bbc/0000/0010/0019f_150px.jpg", ['#cfbea1', '#e7e0ce', '#aaa78e', '#8f8d6b', '#777753', '#565940'])
def draw_row_with_links(link_and_colors):
    '''
    Draws a row of swatches for the main colors in each image.
    Each swatch links to the image.

    Args:
        link_and_colors (tuple): A tuple containing the link as the first item and a list of colors as the second item.

    Returns:
        str: HTML containing the SVG color swatches.
    '''
    html = ""
    url = link_and_colors[0]
    for count, color in enumerate(link_and_colors[1]):
        square = '<rect x="{0}" y="{1}" width="30" height="30" fill="{2}" />'.format(((count * 30) + 30), 0, color)
        html += square
    full_html = '<a href="{0}" target="_blank"><svg height="30" width="210">{1}</svg></a>'.format(url, html)
    return full_html
We can show that HTML here in the notebook. This is a row of swatches for one image.
from IPython.display import display, HTML
single_item = (item_urls[0], all_images[0])
html = draw_row_with_links(single_item)
display(HTML(html))
Here are the swatches for all of the images in the Baseball Cards collection.
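That display follows the same pattern as the posters loop shown below: concatenate one row of swatches per item and pass the result to display(HTML(...)). Here is a self-contained sketch of the generated markup, with hypothetical item URLs and a condensed stand-in for draw_row_with_links:

```python
# condensed stand-in for the draw_row_with_links function defined above,
# so this sketch runs on its own; the URLs and colors are hypothetical
def row_html(url, colors):
    rects = "".join(
        '<rect x="{0}" y="0" width="30" height="30" fill="{1}" />'.format((i * 30) + 30, c)
        for i, c in enumerate(colors)
    )
    return '<a href="{0}" target="_blank"><svg height="30" width="210">{1}</svg></a>'.format(url, rects)

sample = [
    ("https://www.loc.gov/item/sample1/", ['#a69672', '#706242', '#8a7d5b']),
    ("https://www.loc.gov/item/sample2/", ['#b5a181', '#a39071', '#8f7f61']),
]
page_html = "".join(row_html(url, colors) for url, colors in sample)
print(page_html.count("<rect"))  # 6: one swatch per color
```

In the notebook itself you would iterate over zip(item_urls, all_images) and call draw_row_with_links instead.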
Colors in other collections#
Since the color analysis includes the entire image, it’s influenced by the framing colors and targets (the color ruler used to evaluate color accuracy).
Some collections such as Cartoons and Drawings and Japanese Prints pre-1915 sometimes include a target when the digitized image is a scan of a color transparency.
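If you want to reduce that influence, one option (not part of the original analysis) is to crop a margin off each image before passing it to colorz. A sketch with Pillow; the 10% margin is an arbitrary choice and may not match where a given collection places its targets:

```python
from PIL import Image

def crop_margin(img, fraction=0.10):
    """Crop `fraction` of the width and height off each side of the image."""
    w, h = img.size
    dx, dy = int(w * fraction), int(h * fraction)
    return img.crop((dx, dy, w - dx, h - dy))

# demo on a synthetic 200x100 image rather than a downloaded one
img = Image.new("RGB", (200, 100), color=(200, 180, 160))
cropped = crop_margin(img)
print(cropped.size)  # (160, 80)
```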
What are the colors in the Works Progress Administration Posters? Remember, we already got the URLs and colors above.
linkable_posters_page_html = ""
for image in zip(poster_item_urls, posters_colors):
    line = draw_row_with_links(image)
    linkable_posters_page_html += line
display(HTML(linkable_posters_page_html))
You can see that the colors in the images in that collection are rather different from those in the Baseball Cards collection.
Create JSON files from the data#
Let’s save the URLs and colors as a JSON file. This is a format we can then use in other applications or possibly other analysis tools.
The function below takes as input the collection slug name, the list of URLs (either the items or the JPEGs, depending on what you might want to do with the data later), the list of colors, and the JSON filename.
import json
def create_json(collection, item_urls, colors, filename):
    '''
    Takes the URLs and colors and saves them into a JSON file.

    Args:
        collection (str): A name for the collection.
        item_urls (list): The list of item URLs.
        colors (list): The list of colors.
        filename (str): The name of the file to save to.

    Returns:
        None
    '''
    data = {"collection": collection, "images": []}
    with open(filename, 'w') as f:
        for image in zip(item_urls, colors):
            data["images"].append({"url": image[0], "colors": image[1]})
        json.dump(data, f, ensure_ascii=False)
Note that running the next cell will create or overwrite a JSON file on your computer.
create_json("baseball-cards", item_urls, all_images, "baseball-cards-colors.json")
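To confirm the structure of the output, you can load the JSON back in. A self-contained sketch that round-trips a small made-up example (for the real data, open baseball-cards-colors.json the same way):

```python
import json

# made-up stand-in for the real collection data
data = {"collection": "example", "images": [
    {"url": "https://www.loc.gov/item/sample/", "colors": ['#a3876b', '#beaf93']},
]}
with open("example-colors.json", "w") as f:
    json.dump(data, f, ensure_ascii=False)

with open("example-colors.json") as infile:
    loaded = json.load(infile)
print(loaded["collection"], len(loaded["images"]))  # example 1
```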