Color clusters in images in Library of Congress collections#

Color analysis is a fascinating way to explore and identify patterns in a group of images. The Library of Congress has thousands of digitized images online for the public. Let’s take a look at the colors in each image, by collection.

We’re going to use the Loc.gov JSON API to get the metadata about each collection’s items.

Get image URLs and thumbnails for each image in a specific collection#

First, we need to get a list of URLs for both the items and their thumbnails in a specific collection. To get the thumbnail image, we take the first URL in the “image_url” field. Sometimes multiple image sizes are available, but the first one is generally the thumbnail.
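To show what that extraction looks like, here is a hypothetical item record (the URL values are invented, but the shape mirrors the entries in the API’s “results” list):

```python
# illustrative item record -- the values are invented, but the shape mirrors
# a "results" entry from the loc.gov JSON API
result = {
    "id": "https://www.loc.gov/item/00000001/",
    "image_url": [
        "//cdn.loc.gov/service/pnp/bbc/0000/0000/0001f_150px.jpg",  # thumbnail
        "//cdn.loc.gov/service/pnp/bbc/0000/0000/0001f.jpg",        # larger size
    ],
}

thumbnail = result["image_url"][0]  # the first entry is generally the thumbnail
print(thumbnail)
```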

import requests

The function below takes as input the URL for a collection. Be sure to include ?fo=json at the end so that you’re getting JSON, not the HTML for the web page. For example: https://www.loc.gov/collections/sanborn-maps/?fo=json

def get_image_urls(url, images=None, item_urls=None):
    # use None defaults so repeated calls don't share one mutable list
    if images is None:
        images = []
    if item_urls is None:
        item_urls = []
    call = requests.get(url)
    data = call.json()
    results = data['results']
    for result in results:
        # don't get images from collection-level or web page results
        original_format = result.get("original_format") or []
        if "collection" not in original_format and "web page" not in original_format:
            image_urls = result.get("image_url")
            # sometimes the image_url field starts with https and sometimes
            # it's protocol-relative (starts with //)
            if image_urls and image_urls[0].startswith("http"):
                images.append(image_urls[0])
                item_urls.append(result.get("id"))
            elif image_urls:
                images.append("https:{0}".format(image_urls[0]))
                item_urls.append(result.get("id"))
            else:
                # some items don't have images available
                print("problem result, has no image_url: {0}.\n {1}".format(result.get("id"), result))

    if data["pagination"]["next"] is not None: # make sure we haven't hit the end of the pages
        next_url = data["pagination"]["next"]
        get_image_urls(next_url, images, item_urls)

    return images, item_urls

Let’s get the URLs for the thumbnails for the Baseball Cards collection.

image_list, item_urls = get_image_urls("https://www.loc.gov/collections/baseball-cards/?fo=json")

How many thumbnail images did we identify in the Baseball Cards collection?

len(image_list)
2085

And let’s confirm that we have the same number of item URLs:

len(item_urls)
2085

Analyzing the thumbnail images for color clusters#

I’m adapting code from metakirby5’s colorz script, a command-line k-means color scheme generator. It analyzes an image and presents an HTML page with swatches.

It uses scipy’s kmeans clustering algorithm to determine clusters of colors. Because of the way k-means clustering works, you might get different values each time you run it.
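If you want repeatable clusters while experimenting, one option is to seed numpy’s global random state, which scipy’s kmeans draws its starting centroids from (the pixels array here is random stand-in data, not a real image):

```python
import numpy as np
from scipy.cluster.vq import kmeans

# 500 random RGB "pixels" as stand-in data
pixels = np.random.RandomState(0).rand(500, 3) * 255

np.random.seed(42)
first, _ = kmeans(pixels, 3)

np.random.seed(42)
second, _ = kmeans(pixels, 3)

# identical seeds give identical centroids; unseeded runs usually won't
```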

Since I wanted to display colors close to the actual colors in the images, I removed the script’s option to “brighten” colors. The script also has the option to “clamp” the value of the colors, adjusting them into a narrow range. This reduces the differences in lightness, so that differences in hue stand out more. I’ve chosen not to clamp the values for this data.
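To make clamping concrete, here is a standalone sketch of the idea (the helper name clamp_value is mine; the script’s original defaults of 170 and 200 are used in the first two calls):

```python
from colorsys import rgb_to_hsv, hsv_to_rgb

def clamp_value(color, min_v, max_v, scale=256.0):
    # convert RGB (0-255) to HSV, pin the value channel into [min_v, max_v],
    # then convert back to RGB
    h, s, v = rgb_to_hsv(*(c / scale for c in color))
    v = min(max(min_v / scale, v), max_v / scale)
    return tuple(int(c * scale) for c in hsv_to_rgb(h, s, v))

print(clamp_value((255, 255, 255), 170, 200))  # bright white pulled down to (200, 200, 200)
print(clamp_value((10, 10, 10), 170, 200))     # near-black pushed up to (170, 170, 170)
print(clamp_value((10, 10, 10), 0, 256))       # min 0 / max 256 leaves the color as-is
```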

Colorz requires that you install the scipy and Pillow packages in your Python environment. You can do this with:

pip install scipy Pillow

These are also available via conda/Anaconda if you’re using that to manage your Python environment.

from PIL import Image
from sys import exit
from io import BytesIO
from colorsys import rgb_to_hsv, hsv_to_rgb
from scipy.cluster.vq import kmeans
from numpy import array
DEFAULT_NUM_COLORS = 6
# default minimum and maximum values are used to clamp the color values to a specific range
# originally this was set to 170 and 200, but I'm running with 0 and 256 in order to 
# not clamp the values. This can also be set as a parameter. 
DEFAULT_MINV = 0
DEFAULT_MAXV = 256

THUMB_SIZE = (200, 200)
SCALE = 256.0

def down_scale(x):
    return x / SCALE

def up_scale(x):
    return int(x * SCALE)

def clamp(color, min_v, max_v):
    """
    Clamps a color such that the value (lightness) is between min_v and max_v.
    """
    # use down_scale to convert color to value between 0-1 as expected by rgb_hsv
    h, s, v = rgb_to_hsv(*map(down_scale, color))
    # also convert the min_v and max_v to values between 0-1
    min_v, max_v = map(down_scale, (min_v, max_v))
    # get the maximum of the min value and the color's value (therefore bumping it up if needed)
    # then get the minimum of that number and the max_v (bumping the value down if needed)
    v = min(max(min_v, v), max_v)
    # convert the h, s, v(which has been clamped) to rgb, apply upscale to get it back to 0-255, return tuple R,G,B
    return tuple(map(up_scale, hsv_to_rgb(h, s, v)))

def order_by_hue(colors):
    """
    Orders colors by hue.
    """
    hsvs = [rgb_to_hsv(*map(down_scale, color)) for color in colors]
    hsvs.sort(key=lambda t: t[0])
    return [tuple(map(up_scale, hsv_to_rgb(*hsv))) for hsv in hsvs]

def get_colors(img):
    """
    Returns a list of all the image's colors.
    """
    w, h = img.size
    # convert('RGB') converts the image's pixels info to RGB 
    # getcolors() returns an unsorted list of (count, pixel) values
    # w * h ensures that maxcolors parameter is set so that each pixel could be unique
    # there are three values returned in a list
    return [color for count, color in img.convert('RGB').getcolors(w * h)]

def hexify(rgb):
    return "#{0:02x}{1:02x}{2:02x}".format(*rgb)

def colorz(image_url, n=DEFAULT_NUM_COLORS, min_v=DEFAULT_MINV, max_v=DEFAULT_MAXV,
           order_colors=True):
    """
    Get the n most dominant colors of an image.
    Clamps value to between min_v and max_v.

    Total number of colors returned is n, optionally ordered by hue.
    Returns as a list of RGB triples.

    """
    try:
        r = requests.get(image_url)
    except ValueError:
        print("{0} was not a valid URL.".format(image_url))
        exit(1)
    img = Image.open(BytesIO(r.content))
    img.thumbnail(THUMB_SIZE) # replace with a thumbnail with same aspect ratio, no larger than THUMB_SIZE
    obs = get_colors(img) # gets a list of RGB colors (e.g. (213, 191, 152)) for each pixel
    # adjust the value of each color, if you've chosen to change minimum and maximum values
    clamped = [clamp(color, min_v, max_v) for color in obs] 
    # turns the list of colors into a numpy array of floats, then applies scipy's k-means function
    clusters, _ = kmeans(array(clamped).astype(float), n) 
    colors = order_by_hue(clusters) if order_colors else clusters    
    hex_colors = list(map(hexify, colors)) # turn RGB into hex colors for web
    return hex_colors

As an example, here’s what you get back for a single image:

single_image = colorz("https://cdn.loc.gov/service/pnp/bbc/0000/0000/0001f_150px.jpg")
single_image
['#a3876b', '#beaf93', '#dad1b7', '#f1edde', '#6f6a54', '#434333']

Those are six hexadecimal colors, representing the clusters formed by the colors of each pixel. Because of the way the k-means clustering works, you may get slightly different colors each time you run the analysis.

Now let’s get the colors for all of the image thumbnails.

all_images = list(map(colorz, image_list))
all_images[0:10]
[['#a2876b', '#bdaf93', '#dad1b7', '#f1edde', '#6f6a54', '#434333'],
 ['#60503c', '#8e7a5c', '#b09f7e', '#d0c09f', '#e9ddc1', '#f7f4e7'],
 ['#b49e84', '#8e7a61', '#493f2f', '#cdbfa4', '#e3d7bc', '#f2eede'],
 ['#795336', '#ab8964', '#dacdb1', '#c4b595', '#e7e1cd', '#f4f1e6'],
 ['#6c4d37', '#9a7d5d', '#bfa783', '#dcc69e', '#f3f2e4', '#e0dec5'],
 ['#73563f', '#9a7f61', '#bea684', '#dcc7a2', '#f5f2e5', '#e2dfc7'],
 ['#5c4d3c', '#9a8265', '#b5a687', '#e2dbc5', '#c9c1a7', '#f3f0e3'],
 ['#d9c5ac', '#b4a58e', '#887b64', '#504838', '#f3efe3', '#e4dcc7'],
 ['#79533b', '#ac8460', '#d7b78e', '#e9dfc1', '#f6f5e8', '#a2b8a7'],
 ['#af9e77', '#efe3c8', '#686045', '#d2cbac', '#bfb790', '#8d8662']]

Great! Now let’s try another collection. To get URLs for item pages and thumbnail images for the Works Progress Administration Posters, we can apply the same set of steps.

posters, poster_item_urls = get_image_urls("https://www.loc.gov/collections/works-progress-administration-posters/?fo=json", [], [])
len(posters)
931
posters_colors = list(map(colorz, posters))

Temporarily storing the color analysis#

It can take a little while to run the colorz function on a set of images, and while you’re working with this notebook and experimenting, you might want to temporarily save those results into a pickle file (i.e., serialize them). That allows you to reload the list of colors as a Python object later, without re-requesting and processing the images.

import pickle

with open("baseball-colors-list.pkl", "wb") as f:
    pickle.dump(all_images, f, pickle.HIGHEST_PROTOCOL)

To reload the pickled object:

with open("baseball-colors-list.pkl", "rb") as infile:
    baseball_colors = pickle.load(infile)

Drawing color swatches#

It’s great to have those color values, but really, we want to view them and be able to link out to the image on the Library of Congress website for more information.

The function below will draw square swatches of color for each of the six color clusters in an image. Clicking on the swatches will take you to that image’s web page. The function takes as input a tuple with the item page URL and color list. For example:

("https://cdn.loc.gov/service/pnp/bbc/0000/0010/0019f_150px.jpg", ['#cfbea1', '#e7e0ce', '#aaa78e', '#8f8d6b', '#777753', '#565940'])

def draw_row_with_links(link_and_colors):
    html = ""
    url = link_and_colors[0]
    for count, color in enumerate(link_and_colors[1]):
        # each 30x30 swatch is offset horizontally by its position in the list
        square = '<rect x="{0}" y="{1}" width="30" height="30" fill="{2}" />'.format(count * 30, 0, color)
        html += square
    full_html = '<a href="{0}" target="_blank"><svg height="30" width="180">{1}</svg></a>'.format(url, html)
    return full_html

We can show that HTML here in the notebook. This is a row of swatches for one image.

from IPython.display import display, HTML

single_item = (item_urls[0], all_images[0])
html = draw_row_with_links(single_item)

display(HTML(html))

Here are the swatches for all of the images in the Baseball Cards collection.
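Rendering them is a matter of concatenating one swatch row per image. The sketch below uses a one-item stand-in for the item_urls and all_images lists (the URL is invented) and inlines a compact copy of draw_row_with_links so the snippet runs on its own; in the notebook itself you’d zip the real lists and use the function defined above:

```python
# compact copy of draw_row_with_links, repeated so this snippet runs standalone
def draw_row_with_links(link_and_colors):
    url, colors = link_and_colors
    squares = "".join(
        '<rect x="{0}" y="0" width="30" height="30" fill="{1}" />'.format(i * 30, c)
        for i, c in enumerate(colors)
    )
    return '<a href="{0}" target="_blank"><svg height="30" width="180">{1}</svg></a>'.format(url, squares)

# one-item stand-ins for the lists built earlier (the URL is invented)
item_urls = ["https://www.loc.gov/item/00000001/"]
all_images = [["#a2876b", "#bdaf93", "#dad1b7", "#f1edde", "#6f6a54", "#434333"]]

baseball_page_html = ""
for image in zip(item_urls, all_images):
    baseball_page_html += draw_row_with_links(image)
```

In the notebook, display(HTML(baseball_page_html)) then renders the rows, just as in the single-item example above.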

Colors in other collections#

Since the color analysis includes the entire image, it’s influenced by the framing colors and targets (the color ruler used to evaluate color accuracy).

Some collections, such as Cartoons and Drawings and Japanese Prints pre-1915, sometimes include a target when the digitized image is a scan of a color transparency.

What are the colors in the Works Progress Administration Posters? Remember, we already got the URLs and colors above.

linkable_posters_page_html = ""

for image in zip(poster_item_urls, posters_colors):
    line = draw_row_with_links(image)
    linkable_posters_page_html += line
    
display(HTML(linkable_posters_page_html))

You can see that the colors in the images in that collection are rather different from those in the Baseball Cards collection.

Create JSON files from the data#

Let’s save the URLs and colors as a JSON file. This is a format we can then use in other applications or possibly other analysis tools.

The function below takes as input the collection slug name, the list of URLs (either the items or the thumbnails, depending on what you might want to do with the data later), the list of colors, and the JSON filename.

import json

def create_json(collection, item_urls, colors, filename):
    data = {"collection": collection, "images": []}
    for image in zip(item_urls, colors):
        data["images"].append({"url": image[0], "colors": image[1]})
    # write as UTF-8 so ensure_ascii=False is safe for non-ASCII titles
    with open(filename, 'w', encoding='utf-8') as f:
        json.dump(data, f, ensure_ascii=False)

create_json("baseball-cards", item_urls, all_images, "baseball-cards-colors.json")
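To use the file elsewhere, load it back with the json module. This round-trip sketch writes a miniature stand-in of the same structure (invented values) and reads it again:

```python
import json

# miniature stand-in for the structure create_json writes out (values invented)
data = {"collection": "baseball-cards",
        "images": [{"url": "https://www.loc.gov/item/00000001/",
                    "colors": ["#a2876b", "#bdaf93"]}]}

with open("sample-colors.json", "w", encoding="utf-8") as f:
    json.dump(data, f, ensure_ascii=False)

with open("sample-colors.json", encoding="utf-8") as infile:
    reloaded = json.load(infile)
```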