Color clusters in images in Library of Congress collections#
Color analysis is a fascinating way to explore and identify patterns in a group of images. The Library of Congress has thousands of digitized images online for the public. Let’s take a look at the colors in each image, by collection.
We’re going to use the Loc.gov JSON API to get the metadata about each collection’s items.
Get image URLs and thumbnails for each image in a specific collection#
First, we need to get a list of URLs for both the items and their thumbnails in a specific collection. To get the thumbnail image, we’re taking the first URL in the “image_url” field. Sometimes there are multiple image sizes available, but the first one is generally the thumbnail.
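To make that rule concrete, here’s a tiny standalone example with a made-up result record (the real API response has many more fields, and the field values here are placeholders):

```python
# Hypothetical, trimmed-down loc.gov result record; real results have many more fields.
result = {
    "id": "https://www.loc.gov/item/example/",
    "image_url": [
        "//cdn.loc.gov/service/pnp/example/0001f_150px.jpg",  # thumbnail (first entry)
        "//cdn.loc.gov/service/pnp/example/0001f_500px.jpg",  # larger size
    ],
}

thumbnail = result["image_url"][0]
# the URL may be protocol-relative, so add https: when the scheme is missing
if not thumbnail.startswith("http"):
    thumbnail = "https:{0}".format(thumbnail)
print(thumbnail)  # -> https://cdn.loc.gov/service/pnp/example/0001f_150px.jpg
```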
import requests
The function below takes as input the URL for a collection. Be sure to include ?fo=json
at the end so that you’re getting JSON, not the HTML for the web page. For example: https://www.loc.gov/collections/sanborn-maps/?fo=json
def get_image_urls(url, images=None, item_urls=None):
    # start with fresh lists on each top-level call (mutable default
    # arguments would persist between calls)
    if images is None:
        images = []
    if item_urls is None:
        item_urls = []
    call = requests.get(url)
    data = call.json()
    results = data['results']
    for result in results:
        # don't get an image from the collection-level or web page results
        original_format = result.get("original_format", [])
        if "collection" not in original_format and "web page" not in original_format:
            # sometimes the image_url field starts with https: and sometimes not
            if result.get("image_url") and "http" in result.get("image_url")[0]:
                image = result.get("image_url")[0]
                images.append(image)
                item_url = result.get("id")
                item_urls.append(item_url)
            elif result.get("image_url"):
                image = "https:{0}".format(result.get("image_url")[0])
                images.append(image)
                item_url = result.get("id")
                item_urls.append(item_url)
            else:
                # some items don't have images available
                print("problem result, has no image_url: {0}.\n {1}".format(result.get("id"), result))
    if data["pagination"]["next"] is not None:  # make sure we haven't hit the end of the pages
        next_url = data["pagination"]["next"]
        get_image_urls(next_url, images, item_urls)
    return images, item_urls
Let’s get the URLs for the thumbnails for the Baseball Cards collection.
image_list, item_urls = get_image_urls("https://www.loc.gov/collections/baseball-cards/?fo=json")
How many thumbnail images did we identify in the Baseball Cards collection?
len(image_list)
2085
And let’s confirm that we have the same number of item URLs:
len(item_urls)
2085
Analyzing the thumbnail images for color clusters#
I’m adapting code from the colorz.py script in metakirby5’s colorz project, a command-line k-means color scheme generator. It runs an analysis on an image and presents an HTML page with swatches.
It uses SciPy’s k-means clustering algorithm to determine clusters of colors. Because of the way k-means clustering works, you might get slightly different values each time you run it.
Since I wanted to display colors close to the actual colors in the images, I removed the script’s option to “brighten” colors. The script also has an option to “clamp” the value of the colors, which adjusts them into a small range of values. This reduces the differences in lightness between the colors, so you can see differences in hue more easily. I’ve chosen not to clamp the values for this data.
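To illustrate what clamping does, here’s a small standalone sketch of the same idea using only the standard library’s colorsys (clamp_value is my own illustrative helper, not part of colorz, using the script’s original 170–200 defaults):

```python
from colorsys import rgb_to_hsv, hsv_to_rgb

def clamp_value(rgb, min_v=170, max_v=200):
    """Clamp the V (value/lightness) channel of a 0-255 RGB color into [min_v, max_v]."""
    # scale to 0-1 as colorsys expects
    h, s, v = rgb_to_hsv(*(c / 256.0 for c in rgb))
    # pull v up to min_v and down to max_v as needed, hue and saturation unchanged
    v = min(max(min_v / 256.0, v), max_v / 256.0)
    return tuple(int(c * 256) for c in hsv_to_rgb(h, s, v))

print(clamp_value((10, 10, 10)))    # very dark gray brightened to min_v -> (170, 170, 170)
print(clamp_value((250, 250, 250))) # near-white pulled down to max_v -> (200, 200, 200)
```

After clamping, every color sits in a narrow band of lightness, so remaining differences are mostly hue and saturation.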
Colorz requires that you install the scipy and Pillow packages in your Python environment. You can do this with:
pip install scipy Pillow
These are also available via conda/Anaconda if you’re using that to manage your Python environment.
from PIL import Image
from sys import exit
from io import BytesIO
from colorsys import rgb_to_hsv, hsv_to_rgb
from scipy.cluster.vq import kmeans
from numpy import array
DEFAULT_NUM_COLORS = 6
# default minimum and maximum values are used to clamp the color values to a specific range
# originally this was set to 170 and 200, but I'm running with 0 and 256 in order to
# not clamp the values. This can also be set as a parameter.
DEFAULT_MINV = 0
DEFAULT_MAXV = 256
THUMB_SIZE = (200, 200)
SCALE = 256.0
def down_scale(x):
    return x / SCALE

def up_scale(x):
    return int(x * SCALE)
def clamp(color, min_v, max_v):
    """
    Clamps a color such that the value (lightness) is between min_v and max_v.
    """
    # use down_scale to convert the color to values between 0-1 as expected by rgb_to_hsv
    h, s, v = rgb_to_hsv(*map(down_scale, color))
    # also convert min_v and max_v to values between 0-1
    min_v, max_v = map(down_scale, (min_v, max_v))
    # take the maximum of min_v and the color's value (bumping it up if needed),
    # then the minimum of that number and max_v (bumping the value down if needed)
    v = min(max(min_v, v), max_v)
    # convert the h, s, v (with v clamped) back to RGB, up_scale to 0-255, and return an (R, G, B) tuple
    return tuple(map(up_scale, hsv_to_rgb(h, s, v)))
def order_by_hue(colors):
    """
    Orders colors by hue.
    """
    hsvs = [rgb_to_hsv(*map(down_scale, color)) for color in colors]
    hsvs.sort(key=lambda t: t[0])
    return [tuple(map(up_scale, hsv_to_rgb(*hsv))) for hsv in hsvs]
def get_colors(img):
    """
    Returns a list of all the image's colors.
    """
    w, h = img.size
    # convert('RGB') converts the image's pixel info to RGB
    # getcolors() returns an unsorted list of (count, pixel) values;
    # w * h sets the maxcolors parameter high enough that every pixel could be unique
    # each color is an (R, G, B) tuple
    return [color for count, color in img.convert('RGB').getcolors(w * h)]

def hexify(rgb):
    return "#{0:02x}{1:02x}{2:02x}".format(*rgb)
def colorz(image_url, n=DEFAULT_NUM_COLORS, min_v=DEFAULT_MINV, max_v=DEFAULT_MAXV,
           order_colors=True):
    """
    Get the n most dominant colors of an image.
    Clamps value to between min_v and max_v.
    Total number of colors returned is n, optionally ordered by hue.
    Returns as a list of RGB triples.
    """
    try:
        r = requests.get(image_url)
    except ValueError:
        print("{0} was not a valid URL.".format(image_url))
        exit(1)
    img = Image.open(BytesIO(r.content))
    img.thumbnail(THUMB_SIZE)  # replace with a thumbnail with the same aspect ratio, no larger than THUMB_SIZE
    obs = get_colors(img)  # gets a list of RGB colors (e.g. (213, 191, 152)) for each pixel
    # adjust the value of each color, if you've chosen to change the minimum and maximum values
    clamped = [clamp(color, min_v, max_v) for color in obs]
    # turn the list of colors into a numpy array of floats, then apply scipy's k-means function
    clusters, _ = kmeans(array(clamped).astype(float), n)
    colors = order_by_hue(clusters) if order_colors else clusters
    hex_colors = list(map(hexify, colors))  # turn RGB into hex colors for the web
    return hex_colors
As an example, here’s what you get back for a single image:
single_image = colorz("https://cdn.loc.gov/service/pnp/bbc/0000/0000/0001f_150px.jpg")
single_image
['#a3876b', '#beaf93', '#dad1b7', '#f1edde', '#6f6a54', '#434333']
Those are six hexadecimal colors, representing the clusters formed by the colors of each pixel. Because of the way the k-means clustering works, you may get slightly different colors each time you run the analysis.
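If you want those hex strings back as RGB tuples for further analysis, plain Python can convert them (hex_to_rgb is a small helper I’m adding here for illustration, not part of colorz):

```python
def hex_to_rgb(hex_color):
    """Convert a '#rrggbb' string back to an (R, G, B) tuple of ints."""
    h = hex_color.lstrip("#")
    # take each pair of hex digits and parse it as a base-16 integer
    return tuple(int(h[i:i + 2], 16) for i in range(0, 6, 2))

print(hex_to_rgb("#a3876b"))  # -> (163, 135, 107)
```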
Now let’s get the colors for all of the image thumbnails.
all_images = list(map(colorz, image_list))
all_images[0:10]
[['#a2876b', '#bdaf93', '#dad1b7', '#f1edde', '#6f6a54', '#434333'],
['#60503c', '#8e7a5c', '#b09f7e', '#d0c09f', '#e9ddc1', '#f7f4e7'],
['#b49e84', '#8e7a61', '#493f2f', '#cdbfa4', '#e3d7bc', '#f2eede'],
['#795336', '#ab8964', '#dacdb1', '#c4b595', '#e7e1cd', '#f4f1e6'],
['#6c4d37', '#9a7d5d', '#bfa783', '#dcc69e', '#f3f2e4', '#e0dec5'],
['#73563f', '#9a7f61', '#bea684', '#dcc7a2', '#f5f2e5', '#e2dfc7'],
['#5c4d3c', '#9a8265', '#b5a687', '#e2dbc5', '#c9c1a7', '#f3f0e3'],
['#d9c5ac', '#b4a58e', '#887b64', '#504838', '#f3efe3', '#e4dcc7'],
['#79533b', '#ac8460', '#d7b78e', '#e9dfc1', '#f6f5e8', '#a2b8a7'],
['#af9e77', '#efe3c8', '#686045', '#d2cbac', '#bfb790', '#8d8662']]
Great! Now let’s try another collection. To get URLs for item pages and thumbnail images for the Works Progress Administration Posters, we can apply the same set of steps.
posters, poster_item_urls = get_image_urls("https://www.loc.gov/collections/works-progress-administration-posters/?fo=json", [], [])
len(posters)
931
posters_colors = list(map(colorz, posters))
Temporarily storing the color analysis#
It can take a little while to run the colorz function on a set of images, so while you’re working with this notebook and experimenting, you might want to temporarily save those results to a pickle file (i.e., serialize them). That lets you reload the list of colors as a Python object later, without re-requesting and processing the images.
import pickle
with open("baseball-colors-list.txt", "wb") as f:
    pickle.dump(all_images, f, pickle.HIGHEST_PROTOCOL)
To reload the pickled object:
with open("baseball-colors-list.txt", "rb") as infile:
    baseball_colors = pickle.load(infile)
Drawing color swatches#
It’s great to have those color values, but really, we want to view them and be able to link out to the image on the Library of Congress website for more information.
The function below will draw square swatches of color for each of the six color clusters in an image. Clicking on the swatches will take you to that image’s web page. The function takes as input a tuple with the item page URL and color list. For example:
("https://cdn.loc.gov/service/pnp/bbc/0000/0010/0019f_150px.jpg", ['#cfbea1', '#e7e0ce', '#aaa78e', '#8f8d6b', '#777753', '#565940'])
def draw_row_with_links(link_and_colors):
    html = ""
    url = link_and_colors[0]
    for count, color in enumerate(link_and_colors[1]):
        square = '<rect x="{0}" y="{1}" width="30" height="30" fill="{2}" />'.format(((count * 30) + 30), 0, color)
        html += square
    full_html = '<a href="{0}" target="_blank"><svg height="30" width="210">{1}</svg></a>'.format(url, html)
    return full_html
We can show that HTML here in the notebook. This is a row of swatches for one image.
from IPython.display import display, HTML
single_item = (item_urls[0], all_images[0])
html = draw_row_with_links(single_item)
display(HTML(html))
Here are the swatches for all of the images in the Baseball Cards collection.
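The cell that renders them builds one long HTML string, one row per item, and hands it to display(HTML(...)) just like the single-row example above. Here is a self-contained sketch of that pattern with two made-up items (draw_row is a standalone copy of the draw_row_with_links logic so this snippet runs on its own; in the notebook you would zip the real item_urls and all_images lists):

```python
def draw_row_with_links(link_and_colors):
    # standalone copy of the notebook's draw_row_with_links logic
    url, colors = link_and_colors
    squares = "".join(
        '<rect x="{0}" y="0" width="30" height="30" fill="{1}" />'.format((i * 30) + 30, c)
        for i, c in enumerate(colors)
    )
    return '<a href="{0}" target="_blank"><svg height="30" width="210">{1}</svg></a>'.format(url, squares)

# made-up item URLs and colors; in the notebook, use zip(item_urls, all_images)
items = [
    ("https://www.loc.gov/item/example-1/", ['#a3876b', '#beaf93', '#dad1b7', '#f1edde', '#6f6a54', '#434333']),
    ("https://www.loc.gov/item/example-2/", ['#60503c', '#8e7a5c', '#b09f7e', '#d0c09f', '#e9ddc1', '#f7f4e7']),
]
page_html = "".join(draw_row_with_links(item) for item in items)
# in the notebook: display(HTML(page_html))
```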
Colors in other collections#
Since the color analysis includes the entire image, it’s influenced by the framing colors and targets (the color ruler used to evaluate color accuracy).
Some collections such as Cartoons and Drawings and Japanese Prints pre-1915 sometimes include a target when the digitized image is a scan of a color transparency.
What are the colors in the Works Progress Administration Posters? Remember, we already got the URLs and colors above.
linkable_posters_page_html = ""
for image in zip(poster_item_urls, posters_colors):
    line = draw_row_with_links(image)
    linkable_posters_page_html += line
display(HTML(linkable_posters_page_html))
You can see that the colors in the images in that collection are rather different from those in the Baseball Cards collection.
Create JSON files from the data#
Let’s save the URLs and colors as a JSON file. This is a format we can then use in other applications or possibly other analysis tools.
The function below takes as input the collection slug name, the list of URLs (either the items or the thumbnails, depending on what you might want to do with the data later), the list of colors, and the JSON filename.
import json
def create_json(collection, item_urls, colors, filename):
    data = {"collection": collection, "images": []}
    with open(filename, 'w') as f:
        for image in zip(item_urls, colors):
            data["images"].append({"url": image[0], "colors": image[1]})
        json.dump(data, f, ensure_ascii=False)
create_json("baseball-cards", item_urls, all_images, "baseball-cards-colors.json")
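To reuse the data elsewhere, the file can be read back with json.load. Here’s a self-contained round trip using made-up data in the same shape create_json writes (the URL and colors are placeholders):

```python
import json
import os
import tempfile

# made-up data in the same {"collection": ..., "images": [...]} shape create_json writes
data = {"collection": "baseball-cards",
        "images": [{"url": "https://www.loc.gov/item/example/",
                    "colors": ['#a3876b', '#beaf93']}]}

path = os.path.join(tempfile.mkdtemp(), "colors.json")
with open(path, "w") as f:
    json.dump(data, f, ensure_ascii=False)

# reload the file as a Python dict
with open(path) as f:
    loaded = json.load(f)
print(loaded["collection"], len(loaded["images"]))  # -> baseball-cards 1
```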