Using Chronicling America to Analyze Specific Newspaper Titles for Content#

Feel free to download this notebook and put in your own search queries.


Notebook Example:

The Washington Times printed a special children’s section called “Book of Magic.” This section contains children’s stories as well as coloring and puzzle activities.

Specifically, we will use the API to:

  1. Narrow our search to a specific newspaper title and search its content for a phrase.

  2. Limit the search results so we only get the top 20 matches.

Importing Modules [Required]#

The following imports are required for the scripts to run properly:


  1. Run the code below.

    • It will import all the modules you need for this notebook.

    • Do not change anything.

import time
import re
import json
from urllib.request import urlopen
import requests
import pandas as pd
import pprint

Perform a Query#

# Perform Query - Paste your API Search Query URL into the searchURL
searchURL = 'https://www.loc.gov/collections/chronicling-america/?dl=page&fa=partof_title:the+washington+times+%28washington+%5Bd.c.%5D%29+1902-1939&ops=PHRASE&qs=book+of+magic&searchType=advanced&fo=json'
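
If you would rather build the query in code than paste a long URL, the same request can be assembled from its parts. This is an optional sketch, not part of the original workflow; the parameter names (dl, fa, ops, qs, searchType, fo) are simply read off the URL above:

from urllib.parse import urlencode

# Optional: assemble the same searchURL from named parameters.
# urlencode() percent-encodes the values, matching the URL pasted above.
base_url = 'https://www.loc.gov/collections/chronicling-america/'
params = {
    'dl': 'page',                 # return individual newspaper pages
    'fa': 'partof_title:the washington times (washington [d.c.]) 1902-1939',
    'ops': 'PHRASE',              # treat the query as an exact phrase
    'qs': 'book of magic',        # the phrase to search for
    'searchType': 'advanced',
    'fo': 'json',                 # ask the API for JSON instead of HTML
}
searchURL = base_url + '?' + urlencode(params)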

Run Function#

The function and limits below define what will be included in and excluded from the search results used for downloading.

  • The code below will select only the individual newspaper pages from your search results for download. It will not download whole newspaper issues.


  1. Run the code below.

    • Do not change anything.

  2. When the script is complete, it will tell you how many newspaper pages it found in your search.

  3. If you are satisfied with the number of results, proceed to the next section to run the download.

  4. If you are not satisfied with the number of results, go back and revise your API Search Query.

def get_item_ids(url, items=None, conditional='True'):
    if items is None:
        items = []  # avoid a shared mutable default argument
    # Check that the query URL is not an item or resource link.
    exclude = ["loc.gov/item", "loc.gov/resource"]
    if any(string in url for string in exclude):
        raise NameError('Your URL points directly to an item or '
                        'resource page (you can tell because "item" '
                        'or "resource" is in the URL). Please use '
                        'a search URL instead. For example, instead '
                        'of "https://www.loc.gov/item/2009581123/", '
                        'try "https://www.loc.gov/maps/?q=2009581123".')

    # Request pages of 100 results at a time.
    params = {"fo": "json", "c": 100, "at": "results,pagination"}
    call = requests.get(url, params=params)
    # Check that the API request was successful and returned JSON.
    if (call.status_code == 200) and ('json' in call.headers.get('content-type', '')):
        data = call.json()
        results = data['results'][:20]  # This limits results to the top 20
        for result in results:
            # Filter out anything that's a collection or web page.
            original_format = result.get("original_format") or []
            filter_out = ("collection" in original_format) \
                    or ("web page" in original_format) \
                    or (not eval(conditional))
            if not filter_out:
                # Get the link to the item record.
                if result.get("id"):
                    item = result.get("id")
                    # Filter out links to the Catalog or other platforms.
                    if item.startswith("http://www.loc.gov/resource"):
                        items.append(item)
                    if item.startswith("http://www.loc.gov/item"):
                        items.append(item)
        # Repeat the loop on the next page, unless we're on the last page.
        if data["pagination"]["next"] is not None:
            next_url = data["pagination"]["next"]
            get_item_ids(next_url, items, conditional)

        return items
    else:
        print('There was a problem. Try running the cell again, or check your searchURL.')
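
Note that the conditional parameter is a Python expression passed as a string and evaluated with eval() once per result, so only pass expressions you trust. As a hypothetical example (the date field name is an assumption about the result JSON, not something this notebook relies on), you could keep only results whose date mentions 1922:

# Hypothetical filter: the expression sees each `result` dict in turn
# inside get_item_ids(), so it can reference result.get(...).
ids_1922 = get_item_ids(searchURL, items=[],
                        conditional='"1922" in str(result.get("date"))')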

# Generate a list of records found from performing a query and save these Item IDs. (Create ids_list based on items found in the searchURL result)
ids_list = get_item_ids(searchURL, items=[])

# Add '&fo=json' to the end of each ID in ids_list so each request returns JSON (the results are stored in ids_list_json)
ids_list_json = []
for item_id in ids_list:
  if not item_id.endswith('&fo=json'):
    item_id += '&fo=json'
  ids_list_json.append(item_id)
ids = ids_list_json

print('\nSuccess! Your API Search Query found '+str(len(ids_list_json))+' related newspaper pages. Proceed to the next step')
Success! Your API Search Query found 20 related newspaper pages. Proceed to the next step
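
To sanity-check the list before moving on, you can print a few of the collected IDs; each should end in &fo=json. An optional one-liner:

# Optional: preview the first few item URLs that will be requested next.
for item_url in ids_list_json[:3]:
    print(item_url)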

Get Basic Metadata/Information for your Query and Store It in a List#

If you need metadata/information for your downloads, run the script below.

  • Before you run the code, make sure the function in the previous step completed; it will have printed a success message.

  • The JSON fields extracted in the script below can be changed to suit your requirements.


  1. Run the code below.

  2. When the script is complete, a message will appear.

# Create a list of dictionaries to store the item metadata
item_metadata_list = []

# Iterate over the list of item IDs
for item_id in ids_list_json:
  item_response = requests.get(item_id)

  # Check that the API call was successful and parse the JSON response
  if item_response.status_code == 200:
    item_data = item_response.json()
    # Skip results that lack the expected location fields
    if 'location_city' not in item_data['item']:
      continue

    # Extract the relevant item metadata
    Newspaper_Title = item_data['item']['newspaper_title']
    Issue_Date = item_data['item']['date']
    Page = item_data['pagination']['current']
    State = item_data['item']['location_state']
    City = item_data['item']['location_city']
    LCCN = item_data['item']['number_lccn']
    Contributor = item_data['item']['contributor_names']
    Batch = item_data['item']['batch']
    pdf = item_data['resource']['pdf']

    # Add the item metadata to the list
    item_metadata_list.append({
        'Newspaper Title': Newspaper_Title,
        'Issue Date': Issue_Date,
        'Page Number': Page,
        'LCCN': LCCN,
        'City': City,
        'State': State,
        'Contributor': Contributor,
        'Batch': Batch,
        'PDF Link': pdf,
    })

# Change date format to MM-DD-YYYY
for item in item_metadata_list:
  item['Issue Date'] = pd.to_datetime(item['Issue Date']).strftime('%m-%d-%Y')

# Create a Pandas DataFrame from the list of dictionaries
df = pd.DataFrame(item_metadata_list)

print('\nSuccess! Ready to proceed to the next step!')
Success! Ready to proceed to the next step!
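
Before exporting, it can be useful to inspect the DataFrame, for example by sorting it chronologically. A minimal optional sketch using the df built above (the _date helper column name is arbitrary):

# Optional: sort the metadata chronologically. 'Issue Date' now holds
# MM-DD-YYYY strings, so parse it back to real dates before sorting.
df_sorted = df.assign(_date=pd.to_datetime(df['Issue Date'], format='%m-%d-%Y')) \
              .sort_values('_date') \
              .drop(columns='_date')
print(df_sorted[['Issue Date', 'Page Number']].head())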

Export Metadata of Search Results to a CSV File#


  1. Edit your save location and the filename below.

  2. Then run the code.

# Set your local saveTo location (e.g. C:/Downloads/)
saveTo = 'output'

# Set the file name. Rename it each run so you don't overwrite a previous file!
filename = 'MetadataFileName'
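
One caveat: the to_csv() call below will fail if the saveTo folder does not already exist. An optional sketch to create it first (the os module is not imported above):

import os

# Optional: create the output folder if it doesn't already exist,
# so to_csv() won't fail with a missing-directory error.
os.makedirs(saveTo, exist_ok=True)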

Run the code below to save your file and print a preview of the DataFrame:

print('\nSuccess! Please check your saveTo location to see the saved csv file. See Preview Below:\n')

metadata_dataframe = pd.DataFrame(item_metadata_list)
metadata_dataframe.to_csv(saveTo + '/' + filename + '.csv')
metadata_dataframe
Success! Please check your saveTo location to see the saved csv file. See Preview Below:
|    | Newspaper Title | Issue Date | Page Number | LCCN | City | State | Contributor | Batch | PDF Link |
|----|-----------------|------------|-------------|------|------|-------|-------------|-------|----------|
| 0  | [The Washington times.] | 11-28-1921 | 2  | [sn84026749] | [washington] | [district of columbia] | [Library of Congress, Washington, DC] | [dlc_bordercollie_ver02] | https://tile.loc.gov/storage-services/service/... |
| 1  | [The Washington times.] | 02-05-1922 | 35 | [sn84026749] | [washington] | [district of columbia] | [Library of Congress, Washington, DC] | [dlc_yorkie_ver01] | https://tile.loc.gov/storage-services/service/... |
| 2  | [The Washington times.] | 04-09-1922 | 57 | [sn84026749] | [washington] | [district of columbia] | [Library of Congress, Washington, DC] | [dlc_chihuahua_ver02] | https://tile.loc.gov/storage-services/service/... |
| 3  | [The Washington times.] | 02-05-1922 | 38 | [sn84026749] | [washington] | [district of columbia] | [Library of Congress, Washington, DC] | [dlc_yorkie_ver01] | https://tile.loc.gov/storage-services/service/... |
| 4  | [The Washington times.] | 12-11-1921 | 77 | [sn84026749] | [washington] | [district of columbia] | [Library of Congress, Washington, DC] | [dlc_bordercollie_ver02] | https://tile.loc.gov/storage-services/service/... |
| 5  | [The Washington times.] | 04-16-1922 | 54 | [sn84026749] | [washington] | [district of columbia] | [Library of Congress, Washington, DC] | [dlc_chihuahua_ver02] | https://tile.loc.gov/storage-services/service/... |
| 6  | [The Washington times.] | 02-05-1922 | 29 | [sn84026749] | [washington] | [district of columbia] | [Library of Congress, Washington, DC] | [dlc_yorkie_ver01] | https://tile.loc.gov/storage-services/service/... |
| 7  | [The Washington times]  | 01-07-1923 | 59 | [sn84026749] | [washington] | [district of columbia] | [Library of Congress, Washington, DC] | [dlc_melville_ver02] | https://tile.loc.gov/storage-services/service/... |
| 8  | [The Washington times.] | 12-11-1921 | 74 | [sn84026749] | [washington] | [district of columbia] | [Library of Congress, Washington, DC] | [dlc_bordercollie_ver02] | https://tile.loc.gov/storage-services/service/... |
| 9  | [The Washington times.] | 02-05-1922 | 33 | [sn84026749] | [washington] | [district of columbia] | [Library of Congress, Washington, DC] | [dlc_yorkie_ver01] | https://tile.loc.gov/storage-services/service/... |
| 10 | [The Washington times]  | 01-28-1923 | 67 | [sn84026749] | [washington] | [district of columbia] | [Library of Congress, Washington, DC] | [dlc_melville_ver02] | https://tile.loc.gov/storage-services/service/... |
| 11 | [The Washington times.] | 02-05-1922 | 31 | [sn84026749] | [washington] | [district of columbia] | [Library of Congress, Washington, DC] | [dlc_yorkie_ver01] | https://tile.loc.gov/storage-services/service/... |
| 12 | [The Washington times.] | 03-26-1922 | 59 | [sn84026749] | [washington] | [district of columbia] | [Library of Congress, Washington, DC] | [dlc_bordercollie_ver02] | https://tile.loc.gov/storage-services/service/... |
| 13 | [The Washington times.] | 02-05-1922 | 30 | [sn84026749] | [washington] | [district of columbia] | [Library of Congress, Washington, DC] | [dlc_yorkie_ver01] | https://tile.loc.gov/storage-services/service/... |
| 14 | [The Washington times.] | 03-05-1922 | 54 | [sn84026749] | [washington] | [district of columbia] | [Library of Congress, Washington, DC] | [dlc_yorkie_ver01] | https://tile.loc.gov/storage-services/service/... |
| 15 | [The Washington times]  | 01-28-1923 | 62 | [sn84026749] | [washington] | [district of columbia] | [Library of Congress, Washington, DC] | [dlc_melville_ver02] | https://tile.loc.gov/storage-services/service/... |
| 16 | [The Washington times.] | 11-17-1922 | 19 | [sn84026749] | [washington] | [district of columbia] | [Library of Congress, Washington, DC] | [dlc_chihuahua_ver02] | https://tile.loc.gov/storage-services/service/... |
| 17 | [The Washington times.] | 11-15-1922 | 28 | [sn84026749] | [washington] | [district of columbia] | [Library of Congress, Washington, DC] | [dlc_chihuahua_ver02] | https://tile.loc.gov/storage-services/service/... |
| 18 | [The Washington times]  | 02-17-1923 | 16 | [sn84026749] | [washington] | [district of columbia] | [Library of Congress, Washington, DC] | [dlc_melville_ver02] | https://tile.loc.gov/storage-services/service/... |
| 19 | [The Washington times.] | 12-21-1922 | 26 | [sn84026749] | [washington] | [district of columbia] | [Library of Congress, Washington, DC] | [dlc_chihuahua_ver02] | https://tile.loc.gov/storage-services/service/... |