Using Chronicling America to analyze word frequency in newspaper Front Pages#

Feel free to download this notebook and run it with your own search query.


Notebook Example:

On August 18, 1920, Tennessee became the 36th state to ratify the 19th Amendment, providing the final ratification necessary to add the amendment, which gave women the right to vote, to the U.S. Constitution. In this example, we look at the frequency of the term “Suffrage” on newspaper front pages on August 26, 1920 (when the U.S. government officially certified Tennessee’s ratification of the 19th Amendment) and the day after (August 27, 1920).

Importing Modules [Required]#

The following imports are required for the scripts to run properly:


  1. Run the following code below.

    • It will import all the modules you need for this notebook.

    • Do not change anything.

import time
import re
import json
from urllib.request import urlopen
import requests
import pandas as pd
import pprint

Perform a Query#

Note: we are able to search only front pages because we include the parameter “front_pages_only=true” in the API URL.

# Perform Query - Paste your API Search Query URL into the searchURL
searchURL = 'https://www.loc.gov/collections/chronicling-america/?dl=page&end_date=1920-08-27&front_pages_only=true&ops=AND&qs=suffrage&searchType=advanced&start_date=1920-08-26&fo=json'
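If you want to adapt the query, it can help to inspect and modify the URL's parameters programmatically rather than editing the string by hand. A minimal sketch using the standard library's urllib.parse (the parameter names come from the URL above; the substituted search term is purely illustrative):

```python
from urllib.parse import urlparse, parse_qs, urlencode, urlunparse

searchURL = 'https://www.loc.gov/collections/chronicling-america/?dl=page&end_date=1920-08-27&front_pages_only=true&ops=AND&qs=suffrage&searchType=advanced&start_date=1920-08-26&fo=json'

# Break the URL into its parts and decode the query string into a flat dict
parts = urlparse(searchURL)
params = {key: values[0] for key, values in parse_qs(parts.query).items()}
print(params['qs'], params['front_pages_only'])  # the search term and the front-pages flag

# Swap in a different search term (illustrative value) and rebuild the URL
params['qs'] = 'prohibition'
new_url = urlunparse(parts._replace(query=urlencode(params)))
```

The same approach works for changing the date range (`start_date`, `end_date`) while leaving the rest of the query untouched.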

Run Function#

The function and limits below define what will be included in, and excluded from, the search results to be downloaded.

  • The code below will only download the newspaper pages from your search result. It will not download the whole newspaper issue.


  1. Run the code below.

    • Do not change anything.

  2. When the script is complete, it will tell you how many Newspaper Pages it found from your search.

  3. If you are satisfied with the number of results, proceed to the next section to run the download.

  4. If you are not satisfied with the number of results, go back and redo the API Search Query.

def get_item_ids(url, items=None, conditional='True'):
    # Start with a fresh list each call (a mutable default argument would
    # persist between calls and accumulate results from earlier queries).
    if items is None:
        items = []
    # Check that the query URL is not an item or resource link.
    exclude = ["loc.gov/item", "loc.gov/resource"]
    if any(string in url for string in exclude):
        raise NameError('Your URL points directly to an item or '
                        'resource page (you can tell because "item" '
                        'or "resource" is in the URL). Please use '
                        'a search URL instead. For example, instead '
                        'of \"https://www.loc.gov/item/2009581123/\", '
                        'try \"https://www.loc.gov/maps/?q=2009581123\". ')

    # Request pages of 100 results at a time
    params = {"fo": "json", "c": 100, "at": "results,pagination"}
    call = requests.get(url, params=params)
    # Check that the API request was successful
    if (call.status_code == 200) and ('json' in call.headers.get('content-type')):
        data = call.json()
        results = data['results']
        for result in results:
            # Filter out anything that's a collection or web page
            original_format = result.get("original_format", [])
            filter_out = ("collection" in original_format) \
                    or ("web page" in original_format) \
                    or (eval(conditional) == False)
            if not filter_out:
                # Get the link to the item record
                if result.get("id"):
                    item = result.get("id")
                    # Filter out links to Catalog or other platforms
                    if item.startswith("http://www.loc.gov/resource"):
                        items.append(item)
                    if item.startswith("http://www.loc.gov/item"):
                        items.append(item)
        # Repeat the loop on the next page, unless we're on the last page.
        if data["pagination"]["next"] is not None:
            next_url = data["pagination"]["next"]
            get_item_ids(next_url, items, conditional)

        return items
    else:
        print('There was a problem. Try running the cell again, or check your searchURL.')

# Generate a list of records found from performing a query and save these Item IDs. (Create ids_list based on items found in the searchURL result)
ids_list = get_item_ids(searchURL, items=[])

# Add 'fo=json' to the end of each row in ids_list (All individual ids from the ids_list are now listed in JSON format in ids_list_json)
ids_list_json = []
for item_id in ids_list:
    if not item_id.endswith('&fo=json'):
        item_id += '&fo=json'
    ids_list_json.append(item_id)

print('\nSuccess! Your API Search Query found '+str(len(ids_list_json))+' related newspaper pages. You may proceed')
Success! Your API Search Query found 189 related newspaper pages. You may proceed

Get Basic Metadata/Information for your Query and Store It in a List#

If you need metadata/information for your downloads, run the script below. The JSON parameters in the script can be changed per your requirements.


  1. Run the code below after the previous step is completed.

  2. When the script is complete, a message will be shown on the bottom.

# Create a list of dictionaries to store the item metadata
item_metadata_list = []

# Iterate over the list of item IDs
for item_id in ids_list_json:
    item_response = requests.get(item_id)

    # Check if the API call was successful and parse the JSON response
    if item_response.status_code == 200:
        item_data = item_response.json()
        # Skip records that lack location information
        if 'location_city' not in item_data['item']:
            continue

        # Extract the relevant item metadata
        Newspaper_Title = item_data['item']['newspaper_title']
        Issue_Date = item_data['item']['date']
        Page = item_data['pagination']['current']
        State = item_data['item']['location_state']
        City = item_data['item']['location_city']
        LCCN = item_data['item']['number_lccn']
        Contributor = item_data['item']['contributor_names']
        Batch = item_data['item']['batch']
        pdf = item_data['resource']['pdf']

        # Add the item metadata to the list
        item_metadata_list.append({
            'Newspaper Title': Newspaper_Title,
            'Issue Date': Issue_Date,
            'Page Number': Page,
            'LCCN': LCCN,
            'City': City,
            'State': State,
            'Contributor': Contributor,
            'Batch': Batch,
            'PDF Link': pdf,
        })

# Change date format to MM-DD-YYYY
for item in item_metadata_list:
    item['Issue Date'] = pd.to_datetime(item['Issue Date']).strftime('%m-%d-%Y')

# Create a Pandas DataFrame from the list of dictionaries
df = pd.DataFrame(item_metadata_list)

print('\nReady to proceed to the next step!')
Ready to proceed to the next step!

Export Metadata of Search Results to a CSV File#


  1. Edit your save location and the filename below.

  2. Then run the code.

# Add your Local saveTo Location (e.g. C:/Downloads/)
saveTo = 'output'

# Set File Name. Make sure to rename the file so it doesn't overwrite a previous one!
filename = 'MetadataFileName'
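If the saveTo folder does not exist yet, pandas will raise an error when writing the CSV. A small precaution (assuming the saveTo value set above) is to create the directory first:

```python
import os

saveTo = 'output'  # same value as set above

# Create the output directory if it is missing; do nothing if it already exists
os.makedirs(saveTo, exist_ok=True)
```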

Press Run to save your file and print a preview of the dataframe below:

print('\nSuccess! Please check your saveTo location to see the saved csv file. See Preview Below:\n')

metadata_dataframe = pd.DataFrame(item_metadata_list)
metadata_dataframe.to_csv(saveTo + '/' + filename + '.csv')
metadata_dataframe
Success! Please check your saveTo location to see the saved csv file. See Preview Below:
Newspaper Title Issue Date Page Number LCCN City State Contributor Batch PDF Link
0 [The American issue] 08-27-1920 1 [sn2008060406] [westerville] [ohio] [University of Illinois at Urbana-Champaign Li... [iune_bismuth_ver01] https://tile.loc.gov/storage-services/service/...
1 [The Washington herald.] 08-27-1920 1 [sn83045433] [washington] [district of columbia] [Library of Congress, Washington, DC] [dlc_frenchbulldog_ver04] https://tile.loc.gov/storage-services/service/...
2 [The Chattanooga news.] 08-26-1920 1 [sn85038531] [chattanooga] [tennessee] [University of Tennessee] [tu_anita_ver01] https://tile.loc.gov/storage-services/service/...
3 [The review] 08-26-1920 1 [sn91068415] [high point] [north carolina] [University of North Carolina at Chapel Hill L... [ncu_elk_ver01] https://tile.loc.gov/storage-services/service/...
4 [Washington standard] 08-27-1920 1 [sn84022770] [olympia] [washington] [Washington State Library; Olympia, WA] [wa_kittitas_ver01] https://tile.loc.gov/storage-services/service/...
... ... ... ... ... ... ... ... ... ...
184 [The Lambertville record] 08-27-1920 1 [sn84026089] [lambertville] [new jersey] [Rutgers University Libraries] [njr_beachhaven_ver01] https://tile.loc.gov/storage-services/service/...
185 [Eagle River review] 08-27-1920 1 [sn85040614] [eagle river] [wisconsin] [Wisconsin Historical Society] [whi_bobidosh_ver01] https://tile.loc.gov/storage-services/service/...
186 [New Mexico state record.] 08-27-1920 1 [sn93061701] [santa fe.] [new mexico] [University of New Mexico] [nmu_austen_ver01] https://tile.loc.gov/storage-services/service/...
187 [Little Falls herald] 08-27-1920 1 [sn89064515] [little falls] [minnesota] [Minnesota Historical Society; Saint Paul, MN] [mnhi_peugeot_ver01] https://tile.loc.gov/storage-services/service/...
188 [Milford chronicle] 08-27-1920 1 [sn87062224] [milford] [delaware] [University of Delaware Library, Newark, DE] [deu_deadpool_ver01] https://tile.loc.gov/storage-services/service/...

189 rows × 9 columns

From data we’ve collected from newspapers found in Chronicling America:

During August 26, 1920 to August 27, 1920, the state of Tennessee had 14 different front pages with the term “Suffrage” when the U.S. Government certified the state’s ratification of the 19th Amendment on August 26, 1920. These 14 front pages came from 12 different Tennessee newspaper titles. This result is almost double every other state.