Downloading Search Results from Chronicling America#

Downloading search results is slightly different from downloading newspaper titles and batches. The steps in this example are modified so that only the newspaper pages returned by your search are downloaded. Note that this method works specifically for newspapers; other formats on loc.gov may behave differently.

Importing Modules [Required]#

The following imports are required for the scripts to run properly:


  1. Run the code below.

    • It will import all the modules you need for this notebook.

    • Do not change anything.

import requests
import os
import pandas as pd
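
If requests or pandas is not already installed in your environment, installing from within the notebook usually suffices. This is a minimal sketch; the exact command depends on your setup:

# Optional: install the required packages if they are missing (assumes pip is available)
%pip install requests pandas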

Define your API Search Query and Save Location [Required]#

After running the Importing Modules code (above),

  1. Paste your API Search Query URL into searchURL = '{URL}' in the code below.

  2. Edit the file type you wish to download in fileExtension = '{filetype}'. PDF works best, but the options include pdf, jpeg, and xml (OCR files).

  3. Add the location where you want your files saved in “saveTo”.

  4. When ready, run the code. An optional check of this configuration is sketched after the code below.

# Perform Query - Paste your API Search Query URL into the searchURL
searchURL = 'https://www.loc.gov/collections/chronicling-america/?dl=page&end_date=1922-12-31&ops=PHRASE&qs=clara+bow&searchType=advanced&start_date=1922-12-01&fo=json'

# Add your desired file type (extension). Options Include: pdf, jpeg, and xml (OCR files)
fileExtension = 'pdf'

# Add your Local saveTo Location
saveTo = 'output'
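
Optionally, you can sanity-check this configuration before continuing. The sketch below assumes the imports and the searchURL and saveTo variables defined above; it creates the saveTo folder if it does not exist and confirms that the searchURL returns JSON:

# Optional configuration check (a minimal sketch using the variables defined above)
os.makedirs(saveTo, exist_ok=True)  # create the output folder if needed

test_call = requests.get(searchURL)
if test_call.status_code == 200 and 'json' in test_call.headers.get('content-type', ''):
    print('searchURL looks good and returns JSON.')
else:
    print('Check your searchURL. Status code:', test_call.status_code)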

Run Functions and Limits [Required]#

Functions and limits define what will be included and excluded in the search for downloads. The code below will only download the newspaper pages from your search result. It will not download the whole newspaper issue.


  1. Run the code below.

    • Do not change anything.

  2. When the script is complete, it will tell you how many newspaper pages it found in your search.

  3. If you are satisfied with the number of results, proceed to the next section to run the download.

  4. If you are not satisfied with the number of results, go back and revise the API Search Query.

'''Run the search and get a list of results.'''
def get_item_ids(url, items=None, conditional='True'):
    # Use a fresh list on each top-level call (avoids the mutable-default pitfall)
    if items is None:
        items = []
    # Check that the query URL is not an item or resource link.
    exclude = ["loc.gov/item", "loc.gov/resource"]
    if any(string in url for string in exclude):
        raise NameError('Your URL points directly to an item or '
                        'resource page (you can tell because "item" '
                        'or "resource" is in the URL). Please use '
                        'a search URL instead. For example, instead '
                        'of "https://www.loc.gov/item/2009581123/", '
                        'try "https://www.loc.gov/maps/?q=2009581123". ')

    # Request pages of 100 results at a time
    params = {"fo": "json", "c": 100, "at": "results,pagination"}
    call = requests.get(url, params=params)
    # Check that the API request was successful
    if (call.status_code == 200) and ('json' in call.headers.get('content-type', '')):
        data = call.json()
        results = data['results']
        for result in results:
            # Filter out anything that's a collection or web page
            filter_out = ("collection" in result.get("original_format")) \
                    or ("web page" in result.get("original_format")) \
                    or (eval(conditional) == False)
            if not filter_out:
                # Get the link to the item record
                if result.get("id"):
                    item = result.get("id")
                    # Filter out links to Catalog or other platforms
                    if item.startswith("http://www.loc.gov/resource"):
                        resource = item  # Assign item to resource
                        items.append(resource)
                    if item.startswith("http://www.loc.gov/item"):
                        items.append(item)
        # Repeat the loop on the next page, unless we're on the last page.
        if data["pagination"]["next"] is not None:
            next_url = data["pagination"]["next"]
            get_item_ids(next_url, items, conditional)

        return items
    else:
        print('There was a problem. Try running the cell again, or check your searchURL.')

# Create ids_list based on searchURL results
ids_list = get_item_ids(searchURL, items=[])

# Add 'fo=json' to the end of each ID in ids_list so each request returns JSON
new_ids = []
for item_id in ids_list:
    if not item_id.endswith('&fo=json'):
        item_id += '&fo=json'
    new_ids.append(item_id)
ids = new_ids

print('\nSuccess. Your API Search Query found '+str(len(new_ids))+' related newspaper pages. You may now continue.')
Success. Your API Search Query found 3 related newspaper pages. You may now continue.
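
If you would like to see what was found before downloading anything, a small optional preview like the one below (using the new_ids list created above) prints the first few result URLs:

# Optional: preview the first few result URLs before downloading (a minimal sketch)
for preview_url in new_ids[:5]:
    print(preview_url)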

Download Files#

If you want to download the found items, follow the instructions below.


  1. Run the code below.

    • Do not change anything.

  2. When the script is complete, the downloads will be found in your “saveTo” location.

  3. A list of the downloaded files will also appear at the bottom.

print('\n'+str(len(new_ids))+' Downloaded Files')

# Find the page URLs that match the fileExtension: print each one and collect it for download
page_urls = []
for item in new_ids:
    call = requests.get(item)
    if call.status_code == 200:
        data = call.json()
        pages = data['page']
        for page in pages:
            if 'url' in page:
                page_url = page['url']
                if page_url.endswith(fileExtension):
                    print(page_url)
                    page_urls.append(page_url)


# Create the folder tree if it doesn't exist, then download each file
for page_url in page_urls:
    # Extract the folder names and the filename from the URL
    batch_name = page_url.split('/')[-6]
    lccn_name = page_url.split('/')[-4]
    reel_name = page_url.split('/')[-3]
    issue_name = page_url.split('/')[-2]
    filename = page_url.split('/')[-1]

    # Create the batch/lccn/reel/issue subfolders if they don't exist
    issue_path = os.path.join(saveTo, batch_name, lccn_name, reel_name, issue_name)
    os.makedirs(issue_path, exist_ok=True)

    # Download the file
    response = requests.get(page_url)
    file_path = os.path.join(issue_path, filename)
    with open(file_path, 'wb') as f:
        f.write(response.content)

print('\nSuccess! Please check your saveTo location to see the saved files. You can also redownload the selected files using the links above.')
3 Downloaded Files
https://tile.loc.gov/storage-services/service/ndnp/wyu/batch_wyu_ellison_ver01/data/sn92066979/0051701011A/1922122701/0427.pdf
https://tile.loc.gov/storage-services/service/ndnp/dlc/batch_dlc_dalek_ver01/data/sn83045462/00280657232/1922122601/0584.pdf
https://tile.loc.gov/storage-services/service/ndnp/uuml/batch_uuml_kloeden_ver01/data/sn85058393/print/1922121701/1327.pdf

Success! Please check your saveTo location to see the saved files. You can also redownload the selected files using the links above.
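
If your search returns many pages, it is considerate to pause between requests rather than downloading as fast as possible. The sketch below is an optional, rate-limited variant of the download loop above; it uses a flat folder layout for brevity and assumes the page_urls and saveTo variables already defined:

# Optional: a rate-limited variant of the download loop (a sketch; flat folder layout for brevity)
import time

for page_url in page_urls:
    response = requests.get(page_url)
    if response.status_code == 200:
        file_path = os.path.join(saveTo, page_url.split('/')[-1])
        with open(file_path, 'wb') as f:
            f.write(response.content)
    time.sleep(1)  # wait one second between requests to be gentle on the server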

Get Basic Metadata/Information for your Downloaded Results#

If you need metadata/information for your downloads, run the script below. The JSON parameters in the script can be changed to suit your requirements.


  1. Run the code below.

  2. When the script is complete, a preview will be shown on the bottom.

# Create a list of dictionaries to store the item metadata
item_metadata_list = []

# Iterate over the list of item IDs
for item in new_ids:
    # Make the API call to get the item metadata
    call = requests.get(item)
    # Skip this item if the API call was not successful
    if call.status_code != 200:
        continue
    # Parse the JSON response
    new_ids_json = call.json()

    # Extract the relevant item metadata
    Newspaper_Title = new_ids_json['item']['newspaper_title']
    Issue_Date = new_ids_json['item']['date']
    Page = new_ids_json['pagination']['current']
    State = new_ids_json['item']['location_state']
    City = new_ids_json['item']['location_city']
    LCCN = new_ids_json['item']['number_lccn']
    Contributor = new_ids_json['item']['contributor_names']
    Batch = new_ids_json['item']['batch']
    pdf = new_ids_json['resource']['pdf']

    # Add the item metadata to the list
    item_metadata_list.append({
        'Newspaper Title': Newspaper_Title,
        'Issue Date': Issue_Date,
        'Page Number': Page,
        'LCCN': LCCN,
        'City': City,
        'State': State,
        'Contributor': Contributor,
        'Batch': Batch,
        'PDF Link': pdf,
        })

# Create a Pandas DataFrame from the list of dictionaries
df = pd.DataFrame(item_metadata_list)

# Print the DataFrame
print(df)
                  Newspaper Title  Issue Date  Page Number          LCCN  \
0        [The Laramie Republican]  1922-12-27            5  [sn92066979]   
1                 [Evening star.]  1922-12-26           22  [sn83045462]   
2  [The Ogden standard-examiner.]  1922-12-17           12  [sn85058393]   

           City                   State  \
0     [laramie]               [wyoming]   
1  [washington]  [district of columbia]   
2       [ogden]                  [utah]   

                              Contributor                 Batch  \
0       [University of Wyoming Libraries]   [wyu_ellison_ver01]   
1   [Library of Congress, Washington, DC]     [dlc_dalek_ver01]   
2  [University of Utah, Marriott Library]  [uuml_kloeden_ver01]   

                                            PDF Link  
0  https://tile.loc.gov/storage-services/service/...  
1  https://tile.loc.gov/storage-services/service/...  
2  https://tile.loc.gov/storage-services/service/...  
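
Note that the extraction step above assumes every result is a newspaper page; other formats on loc.gov may lack fields such as newspaper_title, which would raise a KeyError. If your search might mix formats, a defensive variant using dict.get() is one option. A brief sketch of the idea, using the new_ids_json from the last item processed above:

# Optional: defensive field extraction (a sketch; returns None when a field is missing)
item_info = new_ids_json.get('item', {})
Newspaper_Title = item_info.get('newspaper_title')
Issue_Date = item_info.get('date')
LCCN = item_info.get('number_lccn')
pdf = new_ids_json.get('resource', {}).get('pdf')
print(Newspaper_Title, Issue_Date, LCCN, pdf)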

Export Metadata of Downloads to a CSV File.#

Before running the code, change MetadataFileName to the desired file name.


  1. Run the code below.

  2. When the script is complete, the CSV file will be found in your “saveTo” location.

# Set File Name. Make sure to rename the file so it doesn't overwrite a previous one!
filename = 'MetadataFileName'

print('\nSuccess! Please check your saveTo location to see the saved csv file. See Preview Below:\n')

metadata_dataframe = pd.DataFrame(item_metadata_list)
metadata_dataframe.to_csv(os.path.join(saveTo, filename + '.csv'))
metadata_dataframe
Success! Please check your saveTo location to see the saved csv file. See Preview Below:
Newspaper Title Issue Date Page Number LCCN City State Contributor Batch PDF Link
0 [The Laramie Republican] 1922-12-27 5 [sn92066979] [laramie] [wyoming] [University of Wyoming Libraries] [wyu_ellison_ver01] https://tile.loc.gov/storage-services/service/...
1 [Evening star.] 1922-12-26 22 [sn83045462] [washington] [district of columbia] [Library of Congress, Washington, DC] [dlc_dalek_ver01] https://tile.loc.gov/storage-services/service/...
2 [The Ogden standard-examiner.] 1922-12-17 12 [sn85058393] [ogden] [utah] [University of Utah, Marriott Library] [uuml_kloeden_ver01] https://tile.loc.gov/storage-services/service/...
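
To confirm the export worked, or to pick the data back up in a later session, the CSV can be read back into a DataFrame. A small optional sketch, assuming the saveTo and filename values used above:

# Optional: read the exported CSV back in to verify it saved correctly (a sketch)
check_df = pd.read_csv(os.path.join(saveTo, filename + '.csv'), index_col=0)
print(check_df.head())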