LoC Data Package Tutorial: United States Elections, Web Archives Data Package#

version 2.0

This notebook demonstrates basic usage of Python for interacting with data packages from the Library of Congress, via the United States Elections, Web Archives Data Package, which is derived from the Library’s United States Elections Web Archive. We will:

  1. Output data package summary

  2. Query the metadata in the data package

  3. Filter and download CDX index files, analyze text

Prerequisites#

In order to run this notebook, please follow the instructions listed in this directory’s README.

Output data package summary#

First, we will select the United States Elections, Web Archives Data Package and output a summary of its files.

import ast  # For reading structured data from metadata.csv
import pandas as pd  # For reading, manipulating, and displaying data
import requests  # For retrieving online files
import sys # For general system tasks

from helpers import get_file_stats, make_request

# Set general variables we'll use throughout
DATA_URL = 'https://data.labs.loc.gov/us-elections/' # Base URL of this data package
PYTHON_VERSION = sys.version.split('|')[0] # We will use this in our request headers
HEADERS = { # This allows us to declare ourselves to Library of Congress servers
    'User-Agent':f'https://github.com/LibraryOfCongress/data-exploration/blob/master/Data Packages/us-elections.ipynb : 2.0 (python : {PYTHON_VERSION})'
    } 

# Download the file manifest
file_manifest_url = f'{DATA_URL}manifest.json'
is_blocked, response = make_request(file_manifest_url, json=True)
if response is None:
    print(f'There was an error retrieving the manifest file at {file_manifest_url}')
    sys.exit(1)  # The rest of the notebook depends on the manifest, so stop here
files = [dict(zip(response["cols"], row)) for row in response["rows"]] # zip columns and rows

# Convert to Pandas DataFrame and show stats table
stats = get_file_stats(files)
pd.DataFrame(stats)
FileType Count Size
0 .gz 394,950 227.8GB
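
Each item in files is a dictionary keyed by the manifest’s column names, so you can inspect a single entry directly (the exact keys depend on the manifest):

# Peek at the first manifest entry
files[0]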

Query the metadata in the data package#

Next we will download this data package’s metadata (we’ll use metadata.json, which carries the same records as metadata.csv), print a summary of various values, and demonstrate filtering options.

The metadata.csv file lists all of the US election political candidate websites that have been collected as part of the United States Elections Web Archive and that are expected to be indexed in this data package’s CDX index files. To read more about this data package’s scope, see its README.

Because the CDX index files also cover content beyond the collection’s candidate sites, the metadata.csv file can be used to target content from political candidate website domains specifically.

metadata_url = f'{DATA_URL}metadata.json'
is_blocked, response = make_request(metadata_url, headers=HEADERS)
if response is None:
    print(f'There was an error retrieving the metadata file at {metadata_url}')
    sys.exit(1)  # The rest of the notebook depends on the metadata, so stop here
data = response.json()

metadata_df = pd.DataFrame(data)

print(f'Loaded metadata file with {len(metadata_df):,} entries.')
Loaded metadata file with 13,388 entries.
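
If you prefer to work with the CSV directly, pandas can also read it over HTTP. This is a minimal sketch, assuming the package serves metadata.csv alongside metadata.json at the base URL; note that the list-like columns arrive as strings, which the ast.literal_eval step below handles:

# Assumption: the CSV version lives at f'{DATA_URL}metadata.csv'
csv_metadata_df = pd.read_csv(f'{DATA_URL}metadata.csv', dtype=str)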

Next, let’s print the DataFrame’s available properties (its columns).

# metadata_df = pd.read_csv(r"C:\Users\rtrent\git\lcwa-election-datasets\metadata\full_metadata_2000-2016.csv", dtype=str) # just for testing
print(', '.join(metadata_df.columns.to_list()))
item_id, item_title, website_url, website_id, website_scopes, collection, website_elections, website_parties, website_places, website_districts, website_thumbnail, website_start_date, website_end_date, item_all_years, website_all_years, mods_url, access_condition

Let’s check the campaign years represented in metadata.csv.

collections = metadata_df['collection'].dropna().unique()
years = [collection.split(', ')[1] for collection in collections]
years.sort()
years
['2000', '2002', '2004', '2006', '2008', '2010', '2012', '2014', '2016']
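
As a quick example of the filtering options mentioned above, we can narrow the DataFrame to a single campaign year by matching the exact collection string:

# Example: just the websites collected for the 2000 election cycle
metadata_2000 = metadata_df[metadata_df['collection'] == 'United States Elections, 2000']
print(f'{len(metadata_2000):,} metadata rows for the 2000 collection')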

Interpreting the metadata fields#

The fields are defined in this package’s README. Each row is a particular website collected for a specific candidate in a single election year.

Let’s look at an example row to understand how to interpret the fields. We’ll write out a paragraph describing our example row. We’ll look at row #3460 (which we happen to know represents the only candidate in metadata.csv to have campaigned in two races in the same year under different parties):

# First, let's make sure that our dataframe columns containing lists are interpreted correctly.
list_columns = ['website_elections', 'website_parties', 'website_places', 'website_districts',
                'item_all_years', 'website_all_years', 'website_scopes']
for column in list_columns:
    metadata_df[column] = metadata_df[column].apply(ast.literal_eval)
row = 3460 # You can change this row number

# We'll grab all the info we need from our row.
record = metadata_df.iloc[row]
item_title = record['item_title']
website_url = record['website_url']
collection = record['collection']
candidate_name = item_title.split('-')[1].strip()
year = collection.split(',')[1].strip()
website_elections = record['website_elections']
campaign_count = len(website_elections)
website_parties = record['website_parties']
website_places = record['website_places']
website_districts = record['website_districts']
website_all_years = sorted(record['website_all_years'])
item_all_years = sorted(record['item_all_years'])
item_id = record['item_id']
mods_url = record['mods_url']

# Now we'll plug those variables into our sentences.
print(f'Record #{row} in the metadata.csv is: {website_url}, from the collection "{collection}".')
print(f'This row represents the website in {year}, used for campaign(s) of the candidate: {candidate_name}.') 
print(f'In {year}, this candidate used this website in {campaign_count} campaign(s):')
for i in range(campaign_count):
    house_district = website_districts[i] if website_districts[i] is not None else ''
    print(f'  {i}. {website_elections[i]} | {website_parties[i]} | {website_places[i]} | {house_district}')
if len(website_all_years) > 1:
    other_years = sorted(set(website_all_years) - {int(year)})
    print(f'This website ({website_url}) was also used for these other campaign year(s) for {candidate_name}: {other_years}')
print(f'In total, this and possibly other websites were collected for this candidate in the following year(s): {list(set(item_all_years))}')
print(f'The loc.gov item record for {candidate_name} campaign sites can be viewed at {item_id}, and its MODS record can be viewed at {mods_url}.')

# The next line displays our dataframe as a table. Let's set it to show up to 300 characters in each cell
pd.options.display.max_colwidth = 300

print('Here is how this row appears in `metadata.csv`:')                       
metadata_df[row:row+1]
Record #3460 in the metadata.csv is: http://www.usmjp.com/, from the collection "United States Elections, 2012".
This row represents the website in 2012, used for campaign(s) of the candidate: Cris Ericson.
In 2012, this candidate used this website in 2 campaign(s):
  0. United States. Congress. Senate | U.S. Marijuana Party | Vermont | 
  1. Vermont. Governor | Independent candidates | Vermont | 
In total, this and possibly other websites were collected for this candidate in the following year(s): [2018, 2002, 2004, 2006, 2008, 2010, 2012]
The loc.gov item record for Cris Ericson campaign sites can be viewed at http://www.loc.gov/item/lcwaN0002501/, and its MODS record can be viewed at https://tile.loc.gov/storage-services/service/webcapture/project_1/mods/united-states-elections-web-archive/lcwaN0002501.xml.
Here is how this row appears in `metadata.csv`:
item_id item_title website_url website_id website_scopes collection website_elections website_parties website_places website_districts website_thumbnail website_start_date website_end_date item_all_years website_all_years mods_url access_condition
3460 http://www.loc.gov/item/lcwaN0002501/ Official Campaign Web Site - Cris Ericson http://www.usmjp.com/ 3415 [http://crisericson.com, http://vermontnews.livejournal.com, http://www.myspace.com/usmjp2010, http://crisericson2010.blogspot.com] United States Elections, 2012 [United States. Congress. Senate, Vermont. Governor] [U.S. Marijuana Party, Independent candidates] [Vermont, Vermont] [None, None] http://cdn.loc.gov/service/webcapture/project_1/thumbnails/lcwaS0003415.jpg 20121003 20121019 [2002, 2004, 2004, 2006, 2008, 2010, 2012, 2012, 2018, 2018] [2012] https://tile.loc.gov/storage-services/service/webcapture/project_1/mods/united-states-elections-web-archive/lcwaN0002501.xml None

Now let’s look at all the Vermont gubernatorial candidates represented in this data package.

# We'll create a function to generate summary information about a given type of election

def election_summary(election_type):
    websites_by_year = metadata_df[metadata_df['website_elections'].apply(lambda elections: any(election_type == election for election in elections))]
    candidates = websites_by_year['item_title'].unique()
    websites = websites_by_year['website_url'].unique()
    years = [collection.split(',')[1].strip() for collection in websites_by_year['collection'].unique()]
    min_year = min(years) if years else 'n/a'
    max_year = max(years) if years else 'n/a'
    multi_year_websites = websites_by_year[websites_by_year['website_all_years'].str.len()>1]['website_url'].unique()
    print(f'Found in metadata.csv: {len(websites)} unique campaign websites for {len(candidates)} "{election_type}" candidates, ranging from years {min_year} - {max_year}.')
    print(f'{len(multi_year_websites)} of these websites were used multiple years.')

election_summary('Vermont. Governor')
Found in metadata.csv: 0 unique campaign websites for 0 "Vermont. Governor" candidates, ranging from years n/a - n/a.
0 of these websites were used multiple years.

Off-year elections aren’t represented in this data package even though they are in the United States Elections Web Archive online collection. This is due to the way that content is organized in CDX files.

For example, Virginia’s gubernatorial elections are off-year elections (in odd-numbered years), and thus are not represented in this data package even though they are in the online collection.

After you run the next cell, try replacing “Virginia. Governor” with something like “United States. Congress. Senate”, “United States. President”, or “Michigan. Governor”.

election_summary('Virginia. Governor')
Found in metadata.csv: 0 unique campaign websites for 0 "Virginia. Governor" candidates, ranging from years n/a - n/a.
0 of these websites were used multiple years.

Filter and Download CDX index files, analyze text#

The bulk of this dataset consists of CDX files. In this section, we’ll retrieve a small sample of those CDX files and analyze the text of the web resources they index.

Here we will define the functions in the order that they are used in this section of the notebook.

from bs4 import BeautifulSoup # Used to process the scraped content
import gzip # Used to decompress the gzipped CDX files
from sklearn.feature_extraction.text import CountVectorizer # Used to create a matrix out of a bag of words
from time import sleep # Used to provide a slight pause between requests


WAYBACK_BASE_URL = 'https://webarchive.loc.gov/all/'
WAYBACK_LEGACY_BASE_URL = 'https://webarchive.loc.gov/legacy/'

def gather_files_from_manifest(year: str):
    """
    Function that takes a year (YYYY) as an argument.
    The function collects the locations of the CDX files 
    listed by the provided year's manifest.
    
    Args:
        year (str): String of a year YYYY.

    Returns:
        :obj:`list` of :obj:`str` of individual CDX file URLs. In case
        of error, returns an empty list.
    """
    
    election_years = [
        "2000",
        "2002",
        "2004",
        "2006",
        "2008",
        "2010",
        "2012",
        "2014",
        "2016"
    ]

    if year not in election_years:
        return []
    manifest_url = f"{DATA_URL}by-year/{year}/manifest.html"
    try:
        is_blocked, response = make_request(manifest_url)
        soup = BeautifulSoup(response.content, 'html.parser')
        cdx_files = [link.get('href') for link in soup.find_all('a')]
        return cdx_files
    except Exception:
        print(f'There was an error retrieving and/or parsing {manifest_url}.')
        return []


def fetch_file(cdx_url: str):
    """
    Function that takes a `String` as an argument.
    The `cdx_url` is a singular item from the result
    of the `gather_files_from_manifest` function.
    The function fetches the gzipped CDX file, decompresses it,
    splits it on the newlines, and removes the header. 
    Args:
        cdx_url (str): Individual item from the result of
        the `gather_files_from_manifest` function.

    Returns:
        :obj:`list` of :obj:`str` of individual CDX lines, each representing
        a web object. Returns an empty list in case of errors.
    """
    # Get the CDX file. For a production script, you'll want to build in additional error handling.
    try:
        response = requests.get(cdx_url, headers=HEADERS)
    except requests.exceptions.RequestException:
        response = None

    # Here we decompress the gzipped CDX, decode it, split it on newlines, and remove the header
    try:
        cdx_content = gzip.decompress(response.content).decode('utf-8').split('\n')[1:]
        return cdx_content
    except Exception:
        print(f'There was an error decompressing or parsing the CDX file: {cdx_url}. This file will be skipped.')
        return []


def create_dataframe(data: list):
    """
    Function that takes a :obj:`list` of :obj:`str` as an argument.
    `data` is the contents of the CDX file split on newlines. 
    This function takes `data`, applies a schema to it, and transforms it
    into a `pandas.DataFrame`.
    Args:
        data (list): :obj:`list` of :obj:`str`. Each item is a line from
        a CDX file or group of files.

    Returns:
        A `pandas.DataFrame` of a CDX file or group of files. In case of error,
        a blank pandas.DataFrame is returned.
    """
    schema = [
        'urlkey',
        'timestamp',
        'original',
        'mimetype',
        'statuscode',
        'digest',
        'redirect',
        'metatags',
        'file_size',
        'offset',
        'warc_filename'
    ]
    try:
        _data = [row.split() for row in data]
        df = pd.DataFrame(_data, columns=schema)
        return df
    except Exception:
        print('There was an error converting the data into a dataframe. Returning a blank dataframe.')
        return pd.DataFrame()

def create_dataframe_from_manifest(manifest: list):
    """
    Function that takes a :obj:`list` of :obj:`str` as an argument.
    The `manifest` is a list of all the individual CDX files found
    from an Election year's or group of Election years' HTML manifest.
    This function loops through each file, transforms it into a `pandas.DataFrame`
    by calling the `create_dataframe` function, concats the DataFrames together,
    and then returns the Dataframe representing the entire manifest.
    Args:
        manifest (list): :obj:`list` of :obj:`str` of all the individual CDX files found
    from an Election year's or group of Election years' HTML manifest.

    Returns:
        `pandas.DataFrame` representing every file present in the `manifest`.
    """
    frames = []
    for cdx_url in manifest:
        cdx = fetch_file(cdx_url)
        if len(cdx) == 0:
            continue
        try:
            frames.append(create_dataframe(cdx))
        except Exception:
            print(f'There was an error converting {cdx_url} to a dataframe. This may be due to a malformed CDX file. This data will be skipped.')
    return pd.concat(frames) if frames else pd.DataFrame()

def fetch_text(row: pd.Series):
    """
    Function that takes a `pandas.Series`, which is a single row 
    from a `pandas.DataFrame`, as an argument.
    The function uses the timestamp and original fields from the `row`
    to request the specific resource from OpenWayback. Once the resource is
    fetched, the Wayback banner div elements are removed so as not to detract
    from the words in the resource itself.
    Args:
        row (pandas.Series): `pandas.Series`, which is a single row 
    from a `pandas.DataFrame`.

    Returns:
        `String` of the resource's text. If an error is encountered, returns 
        an empty string.
    """
    playback_url = row['original']
    if (row['timestamp'] is None) or (row['timestamp']==''):
        print(f'CDX row is missing timestamp. Not retrieving text for {playback_url}')
        return ''
    timestamp = row['timestamp']
    if timestamp.startswith('2000'):
        base_url = WAYBACK_LEGACY_BASE_URL
    else:
        base_url = WAYBACK_BASE_URL
    is_blocked, response = make_request(f"{base_url}{timestamp}/{playback_url}", pause=15)
    if response is None:
        print(f'Error retrieving {base_url}{timestamp}/{playback_url}. Skipping full text for this document.')
        return ''
    if is_blocked:
        print(f'429 too many requests. Skipping: {base_url}{timestamp}/{playback_url}')
        return 429
    try:
        soup = BeautifulSoup(response.text, 'html.parser')
        # Remove the Wayback banner divs so they don't pollute the resource's own text
        for el in soup.find_all('div', {'id': ['wm-maximized', 'wm-minimized']}):
            el.extract()
        return soup.text
    except Exception:
        print(f'Error parsing full text from {base_url}{timestamp}/{playback_url}. Skipping full text for this document.')
        return ''

def fetch_all_text(df: pd.DataFrame):
    """
    Function that takes a `pandas.Dataframe` as an argument.
    This is the most complicated function here. The function first cleans the
    `df` that was passed in by dropping all the rows that do not have a value in the
    mimetype field. Then, it drops all the duplicate digests, which removes resources
    that are exactly the same. Finally, it only returns rows that have 'text' in the 
    mimetype field and have a '200' or '-' HTTP status response.
    Once the `df` is cleaned, each resource's text is fetched from the Wayback,
    transformed into a matrix using `sklearn.CountVectorizer`, and the function returns a `pandas.DataFrame`
    of words and their occurrence per resource. A politeness pause of 15 seconds is added between Wayback requests.
    Args:
        df (pandas.DataFrame): `pandas.DataFrame` representing web resources as CDX lines.

    Returns:
        `pandas.Dataframe` of the resource's words tabulated per web resource.
    """
    countvec = CountVectorizer(ngram_range=(1,1), stop_words='english')
    unprocessed_bag_of_words = []
    text_df = df\
        .dropna(subset=['mimetype'])\
        .drop_duplicates(subset=['digest'])\
        .query(
            '(statuscode.str.match("200") or statuscode.str.match("-")) and mimetype.str.contains("text")',
            engine='python'
        )
    for i, row in text_df.iterrows():
        fetched_text = fetch_text(row)
        if fetched_text == 429:
            print('Halting requests for web archives. Received a 429 error from the server, which means too many requests too quickly.')
            break
        unprocessed_bag_of_words.append(fetched_text)
        
    
    processed_bag_of_words = countvec.fit_transform(unprocessed_bag_of_words)
    
    return pd.DataFrame(processed_bag_of_words.toarray(),columns=countvec.get_feature_names_out())

Gathering the list of CDX Files#

The first step is gathering the list of CDX files. To do that, simply call the gather_files_from_manifest function, providing the Election year as an argument.

el00_files = gather_files_from_manifest('2000')

Let’s look at our first five files:

el00_files[:5]
['https://data.labs.loc.gov/us-elections/by-year/2000/cdx/unique.20010415093936.surt.cdx.gz',
 'https://data.labs.loc.gov/us-elections/by-year/2000/cdx/unique.20010415094743.surt.cdx.gz',
 'https://data.labs.loc.gov/us-elections/by-year/2000/cdx/unique.20010415095044.surt.cdx.gz',
 'https://data.labs.loc.gov/us-elections/by-year/2000/cdx/unique.20010415095244.surt.cdx.gz',
 'https://data.labs.loc.gov/us-elections/by-year/2000/cdx/unique.20010415095459.surt.cdx.gz']

Inspect a sample CDX File#

Next, we’ll demonstrate what a particular CDX File looks like. We’ll look at the first five lines of our first CDX from 2000.

cdx = fetch_file(el00_files[0])
cdx[:5]
['com,voter)/home/candidates/info/0,1214,2-11880-,00.html 20001002182124 http://www.voter.com:80/home/candidates/info/0,1214,2-11880-,00.html text/html 200 FYXP43MQC5GVBQMVK3ETWSPXUBR5ICKP - - 5051 149 unique.20010415093936.arc.gz',
 'com,voter)/home/candidates/info/0,1214,2-18885-,00.html 20001002185814 http://www.voter.com:80/home/candidates/info/0,1214,2-18885-,00.html text/html 200 H6QN5ZULJ6YZP756QNVM3YXKXC7HZUIL - - 4829 5200 unique.20010415093936.arc.gz',
 'com,voter)/home/candidates/info/0,1214,2-18880-,00.html 20001002185815 http://www.voter.com:80/home/candidates/info/0,1214,2-18880-,00.html text/html 200 HFG67JI4KBPHFXMQE5DJRHF3OEKKBOO6 - - 4794 10029 unique.20010415093936.arc.gz',
 'com,voter)/home/officials/general/1,1195,2-2467-,00.html 20001002185815 http://voter.com:80/home/officials/general/1,1195,2-2467-,00.html text/html 200 HZJFLTHZD5MGEPJS2WVGBHQRQUPFBE3O - - 5282 14823 unique.20010415093936.arc.gz',
 'com,voter)/home/candidates/info/0,1214,2-18886-,00.html 20001002185816 http://www.voter.com:80/home/candidates/info/0,1214,2-18886-,00.html text/html 200 QAM7JW7S4CNYMP6HLA6DASOXTO2SIGWO - - 4823 20105 unique.20010415093936.arc.gz']
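
Each CDX line is a space-delimited record describing one web object. To see how the fields line up, we can zip the first line against the same 11-field schema that the create_dataframe function defined above uses:

# Map the first CDX line onto the 11-field schema
schema = ['urlkey', 'timestamp', 'original', 'mimetype', 'statuscode', 'digest',
          'redirect', 'metatags', 'file_size', 'offset', 'warc_filename']
dict(zip(schema, cdx[0].split()))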

Now, here is the same CDX file transformed into a DataFrame:

cdx_df = create_dataframe(cdx)
cdx_df
urlkey timestamp original mimetype statuscode digest redirect metatags file_size offset warc_filename
0 com,voter)/home/candidates/info/0,1214,2-11880-,00.html 20001002182124 http://www.voter.com:80/home/candidates/info/0,1214,2-11880-,00.html text/html 200 FYXP43MQC5GVBQMVK3ETWSPXUBR5ICKP - - 5051 149 unique.20010415093936.arc.gz
1 com,voter)/home/candidates/info/0,1214,2-18885-,00.html 20001002185814 http://www.voter.com:80/home/candidates/info/0,1214,2-18885-,00.html text/html 200 H6QN5ZULJ6YZP756QNVM3YXKXC7HZUIL - - 4829 5200 unique.20010415093936.arc.gz
2 com,voter)/home/candidates/info/0,1214,2-18880-,00.html 20001002185815 http://www.voter.com:80/home/candidates/info/0,1214,2-18880-,00.html text/html 200 HFG67JI4KBPHFXMQE5DJRHF3OEKKBOO6 - - 4794 10029 unique.20010415093936.arc.gz
3 com,voter)/home/officials/general/1,1195,2-2467-,00.html 20001002185815 http://voter.com:80/home/officials/general/1,1195,2-2467-,00.html text/html 200 HZJFLTHZD5MGEPJS2WVGBHQRQUPFBE3O - - 5282 14823 unique.20010415093936.arc.gz
4 com,voter)/home/candidates/info/0,1214,2-18886-,00.html 20001002185816 http://www.voter.com:80/home/candidates/info/0,1214,2-18886-,00.html text/html 200 QAM7JW7S4CNYMP6HLA6DASOXTO2SIGWO - - 4823 20105 unique.20010415093936.arc.gz
... ... ... ... ... ... ... ... ... ... ... ...
1096875 com,voter)/home/candidates/info/0,1214,2-9118-,00.html 20001002183052 http://www.voter.com:80/home/candidates/info/0,1214,2-9118-,00.html - - 3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ - - 118 145323588 unique.20010415093936.arc.gz
1096876 com,voter)/home/candidates/info/0,1214,2-9115-,00.html 20001002183052 http://www.voter.com:80/home/candidates/info/0,1214,2-9115-,00.html - - 3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ - - 118 145323706 unique.20010415093936.arc.gz
1096877 com,voter)/home/candidates/info/0,1214,2-15361-,00.html 20001002182249 http://www.voter.com:80/home/candidates/info/0,1214,2-15361-,00.html - - 3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ - - 119 145323824 unique.20010415093936.arc.gz
1096878 com,voter)/home/candidates/info/0,1214,2-12994-,00.html 20001002181842 http://www.voter.com:80/home/candidates/info/0,1214,2-12994-,00.html text/html 404 UDSH36NBYWO2X73LNMX2LEHLNQ7FYXHZ - - 351 145323943 unique.20010415093936.arc.gz
1096879 None None None None None None None None None None None

1096880 rows × 11 columns
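
The final row of None values comes from the trailing newline at the end of the file. If you’d like to drop such fully empty rows, one option is:

# Drop rows where every column is missing (the trailing-newline artifact)
cdx_df = cdx_df.dropna(how='all')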

Election 2000 DataFrame#

Now we’ll create a DataFrame from the first fifteen CDX files in the 2000 election subset. To do that, we’ll use the create_dataframe_from_manifest function, which loops over the CDX files and calls create_dataframe on each one programmatically, instead of manually and individually as we did above.

If we had more time or were working on a more powerful computer, we’d pull from all of the files in the 2000 subset, but for now we’ll just pull from the first fifteen.

el00_df = create_dataframe_from_manifest(el00_files[0:15])
el00_df
urlkey timestamp original mimetype statuscode digest redirect metatags file_size offset warc_filename
0 com,voter)/home/candidates/info/0,1214,2-11880-,00.html 20001002182124 http://www.voter.com:80/home/candidates/info/0,1214,2-11880-,00.html text/html 200 FYXP43MQC5GVBQMVK3ETWSPXUBR5ICKP - - 5051 149 unique.20010415093936.arc.gz
1 com,voter)/home/candidates/info/0,1214,2-18885-,00.html 20001002185814 http://www.voter.com:80/home/candidates/info/0,1214,2-18885-,00.html text/html 200 H6QN5ZULJ6YZP756QNVM3YXKXC7HZUIL - - 4829 5200 unique.20010415093936.arc.gz
2 com,voter)/home/candidates/info/0,1214,2-18880-,00.html 20001002185815 http://www.voter.com:80/home/candidates/info/0,1214,2-18880-,00.html text/html 200 HFG67JI4KBPHFXMQE5DJRHF3OEKKBOO6 - - 4794 10029 unique.20010415093936.arc.gz
3 com,voter)/home/officials/general/1,1195,2-2467-,00.html 20001002185815 http://voter.com:80/home/officials/general/1,1195,2-2467-,00.html text/html 200 HZJFLTHZD5MGEPJS2WVGBHQRQUPFBE3O - - 5282 14823 unique.20010415093936.arc.gz
4 com,voter)/home/candidates/info/0,1214,2-18886-,00.html 20001002185816 http://www.voter.com:80/home/candidates/info/0,1214,2-18886-,00.html text/html 200 QAM7JW7S4CNYMP6HLA6DASOXTO2SIGWO - - 4823 20105 unique.20010415093936.arc.gz
... ... ... ... ... ... ... ... ... ... ... ...
338148 org,ctgop)/county/tolland.htm 20001006073643 http://www.ctgop.org:80/county/tolland.htm - - 3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ - - 101 79251104 unique.20010415101811.arc.gz
338149 org,ctgop)/county/tolland.htm 20001005073549 http://www.ctgop.org:80/county/tolland.htm - - 3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ - - 101 79251205 unique.20010415101811.arc.gz
338150 org,ctgop)/county/tolland.htm 20001004073505 http://www.ctgop.org:80/county/tolland.htm - - 3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ - - 101 79251306 unique.20010415101811.arc.gz
338151 org,ctgop)/county/tolland.htm 20001003073437 http://www.ctgop.org:80/county/tolland.htm text/html 200 TIRWMHRDJ5L22TJWCXVA6TNU5YOB65SW - - 1421 79251407 unique.20010415101811.arc.gz
338152 None None None None None None None None None None None

1541579 rows × 11 columns

Mimetypes#

For this exercise, we’re going to take a brief look at the mimetypes. First, we’ll select the mimetype column in the DataFrame and tally its values by calling value_counts, which is a pandas method.

el00_mimetypes = el00_df['mimetype'].value_counts()
el00_mimetypes
mimetype
-                           1493256
text/html                     43969
image/jpeg                     2756
image/gif                      1311
application/pdf                 122
text/plain                       59
image/bmp                        28
audio/x-pn-realaudio             18
application/msword               11
text/css                          4
image/png                         4
application/octet-stream          3
application/x-javascript          3
video/quicktime                   3
application/zip                   2
audio/x-wav                       2
audio/midi                        2
text/xml                          2
application/mac-binhex40          1
audio/x-aiff                      1
image/tiff                        1
application/x-tar                 1
application/x-pointplus           1
audio/x-midi                      1
video/x-msvideo                   1
audio/basic                       1
audio/x-mpeg                      1
Name: count, dtype: int64
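
The '-' values are CDX lines with no recorded mimetype. Since the text analysis later in this section only uses text-based resources, we can count how many rows would qualify:

# Count rows whose mimetype contains "text" (na=False skips missing values)
text_rows = el00_df[el00_df['mimetype'].str.contains('text', na=False)]
print(f'{len(text_rows):,} of {len(el00_df):,} rows have a text-based mimetype')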

Filtering by domain#

Let’s now look at the domains and subdomains represented in the 2000 CDX files. We’ll ignore the “www” part of URLs, but otherwise retain subdomains.

import re # For using regular expressions to remove parts of URLs
from urllib.parse import urlparse # For locating the base domain in URLs

def get_domains(urls):
    if urls is None:
        return []
    if isinstance(urls, str):
        urls = [urls]
    domains = set()
    for url in urls:
        domain = urlparse(url).netloc
        # Skip empty or non-string network locations
        if not domain or isinstance(domain, bytes):
            continue
        # Remove www., www1., etc.
        domain = re.sub(r"www\d?\.(.*)", r"\1", domain)
        # Remove ports, as in some-website.com:80
        domain = domain.split(':')[0]
        domains.add(domain)
    return list(domains)

el00_df['domains'] = el00_df['original'].apply(get_domains).str[0]
for cdx_domain in el00_df['domains'].unique():
    print(cdx_domain)
voter.com
whitehouse.gov
hayes.voter.com
freespeech.org
cnn.com
freedomchannel.com
essential.org
fsudemocrats.org
commoncause.org
democrats.org
uspolitics.about.com
enterstageright.com
reason.com
usconservatives.about.com
idaho-democrats.org
usliberals.about.com
www10.nytimes.com
election.voter.com
graphics.nytimes.com
nydems.org
adams.voter.com
mockelection.com
rpk.org
dnet.org
commonconservative.com
beavoter.org
beavoter.com
iowademocrats.org
forums.about.com
thiselection.com
indiaelection.com
server1.dscc.org
search.algore2000.com
forums.nytimes.com
azlp.org
intellectualcapital.com
prospect.org
grassroots.com
rnc.org
lwv.org
mn-politics.com
newwestpolitics.com
popandpolitics.com
washingtonpost.com
nacdnet.org
lp.org
algore2000.com
crlp.org
harrybrowne2000.org
ga.lp.org
emilyslist.org
ncgop.org
arkdems.org
cbdnet.org
keyes-grassroots.com
faqvoter.com
americanprospect.org
partners.nytimes.com
indems.org
ageofreason.com
vanishingvoter.org
nyc.dnet.org
robots.cnn.com
informedvoter.com
virginiapolitics.com
newpolitics.com
nan
md.lp.org
ca-dem.org
beachdemocrats.org
ohiodems.org
maryland.reformparty.org
muscatinedemocrats.org
9thdistgagop.org
rcdnet.org
azgop.org
maricopagop.org
kansas.reformparty.org
newjersey.reformparty.org
california.reformparty.org
timeline.reformparty.org
algop.org
pelicanpolitics.com
espanol.voter.com
gorelieberman.com
election.com
ceednet.org
followthemoney.org
debates.org
cagop.org
wsrp.org
indgop.org
members.freespeech.org
schoolelection.com
convention.texasgop.org
cal.votenader.org
candidate.grassroots.com
1-877-leadnow.com
madison.voter.com
sierraclub.org
mt.nacdnet.org
ma.lwv.org
irchelp.org
calvoter.org
njdems.org
sfvoter.com
vademocrats.org
reformparty.org
missouridems.org
pa.lwv.org
akdemocrats.org
njlp.org
hagelin.org
keyes2000.org
tray.com
nrsc.org
deldems.org
nrcc.org
ksdp.org
kansasyoungdemocrats.org
washington.reformparty.org
dems2000.com
arkgop.com
scdp.org
plp.org
votenader.org
votenader.com
northcarolina.reformparty.org
ca.lwv.org
ks.nacdnet.org
txdemocrats.org
politics1.com
gagop.org
slp.org
gwbush.com
akrepublicans.org
wi.nacdnet.org
green.votenader.org
rpv.org
fec.gov
nytimes.com
naacp.org
hawaiidemocrats.org
nygop.org
gopatgo2000.org
democratsabroad.org
pub.whitehouse.gov
archive.lp.org
gop-mn.org
migop.org
ca.lp.org
monmouthlp.org
ncdp.org
cologop.org
mi.lp.org
cobbdemocrats.org
tx.lp.org
campaignoffice.com
freetrial.campaignoffice.com
calendar.rnc.org
rireformparty.org
ehdemocrats.org
poll1.debates.org
nevadagreenparty.org
newvoter.com
mi.lwv.org
georgia.reformparty.org
delaware.reformparty.org
stonewalldfl.org
santacruzlp.org
forums.hagelin.org
forum.hagelin.org
iowagop.org
ohiogop.org
sddemocrats.org
skdemocrats.org
wisdems.org
sfgreenparty.org
il.lp.org
rtumble.com
ctdems.org
alaskarepublicans.com
detroitnaacp.org
greenparty.org
ndgop.com
nh-democrats.org
rosecity.net
sandiegovoter.com
montanagop.org
dc.reformparty.org
greenparties.org
mainegop.com
stmarysdemocrats.org
comalcountydemocrats.org
masonforrnc.org
sblp.org
chesapeakedemocrats.org
tejanodemocrats.org
connecticut.georgewbush.com
students.georgewbush.com
youngprofessionals.georgewbush.com
maine.georgewbush.com
latinos.georgewbush.com
veterans.georgewbush.com
africanamericans.georgewbush.com
missouri.georgewbush.com
agriculture.georgewbush.com
mississippi.georgewbush.com
minnesota.georgewbush.com
arizona.georgewbush.com
northcarolina.georgewbush.com
virginia.georgewbush.com
kentucky.georgewbush.com
texas.georgewbush.com
lvvlwv.org
kansassenatedemocrats.org
nhgop.org
nebraskademocrats.org
southcarolina.reformparty.org
tndemocrats.org
fcncgop.org
padems.com
gore-2000.com
union.arkdems.org
illinois.reformparty.org
nevadagop.org
rhodeisland.reformparty.org
massdems.org
allencountydemocrats.org
mogop.org
oklahoma.reformparty.org
oklp.org
speakout.com
windemocrats.org
washingtoncountydemocrats.org
salinecodemocrats.org
njgop.org
sddp.org
pennsylvania.reformparty.org
lademo.org
allgore.com
web.democrats.org
pagop.org
library.whitehouse.gov
docs.whitehouse.gov
idaho.reformparty.org
alaska.net
georgybush.com
rpof.org
publishing1.speakout.com
de.lp.org
mainedems.org
clarkgop.com
kansashousedemocrats.org
georgiaparty.com
la.lp.org
ny.lp.org
nebraska.reformparty.org
maine.reformparty.org
indiana.reformparty.org
myweb.clark.net
clark.net
ga.lwv.org
traviscountydemocrats.org
cheshiredemocrats.org
exchange.nrcc.org
growthelp.org
sbdemocrats.org
montana.reformparty.org
politicalshop.com
massgop.com
ohio.reformparty.org
scgop.com
wvgop.org
c-span.org
westvirginia.reformparty.org
wwwalgore2000.com
texas.reformparty.org
florida-democrats.org
delawaregop.com
publicrelations.reformparty.org
nj.nacdnet.org
ohionlp.org
communications.reformparty.org
newhampshire.reformparty.org
aladems.org
arkansas.reformparty.org
avlp.org
vtdemocrats.org
jackgreenlp.org
waynegop.org
mi-democrats.com
13thdistrictdems.org
rules.reformparty.org
negop.org
dscc.org
mccain2000.com
oclp.org
ilgop.org
hawaii.reformparty.org
arch-cgi.lp.org
crnc.org
sc.ca.lp.org
8thcd.vademocrats.org
foreignpolicy2000.org
bradely.campaignoffice.com
wwwsanderson.campaignoffice.com
florida.reformparty.org
al.lp.org
dpo.org
oahudemocrats.org
columbia.arkdems.org
kentucky.reformparty.org
phoenixnewtimes.com
purepolitics.com
concernedvoter.com
iowa.reformparty.org
wyoming.reformparty.org
harriscountygreenparty.org
american-politics.com
issues.reformparty.org
nysrtlp.org
stpaul.mn.lwv.org
arlingtondemocrats.org
okgop.com
utahgop.org
utdemocrats.org
mississippi.reformparty.org
plymouth.ma.nacdnet.org
tennessee.reformparty.org
minnesota.reformparty.org
dpnm.org
georgebush2000.com
vayoungdemocrats.org
northdakota.reformparty.org
stonewalldemocrats.org
virginia.reformparty.org
fastlane.net
youngdemocrats.org
msgop.org
calgop.org
votegrassroots.com
wvdemocrats.com
housedems2000.com
lubbockdemocrats.org
ildems.org
okdemocrats.org
lccdnet.org
fecweb1.fec.gov
trinity.ca.lp.org
ventura.ca.lp.org
3rdcd.vademocrats.org
de.lwv.org
mdgop.org
flgopsenate.campaignoffice.com
bradley.campaignoffice.com
kydems.campaignoffice.com
tx.nacdnet.org
mo.nacdnet.org
texasgop.org
in.rcdnet.org
life.ca.lp.org
victory.texasgop.org
charlestondemocrats.org
wyomingdemocrats.com
nd.nacdnet.org
college.reformparty.org
al.nacdnet.org
nddemnpl.campaignoffice.com
kulick-jackson.campaignoffice.com
wasiluk.campaignoffice.com
hilstrom.campaignoffice.com
schumacher.campaignoffice.com
dfl.org
slawik.campaignoffice.com
markthompson.campaignoffice.com
rest.campaignoffice.com
vigil.campaignoffice.com
graves.campaignoffice.com
mcinnis.campaignoffice.com
hoosier.campaignoffice.com
connor.campaignoffice.com
bernardy.campaignoffice.com
housedflcaucus.campaignoffice.com
goodwin.campaignoffice.com
peaden.campaignoffice.com
kurita.campaignoffice.com
hi.lp.org
mtdemocrats.org
or.nacdnet.org
kcswcd.mo.nacdnet.org
id.nacdnet.org
sd.nacdnet.org
ny.nacdnet.org
yvote2000.com
oh.nacdnet.org
va.nacdnet.org
tn.nacdnet.org
fl.nacdnet.org
ca.nacdnet.org
co.nacdnet.org
ky.lp.org
georgewbush.com
massachusetts.reformparty.org
arizona.reformparty.org
louisiana.reformparty.org
nm.nacdnet.org
tazewell.va.nacdnet.org
aflcio.org
azdem.org
columbiana.oh.nacdnet.org
lacledeswcd.mo.nacdnet.org
reclaimdemocracy.org
ctgop.org
nevada.reformparty.org
in.nacdnet.org
michigan.reformparty.org
newyork.reformparty.org
nc.nacdnet.org
wa.nacdnet.org
ak.nacdnet.org
pa.nacdnet.org
billbradley.com
macdnet.org
lmcd.mt.nacdnet.org
socialdemocrats.org
bexardemocrats.org
alabama.reformparty.org
globalelection.com
wisconsin.reformparty.org
geocities.com
coloradodems.org

As you can see from the list above, the 2000 CDX files include content from a wide range of domains, not limited to political candidate campaign websites.

In the early years of the United States Elections Web Archive, the scope of the collection included websites of political parties, government, advocacy groups, bloggers, and other individuals and groups expressing relevant views. These sites have generally been moved into the Public Policy Topics Web Archive or into the general web archives. However, the CDX files index the content as it was originally captured. The CDX files may also index content from non-candidate resources if candidate sites linked to those resources or embedded that content. Occasionally, other out-of-scope content may also appear in CDX files otherwise dedicated to U.S. elections.

Let’s grab only those lines from the CDX files that match domains from the candidate websites in our metadata.csv file. We’ll include the campaign candidates’ websites themselves, as well as any domains that appear in the scope column. Domains that appear in the scope column are additional URLs that the web archiving crawler was instructed to collect in addition to the campaign website, if the campaign website linked to those URLs. For a more refined description, see this data package’s README.
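
We’ll reuse the get_domains helper defined above. As a quick sanity check, here is how it handles a made-up URL (a hypothetical example, not from the collection):

# The "www." prefix and the port are stripped, leaving the bare domain
get_domains('http://www.example.com:80/index.html')  # -> ['example.com']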

year = 2000

def get_domains_by_year(year):
    year_metadata = metadata_df[metadata_df['collection'].str.contains(year)].copy()
    if len(year_metadata) > 0:
        year_metadata['seeds_domains'] = year_metadata['website_url'].apply(get_domains)
        year_metadata['scope_domains'] = year_metadata['website_scopes'].apply(get_domains)
        year_metadata['all_domains'] = year_metadata['seeds_domains'] + year_metadata['scope_domains']
        all_domains = [item for sublist in year_metadata['all_domains'].dropna() for item in sublist]
        return list(set(all_domains))
    else:
        print(f'Sorry, there were no rows in metadata.csv for content from {year}')
        return []  # Return an empty list so downstream code can still iterate

metadata_domains = get_domains_by_year(str(year))
print(f'Domains from the {str(year)} US Elections collection:')
metadata_domains
Domains from the 2000 US Elections collection:
['algore2000.com',
 'harrybrowne2000.org',
 'gopatgo2000.org',
 'algore.com',
 'keyes2000.org',
 'hagelin.org',
 'forum.hagelin.org']

Now we’re ready to filter. Let’s filter down our sample CDX lines to just those lines that point to the candidate website domains from metadata.csv, listed above. This means we’ll only include CDX rows for domains like algore2000.com and gopatgo2000.org, but not sites like voter.com or whitehouse.gov.

cdx_candidate_domains_el00 = el00_df[
    el00_df['original'].apply(
        lambda url: 
            any(domain in url for domain in metadata_domains) if url 
            else False
    )
]
cdx_candidate_domains_el00
urlkey timestamp original mimetype statuscode digest redirect metatags file_size offset warc_filename domains
25175 com,algore2000,search)/search 20001030063531 http://search.algore2000.com:80/search/ - - 3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ - - 97 7471624 unique.20010415093936.arc.gz search.algore2000.com
26166 com,algore2000,search)/search 20001030053022 http://search.algore2000.com:80/search/ - - 3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ - - 97 7587973 unique.20010415093936.arc.gz search.algore2000.com
49892 com,algore2000,search)/search 20001029053020 http://search.algore2000.com:80/search/ - - 3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ - - 97 10612154 unique.20010415093936.arc.gz search.algore2000.com
73526 com,algore2000,search)/search 20001028053001 http://search.algore2000.com:80/search/ - - 3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ - - 99 13619683 unique.20010415093936.arc.gz search.algore2000.com
97191 com,algore2000,search)/search 20001027053201 http://search.algore2000.com:80/search/ - - 3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ - - 98 16632272 unique.20010415093936.arc.gz search.algore2000.com
... ... ... ... ... ... ... ... ... ... ... ... ...
336264 org,keyes2000)/images/newsimage.jpg 20001003073434 http://keyes2000.org:80/images/newsimage.jpg image/jpeg 200 LWERVVNORJQ6IBZCJ4SBNH26JU6NH3MV - - 13527 76178594 unique.20010415101811.arc.gz keyes2000.org
336611 com,algore2000)/briefingroom/releases/pr_091300_gore_wins_4.html 20001004075816 http://www.algore2000.com:80/briefingroom/releases/pr_091300_Gore_Wins_4.html - - 3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ - - 130 76906140 unique.20010415101811.arc.gz algore2000.com
336612 com,algore2000)/briefingroom/releases/pr_091300_gore_wins_4.html 20001004073516 http://algore2000.com:80/briefingroom/releases/pr_091300_Gore_Wins_4.html - - 3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ - - 127 76906270 unique.20010415101811.arc.gz algore2000.com
336613 com,algore2000)/briefingroom/releases/pr_091300_gore_wins_4.html 20001003075840 http://www.algore2000.com:80/briefingroom/releases/pr_091300_Gore_Wins_4.html - - 3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ - - 130 76906397 unique.20010415101811.arc.gz algore2000.com
336614 com,algore2000)/briefingroom/releases/pr_091300_gore_wins_4.html 20001003073434 http://algore2000.com:80/briefingroom/releases/pr_091300_Gore_Wins_4.html text/html 200 6Y6BX6SUDNF5CASBJH2LASINQ46ASMQF - - 9606 76906527 unique.20010415101811.arc.gz algore2000.com

6448 rows × 12 columns
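
Before fetching any text, it’s worth confirming the mimetype mix of these filtered rows, since fetch_all_text keeps only text-based resources with a 200 or '-' status (your counts will vary with how many CDX files you loaded):

# Tally mimetypes within the candidate-domain subset
cdx_candidate_domains_el00['mimetype'].value_counts()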

Fetching the Text#

Now that we know the majority of the remaining resources in this dataset have a text-based mimetype, we can gather all the text and do some basic analysis. First, we’ll fetch the text from just 50 rows. Because of the 15-second politeness pause between requests, this will take several minutes.

text_df = fetch_all_text(cdx_candidate_domains_el00.tail(50))

Top 25 Words#

Now that the text has been fetched, we’ll do a simple summation and sorting, displaying the top 25 words from our 50-row sample of the 2000 Election dataset.

text_df.sum(axis=0).sort_values(ascending=False).head(25)
tax              34
bush             34
cut              18
plan             16
republicans      11
new              11
gop               9
republican        7
budget            7
vote              7
00                7
gore              7
senate            6
committee         6
year              6
george            6
news              5
americans         5
2000              5
congressional     5
york              5
dakota            4
carolina          4
virginia          4
debt              4
dtype: int64
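
If you’d like a quick visual of these counts, pandas can plot them directly; a minimal sketch, assuming matplotlib is installed:

import matplotlib.pyplot as plt  # For plotting the word counts

top_words = text_df.sum(axis=0).sort_values(ascending=False).head(25)
top_words.plot(kind='barh', figsize=(8, 8), title='Top 25 words in the sampled 2000 election pages')
plt.gca().invert_yaxis()  # Put the most frequent word at the top
plt.tight_layout()
plt.show()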