LoC Data Package Tutorial: United States Elections, Web Archives Data Package#
version 2.0
This notebook demonstrates basic usage of Python for interacting with data packages from the Library of Congress, using the United States Elections, Web Archives Data Package, which is derived from the Library’s United States Elections Web Archive. We will output a summary of the package’s files, query its metadata, and filter, download, and analyze its CDX index files.
Prerequisites#
In order to run this notebook, please follow the instructions listed in this directory’s README.
Output data package summary#
First, we will select United States Elections, Web Archives Data Package and output a summary of its files.
import ast # For reading structured data from metadata.csv
import pandas as pd # For reading, manipulating, and displaying data
import requests # For retrieving online files
import sys # For general system tasks
from helpers import get_file_stats, make_request
# Set general variables we'll use throughout
DATA_URL = 'https://data.labs.loc.gov/us-elections/' # Base URL of this data package
PYTHON_VERSION = sys.version.split('|')[0] # We will use this in our request headers
HEADERS = {  # This allows us to declare ourselves to Library of Congress servers
    'User-Agent': f'https://github.com/LibraryOfCongress/data-exploration/blob/master/Data Packages/us-elections.ipynb : 2.0 (python : {PYTHON_VERSION})'
}
# Download the file manifest
file_manifest_url = f'{DATA_URL}manifest.json'
is_blocked, response = make_request(file_manifest_url, json=True)
if response is None:
    print(f'There was an error retrieving the manifest file at {file_manifest_url}')
files = [dict(zip(response["cols"], row)) for row in response["rows"]]  # zip columns and rows
# Convert to Pandas DataFrame and show stats table
stats = get_file_stats(files)
pd.DataFrame(stats)
|   | FileType | Count | Size |
|---|----------|-------|------|
| 0 | .gz | 394,950 | 227.8GB |
Query the metadata in the data package#
Next we will download this data package’s `metadata.csv` file, print a summary of various values, and demonstrate filtering options.
The `metadata.csv` file lists all of the US election political candidates’ websites that have been collected as part of the United States Elections Web Archive and which are expected to be indexed in this data package’s CDX index files. To read more about this data package’s scope, see its README.
Because the CDX index files include a mix of additional content beyond the candidate sites, the `metadata.csv` file can be used to target content just from political candidate website domains.
metadata_url = f'{DATA_URL}metadata.json'
is_blocked, response = make_request(metadata_url, headers=HEADERS)
if response is None:
    print(f'There was an error retrieving the metadata file at {metadata_url}')
data = response.json()
metadata_df = pd.DataFrame(data)
print(f'Loaded metadata file with {len(metadata_df):,} entries.')
Loaded metadata file with 13,388 entries.
Next, let’s print the DataFrame’s available properties (its columns).
print(', '.join(metadata_df.columns.to_list()))
item_id, item_title, website_url, website_id, website_scopes, collection, website_elections, website_parties, website_places, website_districts, website_thumbnail, website_start_date, website_end_date, item_all_years, website_all_years, mods_url, access_condition
Let’s check the campaign years represented in `metadata.csv`.
collections = metadata_df['collection'].dropna().unique()
years = [collection.split(', ')[1] for collection in collections]
years.sort()
years
['2000', '2002', '2004', '2006', '2008', '2010', '2012', '2014', '2016']
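If you’d also like to see how many website records fall under each campaign year, a quick tally of the collection column (a minimal sketch) works:
# Count metadata rows per collection, which encodes the campaign year
metadata_df['collection'].value_counts().sort_index()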
Interpreting the metadata fields#
The fields are defined in this package’s README. Each row is a particular website collected for a specific candidate in a single election year.
Let’s look at an example row to understand how to interpret the fields. We’ll write out a paragraph describing our example row. We’ll look at row #3460 (which we happen to know represents the only candidate in metadata.csv to have campaigned in two races in the same year under different parties):
# First, let's make sure that our dataframe columns containing lists are interpreted correctly.
list_columns = ['website_elections', 'website_parties', 'website_places',
                'website_districts', 'item_all_years', 'website_all_years',
                'website_scopes']
for column in list_columns:
    metadata_df[column] = metadata_df[column].apply(ast.literal_eval)
row = 3460  # You can change this row number
# We'll grab all the info we need from our row.
record = metadata_df.iloc[row]
item_title = record['item_title']
website_url = record['website_url']
collection = record['collection']
candidate_name = item_title.split('-')[1].strip()
year = collection.split(',')[1].strip()
website_elections = record['website_elections']
website_parties = record['website_parties']
website_places = record['website_places']
website_districts = record['website_districts']
campaign_count = len(website_elections)
website_all_years = sorted(record['website_all_years'])
item_all_years = sorted(record['item_all_years'])
item_id = record['item_id']
mods_url = record['mods_url']
# Now we'll plug those variables into our sentences.
print(f'Record #{row} in the metadata.csv is: {website_url}, from the collection "{collection}".')
print(f'This row represents the website in {year}, used for campaign(s) of the candidate: {candidate_name}.')
print(f'In {year}, this candidate used this website in {campaign_count} campaign(s):')
# Print each campaign on its own numbered line; a missing district prints as blank
for i in range(campaign_count):
    house_district = website_districts[i] if website_districts[i] is not None else ''
    print(f'  {i}. {website_elections[i]} | {website_parties[i]} | {website_places[i]} | {house_district}')
if len(website_all_years) > 1:
    # Cast year (a string) to int so it matches the integer years in website_all_years
    other_years = list(set(website_all_years) - {int(year)})
    print(f'This website ({website_url}) was also used for these other campaign year(s) for {candidate_name}: {other_years}')
print(f'In total, this and possibly other websites were collected for this candidate in the following year(s): {list(set(item_all_years))}')
print(f'The loc.gov item record for {candidate_name} campaign sites can be viewed at {item_id}, and its MODS record can be viewed at {mods_url}.')
# The next line displays our dataframe as a table. Let's set it to show up to 300 characters in each cell
pd.options.display.max_colwidth = 300
print('Here is how this row appears in `metadata.csv`:')
metadata_df[row:row+1]
Record #3460 in the metadata.csv is: http://www.usmjp.com/, from the collection "United States Elections, 2012".
This row represents the website in 2012, used for campaign(s) of the candidate: Cris Ericson.
In 2012, this candidate used this website in 2 campaign(s):
0. United States. Congress. Senate | U.S. Marijuana Party | Vermont |
1. Vermont. Governor | Independent candidates | Vermont |
In total, this and possibly other websites were collected for this candidate in the following year(s): [2018, 2002, 2004, 2006, 2008, 2010, 2012]
The loc.gov item record for Cris Ericson campaign sites can be viewed at http://www.loc.gov/item/lcwaN0002501/, and its MODS record can be viewed at https://tile.loc.gov/storage-services/service/webcapture/project_1/mods/united-states-elections-web-archive/lcwaN0002501.xml.
Here is how this row appears in `metadata.csv`:
|   | item_id | item_title | website_url | website_id | website_scopes | collection | website_elections | website_parties | website_places | website_districts | website_thumbnail | website_start_date | website_end_date | item_all_years | website_all_years | mods_url | access_condition |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3460 | http://www.loc.gov/item/lcwaN0002501/ | Official Campaign Web Site - Cris Ericson | http://www.usmjp.com/ | 3415 | [http://crisericson.com, http://vermontnews.livejournal.com, http://www.myspace.com/usmjp2010, http://crisericson2010.blogspot.com] | United States Elections, 2012 | [United States. Congress. Senate, Vermont. Governor] | [U.S. Marijuana Party, Independent candidates] | [Vermont, Vermont] | [None, None] | http://cdn.loc.gov/service/webcapture/project_1/thumbnails/lcwaS0003415.jpg | 20121003 | 20121019 | [2002, 2004, 2004, 2006, 2008, 2010, 2012, 2012, 2018, 2018] | [2012] | https://tile.loc.gov/storage-services/service/webcapture/project_1/mods/united-states-elections-web-archive/lcwaN0002501.xml | None |
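The MODS record itself can also be fetched programmatically. Here is a minimal sketch that reuses `make_request` and the `mods_url` variable from the cell above:
# Fetch the MODS XML for this candidate's item record
is_blocked, mods_response = make_request(mods_url, headers=HEADERS)
if mods_response is not None:
    print(mods_response.text[:500])  # preview the first 500 characters of the XML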
Now let’s look at all the Vermont gubernatorial candidates represented in this data package.
# We'll create a function to generate summary information about a given type of election
def election_summary(election_type):
    # Note: compare strings with ==, not "is"; identity comparison fails for equal strings
    websites_by_year = metadata_df[metadata_df['website_elections'].apply(
        lambda elections: any(election_type == election for election in elections))]
    candidates = websites_by_year['item_title'].unique()
    websites = websites_by_year['website_url'].unique()
    years = [collection.split(',')[1].strip() for collection in websites_by_year['collection'].unique()]
    min_year = min(years) if years else 'n/a'
    max_year = max(years) if years else 'n/a'
    multi_year_websites = websites_by_year[websites_by_year['website_all_years'].str.len() > 1]['website_url'].unique()
    print(f'Found in metadata.csv: {len(websites)} unique campaign websites for {len(candidates)} "{election_type}" candidates, ranging from years {min_year} - {max_year}.')
    print(f'{len(multi_year_websites)} of these websites were used multiple years.')
election_summary('Vermont. Governor')
Found in metadata.csv: 0 unique campaign websites for 0 "Vermont. Governor" candidates, ranging from years n/a - n/a.
0 of these websites were used multiple years.
Off-year elections aren’t represented in this data package even though they are in the United States Elections Web Archive online collection. This is due to the way that content is organized in CDX files.
For example, Virginia’s gubernatorial elections are off-year elections (in odd-numbered years), and thus are not represented in this data package even though they are in the online collection.
After you run the next cell, try replacing “Virginia. Governor” with something like “United States. Congress. Senate”, “United States. President”, or “Michigan. Governor”.
election_summary('Virginia. Governor')
Found in metadata.csv: 0 unique campaign websites for 0 "Virginia. Governor" candidates, ranging from years n/a - n/a.
0 of these websites were used multiple years.
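If you’re unsure which values will return matches, one way to list the distinct election types recorded in `website_elections` (a quick sketch using pandas’ `explode`) is:
# Flatten the per-row lists of elections and show a sample of the distinct types
election_types = metadata_df['website_elections'].explode().dropna().unique()
print(f'{len(election_types)} distinct election types. A sample:')
print(sorted(election_types)[:10])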
Filter and Download CDX index files, analyze text#
The bulk of this dataset consists of CDX files. In this section, we’ll retrieve a small sample of those CDX files and analyze the text of the web resources they index.
Here we will define the functions in the order that they are used in this section of the notebook.
from bs4 import BeautifulSoup # Used to process the scraped content
import gzip # Used to decompress the gzipped CDX files
from sklearn.feature_extraction.text import CountVectorizer # Used to create a matrix out of a bag of words
from time import sleep # Used to provide a slight pause between requests
WAYBACK_BASE_URL = 'https://webarchive.loc.gov/all/'
WAYBACK_LEGACY_BASE_URL = 'https://webarchive.loc.gov/legacy/'
def gather_files_from_manifest(year: str):
    """
    Function that takes a year (YYYY) as an argument.
    The function collects the locations of the CDX files
    listed by the provided year's manifest.

    Args:
        year (str): String of a year YYYY.

    Returns:
        :obj:`list` of :obj:`str` of individual CDX file URLs. In case
        of error, returns an empty list.
    """
    election_years = ['2000', '2002', '2004', '2006', '2008',
                      '2010', '2012', '2014', '2016']
    if year not in election_years:
        return []
    try:
        manifest_url = f"{DATA_URL}by-year/{year}/manifest.html"
        is_blocked, response = make_request(manifest_url)
        soup = BeautifulSoup(response.content, 'html.parser')
        cdx_files = [link.get('href') for link in soup.find_all('a')]
        return cdx_files
    except Exception:
        print(f'There was an error retrieving and/or parsing {manifest_url}.')
        return []
def fetch_file(cdx_url: str):
    """
    Function that takes a `String` as an argument.
    The `cdx_url` is a singular item from the result
    of the `gather_files_from_manifest` function.
    The function fetches the gzipped CDX file, decompresses it,
    splits it on the newlines, and removes the header.

    Args:
        cdx_url (str): Individual item from the result of
            the `gather_files_from_manifest` function.

    Returns:
        :obj:`list` of :obj:`str` of individual CDX lines, each representing
        a web object. Returns an empty list in case of errors.
    """
    # Get the CDX file. For a production script, you'll want to build in additional error handling.
    try:
        response = requests.get(cdx_url)
    except Exception:
        response = None
    # Decompress the gzipped CDX, decode it, split it on newlines, and remove the header row
    try:
        cdx_content = gzip.decompress(response.content).decode('utf-8').split('\n')[1:]
        return cdx_content
    except Exception:
        print(f'There was an error decompressing or parsing the CDX file: {cdx_url}. This file will be skipped.')
        return []
def create_dataframe(data: list):
    """
    Function that takes a :obj:`list` of :obj:`str` as an argument.
    `data` is the contents of the CDX file split on newlines.
    This function takes `data`, applies a schema to it, and transforms it
    into a `pandas.DataFrame`.

    Args:
        data (list): :obj:`list` of :obj:`str`. Each item is a line from
            a CDX file or group of files.

    Returns:
        A `pandas.DataFrame` of a CDX file or group of files. In case of error,
        a blank `pandas.DataFrame` is returned.
    """
    schema = [
        'urlkey',
        'timestamp',
        'original',
        'mimetype',
        'statuscode',
        'digest',
        'redirect',
        'metatags',
        'file_size',
        'offset',
        'warc_filename'
    ]
    try:
        _data = [row.split() for row in data]
        df = pd.DataFrame(_data, columns=schema)
        return df
    except Exception:
        print('There was an error converting the data into a dataframe. Returning a blank dataframe.')
        return pd.DataFrame()
def create_dataframe_from_manifest(manifest: list):
    """
    Function that takes a :obj:`list` of :obj:`str` as an argument.
    The `manifest` is a list of all the individual CDX files found
    from an election year's or group of election years' HTML manifest.
    This function loops through each file, transforms it into a `pandas.DataFrame`
    by calling the `create_dataframe` function, concats the DataFrames together,
    and then returns the DataFrame representing the entire manifest.

    Args:
        manifest (list): :obj:`list` of :obj:`str` of all the individual CDX files found
            from an election year's or group of election years' HTML manifest.

    Returns:
        `pandas.DataFrame` representing every file present in the `manifest`.
    """
    # Collect the per-file DataFrames in a list and concatenate once at the end
    frames = []
    for cdx_url in manifest:
        cdx = fetch_file(cdx_url)
        if len(cdx) == 0:
            continue
        try:
            frames.append(create_dataframe(cdx))
        except Exception:
            print(f'There was an error converting {cdx_url} to a dataframe. This may be due to a malformed CDX file. This data will be skipped.')
    return pd.concat(frames) if frames else pd.DataFrame()
def fetch_text(row: pd.Series):
    """
    Function that takes a `pandas.Series`, which is a single row
    from a `pandas.DataFrame`, as an argument.
    The function uses the timestamp and original fields from the `row`
    to request the specific resource from OpenWayback. Once the resource is
    fetched, the Wayback banner div elements are removed so as to not detract
    from the words in the resource itself.

    Args:
        row (pandas.Series): `pandas.Series`, which is a single row
            from a `pandas.DataFrame`.

    Returns:
        `String` of the resource's text. If an error is encountered, returns
        an empty string.
    """
    playback_url = row['original']
    if (row['timestamp'] is None) or (row['timestamp'] == ''):
        print(f'CDX row is missing timestamp. Not retrieving text for {playback_url}')
        return ''
    timestamp = row['timestamp']
    # Captures from 2000 are served by the legacy Wayback instance
    if timestamp.startswith('2000'):
        base_url = WAYBACK_LEGACY_BASE_URL
    else:
        base_url = WAYBACK_BASE_URL
    is_blocked, response = make_request(f"{base_url}{timestamp}/{playback_url}", pause=15)
    if response is None:
        print(f'Error retrieving {base_url}{timestamp}/{playback_url}. Skipping full text for this document.')
        return ''
    if is_blocked is True:
        print(f'429 too many requests. Skipping: {base_url}{timestamp}/{playback_url}')
        return 429
    try:
        soup = BeautifulSoup(response.text, 'html.parser')
        # Strip the Wayback banner divs so they don't pollute the word counts
        for el in soup.find_all('div', {'id': 'wm-maximized'}):
            el.extract()
        for el in soup.find_all('div', {'id': 'wm-minimized'}):
            el.extract()
        return soup.text
    except Exception:
        print(f'Error parsing full text from {base_url}{timestamp}/{playback_url}. Skipping full text for this document.')
        return ''
def fetch_all_text(df: pd.DataFrame):
    """
    Function that takes a `pandas.DataFrame` as an argument.
    This is the most complicated function here. The function first cleans the
    `df` that was passed in by dropping all the rows that do not have a value in the
    mimetype field. Then, it drops all the duplicate digests, which removes resources
    that are exactly the same. Finally, it keeps only the rows that have 'text' in the
    mimetype field and have a '200' or '-' HTTP status response.
    Once the `df` is cleaned, each resource's text is fetched from the Wayback and
    transformed into a matrix using `sklearn.CountVectorizer`, and a `pandas.DataFrame`
    of words and their occurrence per resource is returned. A politeness of 15 seconds
    is added between Wayback requests.

    Args:
        df (pandas.DataFrame): `pandas.DataFrame` representing web resources as CDX lines.

    Returns:
        `pandas.DataFrame` of the resources' words tabulated per web resource.
    """
    countvec = CountVectorizer(ngram_range=(1, 1), stop_words='english')
    unprocessed_bag_of_words = []
    # Note: both filter conditions must live in a single query string. Joining two
    # Python string literals with `and` evaluates to just the second string,
    # silently dropping the status-code condition.
    text_df = df\
        .dropna(subset=['mimetype'])\
        .drop_duplicates(subset=['digest'])\
        .query(
            '(statuscode.str.match("200") or statuscode.str.match("-")) '
            'and mimetype.str.contains("text")',
            engine='python'
        )
    for i, row in text_df.iterrows():
        fetched_text = fetch_text(row)
        if fetched_text == 429:
            print('Halting requests for web archives. Received a 429 error from the server, which means too many requests too quickly.')
            break
        unprocessed_bag_of_words.append(fetched_text)
    processed_bag_of_words = countvec.fit_transform(unprocessed_bag_of_words)
    return pd.DataFrame(processed_bag_of_words.toarray(), columns=countvec.get_feature_names_out())
Gathering the list of CDX Files#
The first step is gathering the list of CDX files. To do that, simply call the `gather_files_from_manifest` function, providing the election year as an argument.
el00_files = gather_files_from_manifest('2000')
Let’s look at our first five files:
el00_files[:5]
['https://data.labs.loc.gov/us-elections/by-year/2000/cdx/unique.20010415093936.surt.cdx.gz',
'https://data.labs.loc.gov/us-elections/by-year/2000/cdx/unique.20010415094743.surt.cdx.gz',
'https://data.labs.loc.gov/us-elections/by-year/2000/cdx/unique.20010415095044.surt.cdx.gz',
'https://data.labs.loc.gov/us-elections/by-year/2000/cdx/unique.20010415095244.surt.cdx.gz',
'https://data.labs.loc.gov/us-elections/by-year/2000/cdx/unique.20010415095459.surt.cdx.gz']
Inspect a sample CDX File#
Next, we’ll demonstrate what a particular CDX file looks like by examining the first five lines of our first CDX file from 2000.
cdx = fetch_file(el00_files[0])
cdx[:5]
['com,voter)/home/candidates/info/0,1214,2-11880-,00.html 20001002182124 http://www.voter.com:80/home/candidates/info/0,1214,2-11880-,00.html text/html 200 FYXP43MQC5GVBQMVK3ETWSPXUBR5ICKP - - 5051 149 unique.20010415093936.arc.gz',
'com,voter)/home/candidates/info/0,1214,2-18885-,00.html 20001002185814 http://www.voter.com:80/home/candidates/info/0,1214,2-18885-,00.html text/html 200 H6QN5ZULJ6YZP756QNVM3YXKXC7HZUIL - - 4829 5200 unique.20010415093936.arc.gz',
'com,voter)/home/candidates/info/0,1214,2-18880-,00.html 20001002185815 http://www.voter.com:80/home/candidates/info/0,1214,2-18880-,00.html text/html 200 HFG67JI4KBPHFXMQE5DJRHF3OEKKBOO6 - - 4794 10029 unique.20010415093936.arc.gz',
'com,voter)/home/officials/general/1,1195,2-2467-,00.html 20001002185815 http://voter.com:80/home/officials/general/1,1195,2-2467-,00.html text/html 200 HZJFLTHZD5MGEPJS2WVGBHQRQUPFBE3O - - 5282 14823 unique.20010415093936.arc.gz',
'com,voter)/home/candidates/info/0,1214,2-18886-,00.html 20001002185816 http://www.voter.com:80/home/candidates/info/0,1214,2-18886-,00.html text/html 200 QAM7JW7S4CNYMP6HLA6DASOXTO2SIGWO - - 4823 20105 unique.20010415093936.arc.gz']
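Each CDX line is a space-delimited record. To see what the positions mean, you can zip a single line against the same eleven-field schema that `create_dataframe` applies below (a small sketch):
schema = ['urlkey', 'timestamp', 'original', 'mimetype', 'statuscode', 'digest',
          'redirect', 'metatags', 'file_size', 'offset', 'warc_filename']
for field, value in zip(schema, cdx[0].split()):
    print(f'{field}: {value}')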
Now, here is the same CDX file transformed into a DataFrame:
cdx_df = create_dataframe(cdx)
cdx_df
|   | urlkey | timestamp | original | mimetype | statuscode | digest | redirect | metatags | file_size | offset | warc_filename |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | com,voter)/home/candidates/info/0,1214,2-11880-,00.html | 20001002182124 | http://www.voter.com:80/home/candidates/info/0,1214,2-11880-,00.html | text/html | 200 | FYXP43MQC5GVBQMVK3ETWSPXUBR5ICKP | - | - | 5051 | 149 | unique.20010415093936.arc.gz |
| 1 | com,voter)/home/candidates/info/0,1214,2-18885-,00.html | 20001002185814 | http://www.voter.com:80/home/candidates/info/0,1214,2-18885-,00.html | text/html | 200 | H6QN5ZULJ6YZP756QNVM3YXKXC7HZUIL | - | - | 4829 | 5200 | unique.20010415093936.arc.gz |
| 2 | com,voter)/home/candidates/info/0,1214,2-18880-,00.html | 20001002185815 | http://www.voter.com:80/home/candidates/info/0,1214,2-18880-,00.html | text/html | 200 | HFG67JI4KBPHFXMQE5DJRHF3OEKKBOO6 | - | - | 4794 | 10029 | unique.20010415093936.arc.gz |
| 3 | com,voter)/home/officials/general/1,1195,2-2467-,00.html | 20001002185815 | http://voter.com:80/home/officials/general/1,1195,2-2467-,00.html | text/html | 200 | HZJFLTHZD5MGEPJS2WVGBHQRQUPFBE3O | - | - | 5282 | 14823 | unique.20010415093936.arc.gz |
| 4 | com,voter)/home/candidates/info/0,1214,2-18886-,00.html | 20001002185816 | http://www.voter.com:80/home/candidates/info/0,1214,2-18886-,00.html | text/html | 200 | QAM7JW7S4CNYMP6HLA6DASOXTO2SIGWO | - | - | 4823 | 20105 | unique.20010415093936.arc.gz |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1096875 | com,voter)/home/candidates/info/0,1214,2-9118-,00.html | 20001002183052 | http://www.voter.com:80/home/candidates/info/0,1214,2-9118-,00.html | - | - | 3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ | - | - | 118 | 145323588 | unique.20010415093936.arc.gz |
| 1096876 | com,voter)/home/candidates/info/0,1214,2-9115-,00.html | 20001002183052 | http://www.voter.com:80/home/candidates/info/0,1214,2-9115-,00.html | - | - | 3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ | - | - | 118 | 145323706 | unique.20010415093936.arc.gz |
| 1096877 | com,voter)/home/candidates/info/0,1214,2-15361-,00.html | 20001002182249 | http://www.voter.com:80/home/candidates/info/0,1214,2-15361-,00.html | - | - | 3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ | - | - | 119 | 145323824 | unique.20010415093936.arc.gz |
| 1096878 | com,voter)/home/candidates/info/0,1214,2-12994-,00.html | 20001002181842 | http://www.voter.com:80/home/candidates/info/0,1214,2-12994-,00.html | text/html | 404 | UDSH36NBYWO2X73LNMX2LEHLNQ7FYXHZ | - | - | 351 | 145323943 | unique.20010415093936.arc.gz |
| 1096879 | None | None | None | None | None | None | None | None | None | None | None |

1096880 rows × 11 columns
Election 2000 DataFrame#
Now we’ll create a DataFrame from the first fifteen CDX files in the 2000 election subset. To do that, we’ll use `create_dataframe_from_manifest`, which loops over the CDX files and calls `create_dataframe` programmatically instead of manually and individually as we did above.
If we had more time or were working on a more powerful computer, we’d pull from all of the files in the 2000 subset, but for now we’ll just pull from the first fifteen.
el00_df = create_dataframe_from_manifest(el00_files[0:15])
el00_df
|   | urlkey | timestamp | original | mimetype | statuscode | digest | redirect | metatags | file_size | offset | warc_filename |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | com,voter)/home/candidates/info/0,1214,2-11880-,00.html | 20001002182124 | http://www.voter.com:80/home/candidates/info/0,1214,2-11880-,00.html | text/html | 200 | FYXP43MQC5GVBQMVK3ETWSPXUBR5ICKP | - | - | 5051 | 149 | unique.20010415093936.arc.gz |
| 1 | com,voter)/home/candidates/info/0,1214,2-18885-,00.html | 20001002185814 | http://www.voter.com:80/home/candidates/info/0,1214,2-18885-,00.html | text/html | 200 | H6QN5ZULJ6YZP756QNVM3YXKXC7HZUIL | - | - | 4829 | 5200 | unique.20010415093936.arc.gz |
| 2 | com,voter)/home/candidates/info/0,1214,2-18880-,00.html | 20001002185815 | http://www.voter.com:80/home/candidates/info/0,1214,2-18880-,00.html | text/html | 200 | HFG67JI4KBPHFXMQE5DJRHF3OEKKBOO6 | - | - | 4794 | 10029 | unique.20010415093936.arc.gz |
| 3 | com,voter)/home/officials/general/1,1195,2-2467-,00.html | 20001002185815 | http://voter.com:80/home/officials/general/1,1195,2-2467-,00.html | text/html | 200 | HZJFLTHZD5MGEPJS2WVGBHQRQUPFBE3O | - | - | 5282 | 14823 | unique.20010415093936.arc.gz |
| 4 | com,voter)/home/candidates/info/0,1214,2-18886-,00.html | 20001002185816 | http://www.voter.com:80/home/candidates/info/0,1214,2-18886-,00.html | text/html | 200 | QAM7JW7S4CNYMP6HLA6DASOXTO2SIGWO | - | - | 4823 | 20105 | unique.20010415093936.arc.gz |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 338148 | org,ctgop)/county/tolland.htm | 20001006073643 | http://www.ctgop.org:80/county/tolland.htm | - | - | 3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ | - | - | 101 | 79251104 | unique.20010415101811.arc.gz |
| 338149 | org,ctgop)/county/tolland.htm | 20001005073549 | http://www.ctgop.org:80/county/tolland.htm | - | - | 3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ | - | - | 101 | 79251205 | unique.20010415101811.arc.gz |
| 338150 | org,ctgop)/county/tolland.htm | 20001004073505 | http://www.ctgop.org:80/county/tolland.htm | - | - | 3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ | - | - | 101 | 79251306 | unique.20010415101811.arc.gz |
| 338151 | org,ctgop)/county/tolland.htm | 20001003073437 | http://www.ctgop.org:80/county/tolland.htm | text/html | 200 | TIRWMHRDJ5L22TJWCXVA6TNU5YOB65SW | - | - | 1421 | 79251407 | unique.20010415101811.arc.gz |
| 338152 | None | None | None | None | None | None | None | None | None | None | None |

1541579 rows × 11 columns
Mimetypes#
For this exercise, we’re going to take a brief look at the mimetypes. First, we’ll select the mimetype column in the DataFrame and tally its values by calling `value_counts`, a pandas method.
el00_mimetypes = el00_df['mimetype'].value_counts()
el00_mimetypes
mimetype
- 1493256
text/html 43969
image/jpeg 2756
image/gif 1311
application/pdf 122
text/plain 59
image/bmp 28
audio/x-pn-realaudio 18
application/msword 11
text/css 4
image/png 4
application/octet-stream 3
application/x-javascript 3
video/quicktime 3
application/zip 2
audio/x-wav 2
audio/midi 2
text/xml 2
application/mac-binhex40 1
audio/x-aiff 1
image/tiff 1
application/x-tar 1
application/x-pointplus 1
audio/x-midi 1
video/x-msvideo 1
audio/basic 1
audio/x-mpeg 1
Name: count, dtype: int64
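Most captures in this sample carry no declared mimetype (shown as “-”), but among the declared types, text dominates. A quick way to compute the text share (a minimal sketch):
# What fraction of captures report a text-based mimetype?
text_like = el00_mimetypes[el00_mimetypes.index.str.startswith('text')].sum()
total = el00_mimetypes.sum()
print(f'{text_like:,} of {total:,} captures ({text_like / total:.1%}) declare a text mimetype.')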
Filtering by domain#
Let’s now look at the domains and subdomains represented in the 2000 CDX files. We’ll ignore the “www” part of URLs, but otherwise retain subdomains.
import re # For using regular expressions to remove parts of URLs
from urllib.parse import urlparse # For locating the base domain in URLs
def get_domains(urls):
    if urls is None:
        return []
    if isinstance(urls, str):
        urls = [urls]
    domains = set()
    for url in urls:
        parsed_url = urlparse(url)
        domain = parsed_url.netloc
        if isinstance(domain, bytes):
            domain = None
        # Skip URLs with no usable domain
        if domain is None or domain == '':
            continue
        # Remove www., www1., etc.
        domain = re.sub(r"www\d?\.(.*)", r"\1", domain)
        # Remove ports, as in some-website.com:80
        domain = domain.split(':')[0]
        domains.add(domain)
    return list(domains)
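A quick sanity check on one of the URLs from the sample CDX above shows the normalization (the `www.` prefix and port are stripped):
# Expect ['voter.com']
get_domains('http://www.voter.com:80/home/candidates/info/0,1214,2-11880-,00.html')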
el00_df['domains'] = el00_df['original'].apply(get_domains).str[0]
for cdx_domain in el00_df['domains'].unique():
    print(cdx_domain)
voter.com
whitehouse.gov
hayes.voter.com
freespeech.org
cnn.com
freedomchannel.com
essential.org
fsudemocrats.org
commoncause.org
democrats.org
uspolitics.about.com
enterstageright.com
reason.com
usconservatives.about.com
idaho-democrats.org
usliberals.about.com
www10.nytimes.com
election.voter.com
graphics.nytimes.com
nydems.org
adams.voter.com
mockelection.com
rpk.org
dnet.org
commonconservative.com
beavoter.org
beavoter.com
iowademocrats.org
forums.about.com
thiselection.com
indiaelection.com
server1.dscc.org
search.algore2000.com
forums.nytimes.com
azlp.org
intellectualcapital.com
prospect.org
grassroots.com
rnc.org
lwv.org
mn-politics.com
newwestpolitics.com
popandpolitics.com
washingtonpost.com
nacdnet.org
lp.org
algore2000.com
crlp.org
harrybrowne2000.org
ga.lp.org
emilyslist.org
ncgop.org
arkdems.org
cbdnet.org
keyes-grassroots.com
faqvoter.com
americanprospect.org
partners.nytimes.com
indems.org
ageofreason.com
vanishingvoter.org
nyc.dnet.org
robots.cnn.com
informedvoter.com
virginiapolitics.com
newpolitics.com
nan
md.lp.org
ca-dem.org
beachdemocrats.org
ohiodems.org
maryland.reformparty.org
muscatinedemocrats.org
9thdistgagop.org
rcdnet.org
azgop.org
maricopagop.org
kansas.reformparty.org
newjersey.reformparty.org
california.reformparty.org
timeline.reformparty.org
algop.org
pelicanpolitics.com
espanol.voter.com
gorelieberman.com
election.com
ceednet.org
followthemoney.org
debates.org
cagop.org
wsrp.org
indgop.org
members.freespeech.org
schoolelection.com
convention.texasgop.org
cal.votenader.org
candidate.grassroots.com
1-877-leadnow.com
madison.voter.com
sierraclub.org
mt.nacdnet.org
ma.lwv.org
irchelp.org
calvoter.org
njdems.org
sfvoter.com
vademocrats.org
reformparty.org
missouridems.org
pa.lwv.org
akdemocrats.org
njlp.org
hagelin.org
keyes2000.org
tray.com
nrsc.org
deldems.org
nrcc.org
ksdp.org
kansasyoungdemocrats.org
washington.reformparty.org
dems2000.com
arkgop.com
scdp.org
plp.org
votenader.org
votenader.com
northcarolina.reformparty.org
ca.lwv.org
ks.nacdnet.org
txdemocrats.org
politics1.com
gagop.org
slp.org
gwbush.com
akrepublicans.org
wi.nacdnet.org
green.votenader.org
rpv.org
fec.gov
nytimes.com
naacp.org
hawaiidemocrats.org
nygop.org
gopatgo2000.org
democratsabroad.org
pub.whitehouse.gov
archive.lp.org
gop-mn.org
migop.org
ca.lp.org
monmouthlp.org
ncdp.org
cologop.org
mi.lp.org
cobbdemocrats.org
tx.lp.org
campaignoffice.com
freetrial.campaignoffice.com
calendar.rnc.org
rireformparty.org
ehdemocrats.org
poll1.debates.org
nevadagreenparty.org
newvoter.com
mi.lwv.org
georgia.reformparty.org
delaware.reformparty.org
stonewalldfl.org
santacruzlp.org
forums.hagelin.org
forum.hagelin.org
iowagop.org
ohiogop.org
sddemocrats.org
skdemocrats.org
wisdems.org
sfgreenparty.org
il.lp.org
rtumble.com
ctdems.org
alaskarepublicans.com
detroitnaacp.org
greenparty.org
ndgop.com
nh-democrats.org
rosecity.net
sandiegovoter.com
montanagop.org
dc.reformparty.org
greenparties.org
mainegop.com
stmarysdemocrats.org
comalcountydemocrats.org
masonforrnc.org
sblp.org
chesapeakedemocrats.org
tejanodemocrats.org
connecticut.georgewbush.com
students.georgewbush.com
youngprofessionals.georgewbush.com
maine.georgewbush.com
latinos.georgewbush.com
veterans.georgewbush.com
africanamericans.georgewbush.com
missouri.georgewbush.com
agriculture.georgewbush.com
mississippi.georgewbush.com
minnesota.georgewbush.com
arizona.georgewbush.com
northcarolina.georgewbush.com
virginia.georgewbush.com
kentucky.georgewbush.com
texas.georgewbush.com
lvvlwv.org
kansassenatedemocrats.org
nhgop.org
nebraskademocrats.org
southcarolina.reformparty.org
tndemocrats.org
fcncgop.org
padems.com
gore-2000.com
union.arkdems.org
illinois.reformparty.org
nevadagop.org
rhodeisland.reformparty.org
massdems.org
allencountydemocrats.org
mogop.org
oklahoma.reformparty.org
oklp.org
speakout.com
windemocrats.org
washingtoncountydemocrats.org
salinecodemocrats.org
njgop.org
sddp.org
pennsylvania.reformparty.org
lademo.org
allgore.com
web.democrats.org
pagop.org
library.whitehouse.gov
docs.whitehouse.gov
idaho.reformparty.org
alaska.net
georgybush.com
rpof.org
publishing1.speakout.com
de.lp.org
mainedems.org
clarkgop.com
kansashousedemocrats.org
georgiaparty.com
la.lp.org
ny.lp.org
nebraska.reformparty.org
maine.reformparty.org
indiana.reformparty.org
myweb.clark.net
clark.net
ga.lwv.org
traviscountydemocrats.org
cheshiredemocrats.org
exchange.nrcc.org
growthelp.org
sbdemocrats.org
montana.reformparty.org
politicalshop.com
massgop.com
ohio.reformparty.org
scgop.com
wvgop.org
c-span.org
westvirginia.reformparty.org
wwwalgore2000.com
texas.reformparty.org
florida-democrats.org
delawaregop.com
publicrelations.reformparty.org
nj.nacdnet.org
ohionlp.org
communications.reformparty.org
newhampshire.reformparty.org
aladems.org
arkansas.reformparty.org
avlp.org
vtdemocrats.org
jackgreenlp.org
waynegop.org
mi-democrats.com
13thdistrictdems.org
rules.reformparty.org
negop.org
dscc.org
mccain2000.com
oclp.org
ilgop.org
hawaii.reformparty.org
arch-cgi.lp.org
crnc.org
sc.ca.lp.org
8thcd.vademocrats.org
foreignpolicy2000.org
bradely.campaignoffice.com
wwwsanderson.campaignoffice.com
florida.reformparty.org
al.lp.org
dpo.org
oahudemocrats.org
columbia.arkdems.org
kentucky.reformparty.org
phoenixnewtimes.com
purepolitics.com
concernedvoter.com
iowa.reformparty.org
wyoming.reformparty.org
harriscountygreenparty.org
american-politics.com
issues.reformparty.org
nysrtlp.org
stpaul.mn.lwv.org
arlingtondemocrats.org
okgop.com
utahgop.org
utdemocrats.org
mississippi.reformparty.org
plymouth.ma.nacdnet.org
tennessee.reformparty.org
minnesota.reformparty.org
dpnm.org
georgebush2000.com
vayoungdemocrats.org
northdakota.reformparty.org
stonewalldemocrats.org
virginia.reformparty.org
fastlane.net
youngdemocrats.org
msgop.org
calgop.org
votegrassroots.com
wvdemocrats.com
housedems2000.com
lubbockdemocrats.org
ildems.org
okdemocrats.org
lccdnet.org
fecweb1.fec.gov
trinity.ca.lp.org
ventura.ca.lp.org
3rdcd.vademocrats.org
de.lwv.org
mdgop.org
flgopsenate.campaignoffice.com
bradley.campaignoffice.com
kydems.campaignoffice.com
tx.nacdnet.org
mo.nacdnet.org
texasgop.org
in.rcdnet.org
life.ca.lp.org
victory.texasgop.org
charlestondemocrats.org
wyomingdemocrats.com
nd.nacdnet.org
college.reformparty.org
al.nacdnet.org
nddemnpl.campaignoffice.com
kulick-jackson.campaignoffice.com
wasiluk.campaignoffice.com
hilstrom.campaignoffice.com
schumacher.campaignoffice.com
dfl.org
slawik.campaignoffice.com
markthompson.campaignoffice.com
rest.campaignoffice.com
vigil.campaignoffice.com
graves.campaignoffice.com
mcinnis.campaignoffice.com
hoosier.campaignoffice.com
connor.campaignoffice.com
bernardy.campaignoffice.com
housedflcaucus.campaignoffice.com
goodwin.campaignoffice.com
peaden.campaignoffice.com
kurita.campaignoffice.com
hi.lp.org
mtdemocrats.org
or.nacdnet.org
kcswcd.mo.nacdnet.org
id.nacdnet.org
sd.nacdnet.org
ny.nacdnet.org
yvote2000.com
oh.nacdnet.org
va.nacdnet.org
tn.nacdnet.org
fl.nacdnet.org
ca.nacdnet.org
co.nacdnet.org
ky.lp.org
georgewbush.com
massachusetts.reformparty.org
arizona.reformparty.org
louisiana.reformparty.org
nm.nacdnet.org
tazewell.va.nacdnet.org
aflcio.org
azdem.org
columbiana.oh.nacdnet.org
lacledeswcd.mo.nacdnet.org
reclaimdemocracy.org
ctgop.org
nevada.reformparty.org
in.nacdnet.org
michigan.reformparty.org
newyork.reformparty.org
nc.nacdnet.org
wa.nacdnet.org
ak.nacdnet.org
pa.nacdnet.org
billbradley.com
macdnet.org
lmcd.mt.nacdnet.org
socialdemocrats.org
bexardemocrats.org
alabama.reformparty.org
globalelection.com
wisconsin.reformparty.org
geocities.com
coloradodems.org
As you can see from the list above, the 2000 CDX files include content from a wide range of domains, not limited to political candidate campaign websites.
In the early years of the United States Elections Web Archive, the scope of the collection included websites of political parties, government, advocacy groups, bloggers, and other individuals and groups expressing relevant views. These sites have generally been moved into the Public Policy Topics Web Archive or into the general web archives. However, the CDX files index the content as it was originally captured. The CDX files may also index content from non-candidate resources if candidate sites linked to those resources or embedded that content. Occasionally, other out-of-scope content may also appear in CDX files otherwise dedicated to U.S. elections.
Let’s grab only those lines from the CDX files that match domains from the candidate websites in our `metadata.csv` file. We’ll include the campaign candidates’ websites themselves, as well as any domains that appear in the `website_scopes` column. Domains in that column are additional URLs that the web archiving crawler was instructed to collect alongside the campaign website, if the campaign website linked to those URLs. For a more detailed description, see this data package’s README.
year = 2000
def get_domains_by_year(year):
    year_metadata = metadata_df[metadata_df['collection'].str.contains(year)].copy()
    if len(year_metadata) > 0:
        year_metadata['seeds_domains'] = year_metadata['website_url'].apply(get_domains)
        year_metadata['scope_domains'] = year_metadata['website_scopes'].apply(get_domains)
        year_metadata['all_domains'] = year_metadata['seeds_domains'] + year_metadata['scope_domains']
        all_domains = [item for sublist in year_metadata['all_domains'].dropna() for item in sublist]
        return list(set(all_domains))
    else:
        print(f'Sorry, there were no rows in metadata.csv for content from {year}')
        return []
metadata_domains = get_domains_by_year(str(year))
print(f'Domains from the {str(year)} US Elections collection:')
metadata_domains
Domains from the 2000 US Elections collection:
['algore2000.com',
'harrybrowne2000.org',
'gopatgo2000.org',
'algore.com',
'keyes2000.org',
'hagelin.org',
'forum.hagelin.org']
Now we’re ready to filter. Let’s filter down our sample CDX lines to just those lines that point to the candidate website domains from `metadata.csv`, listed above. This means we’ll only include CDX rows for domains like `algore2000.com` and `gopatgo2000.org`, but not sites like `voter.com` or `whitehouse.gov`.
cdx_candidate_domains_el00 = el00_df[
    el00_df['original'].apply(
        lambda url: any(domain in url for domain in metadata_domains) if url else False
    )
]
cdx_candidate_domains_el00
|   | urlkey | timestamp | original | mimetype | statuscode | digest | redirect | metatags | file_size | offset | warc_filename | domains |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 25175 | com,algore2000,search)/search | 20001030063531 | http://search.algore2000.com:80/search/ | - | - | 3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ | - | - | 97 | 7471624 | unique.20010415093936.arc.gz | search.algore2000.com |
| 26166 | com,algore2000,search)/search | 20001030053022 | http://search.algore2000.com:80/search/ | - | - | 3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ | - | - | 97 | 7587973 | unique.20010415093936.arc.gz | search.algore2000.com |
| 49892 | com,algore2000,search)/search | 20001029053020 | http://search.algore2000.com:80/search/ | - | - | 3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ | - | - | 97 | 10612154 | unique.20010415093936.arc.gz | search.algore2000.com |
| 73526 | com,algore2000,search)/search | 20001028053001 | http://search.algore2000.com:80/search/ | - | - | 3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ | - | - | 99 | 13619683 | unique.20010415093936.arc.gz | search.algore2000.com |
| 97191 | com,algore2000,search)/search | 20001027053201 | http://search.algore2000.com:80/search/ | - | - | 3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ | - | - | 98 | 16632272 | unique.20010415093936.arc.gz | search.algore2000.com |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 336264 | org,keyes2000)/images/newsimage.jpg | 20001003073434 | http://keyes2000.org:80/images/newsimage.jpg | image/jpeg | 200 | LWERVVNORJQ6IBZCJ4SBNH26JU6NH3MV | - | - | 13527 | 76178594 | unique.20010415101811.arc.gz | keyes2000.org |
| 336611 | com,algore2000)/briefingroom/releases/pr_091300_gore_wins_4.html | 20001004075816 | http://www.algore2000.com:80/briefingroom/releases/pr_091300_Gore_Wins_4.html | - | - | 3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ | - | - | 130 | 76906140 | unique.20010415101811.arc.gz | algore2000.com |
| 336612 | com,algore2000)/briefingroom/releases/pr_091300_gore_wins_4.html | 20001004073516 | http://algore2000.com:80/briefingroom/releases/pr_091300_Gore_Wins_4.html | - | - | 3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ | - | - | 127 | 76906270 | unique.20010415101811.arc.gz | algore2000.com |
| 336613 | com,algore2000)/briefingroom/releases/pr_091300_gore_wins_4.html | 20001003075840 | http://www.algore2000.com:80/briefingroom/releases/pr_091300_Gore_Wins_4.html | - | - | 3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ | - | - | 130 | 76906397 | unique.20010415101811.arc.gz | algore2000.com |
| 336614 | com,algore2000)/briefingroom/releases/pr_091300_gore_wins_4.html | 20001003073434 | http://algore2000.com:80/briefingroom/releases/pr_091300_Gore_Wins_4.html | text/html | 200 | 6Y6BX6SUDNF5CASBJH2LASINQ46ASMQF | - | - | 9606 | 76906527 | unique.20010415101811.arc.gz | algore2000.com |

6448 rows × 12 columns
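Before fetching any text, it can be useful to see how these filtered captures are distributed across the candidate domains (a quick check using the domains column we added earlier):
# Tally filtered CDX rows by candidate domain
cdx_candidate_domains_el00['domains'].value_counts()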
Fetching the Text#
Now that we know most of the resources with a declared mimetype are text-based, we can gather that text and do some basic analysis. First, we’ll fetch the text from just 50 rows. This will take a few minutes.
text_df = fetch_all_text(cdx_candidate_domains_el00.tail(50))
Top 25 Words#
Now that the text has been fetched, we’ll do a simple summation and sorting, displaying the top 25 words from the 50 rows we sampled from the 2000 election dataset.
text_df.sum(axis=0).sort_values(ascending=False).head(25)
tax 34
bush 34
cut 18
plan 16
republicans 11
new 11
gop 9
republican 7
budget 7
vote 7
00 7
gore 7
senate 6
committee 6
year 6
george 6
news 5
americans 5
2000 5
congressional 5
york 5
dakota 4
carolina 4
virginia 4
debt 4
dtype: int64
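The same matrix also supports per-page views. For example, to see the top terms in a single captured page rather than across the whole sample (row 0 here is an arbitrary choice):
# Top 10 terms for one fetched page; each row of text_df is one web resource
text_df.iloc[0].sort_values(ascending=False).head(10)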