Data Processing

First, run the definitions notebook:

%run 02_definitions.ipynb

Then load the cached data from the previous section:

a = read_cache("Susan B. Anthony Papers")
c = read_cache("Carrie Chapman Catt Papers")
s = read_cache("Elizabeth Cady Stanton Papers")
t = read_cache("Mary Church Terrell: Advocate for African Americans and Women")

Process the Susan B. Anthony speech subset

A typed inventory of speeches in the Susan B. Anthony Papers is available. It makes it possible to subset the transcription data so that the speeches can be processed and visualized separately from the full transcription dataset.

Extract the transcription text for the speeches

The code below groups the transcription data by `ItemId`. The speech inventory is then used to keep only the items whose `ItemId` belongs to a known speech. The transcribed text is combined at the item level and stored in a list of dictionaries, one per speech, each holding the id, year, title, and text of that speech.

# Load the speech inventory
a_speeches = load_csv(SPEECHES)

# Group transcriptions by ItemId
# Creates a mapping where each ItemId is a key and the value holds the index labels of its rows
a_groups = a.groupby('ItemId').groups

# Create a list of dictionaries representing each speech
# This structure is specifically designed for visualization in the next notebook
speech_list = []

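for row in range(a_speeches.shape[0]):
    # Columns are read by position: 0 = ItemId, 1 = text containing the year, 2 = title
    # Raw string avoids an invalid-escape warning for \d
    d = re.findall(r'\d{4}', a_speeches.iloc[row, 1])
    speech_id = a_speeches.iloc[row, 0]
    speech_text = []
    # .groups holds index labels, so look rows up by label with .loc
    for i in a_groups[speech_id]:
        speech_text.extend(a['processed_text'].loc[i])
    speech = {'id': speech_id, 
              'year': d[0],  # first four-digit match; assumes every row has one
              'title': a_speeches.iloc[row, 2], 
              'text': speech_text}
    speech_list.append(speech)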
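To make the grouping step easier to follow, here is a minimal sketch on a toy table. The column names mirror the ones used above, but the ids, tokens, and date string are invented for illustration, and the year handling with `re.search` is one possible way to guard against a missing year, not necessarily what the notebook does.

```python
import re
import pandas as pd

# Toy stand-in for the transcription table (values are hypothetical)
toy = pd.DataFrame({
    "ItemId": ["mss1", "mss2", "mss1"],
    "processed_text": [["woman", "vote"], ["letter"], ["right", "citizen"]],
})

# .groups maps each ItemId to the index labels of its rows
groups = toy.groupby("ItemId").groups

# Combine the token lists of every row that belongs to one item
tokens = []
for label in groups["mss1"]:
    tokens.extend(toy.loc[label, "processed_text"])

# Take the first four-digit year from a free-text field, or None if absent
match = re.search(r"\d{4}", "Speech, ca. 1873 June")
year = match.group(0) if match else None
```

Here `groups["mss1"]` holds the labels of the first and third rows, so `tokens` ends up with the words from both rows in their original order.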

Save `speech_list` to a Python file for import or reuse beyond these notebooks

write_cache(pd.DataFrame(speech_list), "anthony_speech_lemmas")

Process transcription data for all four datasets

The following code prepares the data in the same way as the Susan B. Anthony speech subset above. It must be run before any dataset-level visualization of the four datasets.

Create a list of all words from `processed_text` for each dataset

This code creates a list of dictionaries, one per dataset, each containing the dataset title and the aggregated text from its `processed_text` column.

transcriptions = []

for dataset in [a, c, s, t]:
    transcription_text = []
    for row in range(dataset.shape[0]):
        transcription_text.extend(dataset['processed_text'].iloc[row])
    transcription = {'title': dataset['Campaign'].iloc[0],  # first row by position, whatever the index labels are
                     'text': transcription_text}
    transcriptions.append(transcription)
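The inner loop above flattens a column of token lists into one long list. As a small sketch of the same idea, the standard library's `itertools.chain.from_iterable` gives an equivalent result. The `rows` list here is a hypothetical stand-in for the values of a `processed_text` column.

```python
from itertools import chain

# Hypothetical stand-in for dataset['processed_text']: one token list per row
rows = [["woman", "vote"], [], ["right", "citizen"]]

# Explicit loop, as in the notebook code
loop_result = []
for row_tokens in rows:
    loop_result.extend(row_tokens)

# Equivalent one-step flattening
chained = list(chain.from_iterable(rows))
```

Both approaches preserve row order and skip empty rows without error. The explicit loop is kept in the notebook code, and either form should work on a column whose cells are lists.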

Save `transcriptions` to a Python file for import or reuse beyond these notebooks

write_cache(pd.DataFrame(transcriptions), "transcriptions_lemmas")
