Text tokenization#
This section tokenizes the transcription data and adds new columns to the data frame for each transcription dataset.
First, we run the definitions step from the previous section.
%run 02_definitions.ipynb
Load each transcription dataset into a data frame#
The load_csv function reads the data from each path constant and stores it in a pandas data frame.
a = load_csv(ANTHONY)
c = load_csv(CATT)
s = load_csv(STANTON)
t = load_csv(TERRELL)
Optional: Preview the first five lines of a loaded dataset#
First five lines for Anthony dataset#
a.head()
Campaign | Project | Item | ItemId | Asset | AssetId | AssetStatus | DownloadUrl | Transcription | Tags | |
---|---|---|---|---|---|---|---|---|---|---|
0 | Susan B. Anthony Papers | Speeches and other writings | Susan B. Anthony Papers: Speeches and Writings... | mss11049038 | mss11049038-1 | 179295 | completed | http://tile.loc.gov/image-services/iiif/servic... | Susan B. Anthony SPEECHES AND WRITINGS FI... | May 1852 |
1 | Susan B. Anthony Papers | Speeches and other writings | Susan B. Anthony Papers: Speeches and Writings... | mss11049038 | mss11049038-2 | 179296 | completed | http://tile.loc.gov/image-services/iiif/servic... | /52\r\nS.B.A-\r\n\r\nDelivered for the\r\nFirs... | NaN |
2 | Susan B. Anthony Papers | Speeches and other writings | Susan B. Anthony Papers: Speeches and Writings... | mss11049038 | mss11049038-3 | 179297 | completed | http://tile.loc.gov/image-services/iiif/servic... | will the best & wisest of mothers continue\r\n... | temperance |
3 | Susan B. Anthony Papers | Speeches and other writings | Susan B. Anthony Papers: Speeches and Writings... | mss11049038 | mss11049038-4 | 179298 | completed | http://tile.loc.gov/image-services/iiif/servic... | [Mind] the youthful mind. Of how\r\nlittle av... | temperance |
4 | Susan B. Anthony Papers | Speeches and other writings | Susan B. Anthony Papers: Speeches and Writings... | mss11049038 | mss11049038-5 | 179299 | completed | http://tile.loc.gov/image-services/iiif/servic... | x\r\nWhile we labor to reclaim one generation ... | temperance |
First five lines for Catt dataset#
c.head()
Campaign | Project | Item | ItemId | Asset | AssetId | AssetStatus | DownloadUrl | Transcription | Tags | |
---|---|---|---|---|---|---|---|---|---|---|
0 | Carrie Chapman Catt Papers | Speeches and articles | Carrie Chapman Catt Papers: Speech and Article... | mss154040385 | mss154040385-1 | 189284 | completed | http://tile.loc.gov/image-services/iiif/servic... | CATT, Carrie Chapman\r\nSPEECH, ARTICLE, BOOK ... | NaN |
1 | Carrie Chapman Catt Papers | Speeches and articles | Carrie Chapman Catt Papers: Speech and Article... | mss154040385 | mss154040385-2 | 189285 | completed | http://tile.loc.gov/image-services/iiif/servic... | -2-\r\nWe appeal in the name of our foremother... | NaN |
2 | Carrie Chapman Catt Papers | Speeches and articles | Carrie Chapman Catt Papers: Speech and Article... | mss154040385 | mss154040385-3 | 189286 | completed | http://tile.loc.gov/image-services/iiif/servic... | AN APPEAL FOR LIBERTY. 1915\r\n\r\nBy Carri... | NaN |
3 | Carrie Chapman Catt Papers | Speeches and articles | Carrie Chapman Catt Papers: Speech and Article... | mss154040386 | mss154040386-1 | 189287 | completed | http://tile.loc.gov/image-services/iiif/servic... | CATT, Carrie Chapman\r\nSPEECH, ARTICLE, BOOK ... | NaN |
4 | Carrie Chapman Catt Papers | Speeches and articles | Carrie Chapman Catt Papers: Speech and Article... | mss154040386 | mss154040386-2 | 189288 | completed | http://tile.loc.gov/image-services/iiif/servic... | The \r\nWoman Citizen\r\nA WEEKLY CHRONICLE OF... | NaN |
First five lines for Stanton dataset#
s.head()
Campaign | Project | Item | ItemId | Asset | AssetId | AssetStatus | DownloadUrl | Transcription | Tags | |
---|---|---|---|---|---|---|---|---|---|---|
0 | Elizabeth Cady Stanton Papers | General correspondence | Elizabeth Cady Stanton Papers: General Corresp... | mss412100001 | mss412100001-1 | 179712 | completed | http://tile.loc.gov/image-services/iiif/servic... | Elizabeth Cady Stanton GENERAL CORRESPONDENCE... | NaN |
1 | Elizabeth Cady Stanton Papers | General correspondence | Elizabeth Cady Stanton Papers: General Corresp... | mss412100001 | mss412100001-2 | 179713 | completed | http://tile.loc.gov/image-services/iiif/servic... | The following four letters are \r\nfrom Daniel... | Peter Smith; Daniel Cady; Judge Cady |
2 | Elizabeth Cady Stanton Papers | General correspondence | Elizabeth Cady Stanton Papers: General Corresp... | mss412100001 | mss412100001-3 | 179714 | completed | http://tile.loc.gov/image-services/iiif/servic... | 22 ... | NaN |
3 | Elizabeth Cady Stanton Papers | General correspondence | Elizabeth Cady Stanton Papers: General Corresp... | mss412100001 | mss412100001-4 | 179715 | completed | http://tile.loc.gov/image-services/iiif/servic... | he could to make her respectable & happy. That... | Peter Smith; Bonaparte |
4 | Elizabeth Cady Stanton Papers | General correspondence | Elizabeth Cady Stanton Papers: General Corresp... | mss412100001 | mss412100001-5 | 179716 | completed | http://tile.loc.gov/image-services/iiif/servic... | Johnstown 2 D Paid 10\r\n\r\n\r\nPeter Smi... | Peter Smith |
First five lines for Terrell dataset#
t.head()
Campaign | Project | Item | ItemId | Asset | AssetId | AssetStatus | DownloadUrl | Transcription | Tags | |
---|---|---|---|---|---|---|---|---|---|---|
0 | Mary Church Terrell: Advocate for African Amer... | Address and appointment books | Mary Church Terrell Papers: Appointment Calend... | mss425490014 | mss425490014-1 | 7580 | completed | http://tile.loc.gov/image-services/iiif/servic... | Office Supplies typewriter ribbons fountain pe... | Mrs Ella Wheeler Wilcox; Woman Suffrage Conven... |
1 | Mary Church Terrell: Advocate for African Amer... | Address and appointment books | Mary Church Terrell Papers: Appointment Calend... | mss425490014 | mss425490014-2 | 7581 | completed | http://tile.loc.gov/image-services/iiif/servic... | March 16, Wednesday,1904 - Dr. Booker Washingt... | Cruger; Calloway; VanRensselaer; Booker; Washi... |
2 | Mary Church Terrell: Advocate for African Amer... | Address and appointment books | Mary Church Terrell Papers: Appointment Calend... | mss425490014 | mss425490014-3 | 7582 | completed | http://tile.loc.gov/image-services/iiif/servic... | Fountain Pens Repaired\r\nTablets\r\nTypewrite... | Pennsylvania; committee; Washington Post |
3 | Mary Church Terrell: Advocate for African Amer... | Address and appointment books | Mary Church Terrell Papers: Appointment Calend... | mss425490014 | mss425490014-4 | 7583 | completed | http://tile.loc.gov/image-services/iiif/servic... | May, 1904\r\n\r\n1 SUNDAY Received invitation ... | NaN |
4 | Mary Church Terrell: Advocate for African Amer... | Address and appointment books | Mary Church Terrell Papers: Appointment Calend... | mss425490014 | mss425490014-5 | 7584 | completed | http://tile.loc.gov/image-services/iiif/servic... | June, 1904\r\n\r\n7 TUESDAY Reached Bremer Hav... | Berlin; Congress morning; June 1904; Paris |
Create a new column containing the output of the tokens function#
The tokens function uses the previously loaded spaCy model to analyze each word in the transcription. This yields several values for each word, including the lemma, the part-of-speech tag, the shape of the word, and whether it is a stop word or number.
# NOTE: This will take a while to run
for dataset in [a, c, s, t]:
    print(f"Tokenizing text for dataset: {dataset['Campaign'][0]}")
    dataset['tokenized_text'] = dataset['Transcription'].apply(tokens)
print("Done!")
Tokenizing text for dataset: Susan B. Anthony Papers
Tokenizing text for dataset: Carrie Chapman Catt Papers
Tokenizing text for dataset: Elizabeth Cady Stanton Papers
Tokenizing text for dataset: Mary Church Terrell: Advocate for African Americans and Women
Done!
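The "shape" value mentioned above can be illustrated without loading a model. The following is a minimal pure-Python sketch that mimics how a spaCy-style word shape is derived: letters become x or X, digits become d, other characters are kept, and runs of the same symbol are capped at four. The name word_shape is our own and the function is an approximation for illustration, not a call into spaCy; the notebook's tokens function presumably reads the shape from the spaCy token itself.

```python
def word_shape(text: str) -> str:
    """Approximate a spaCy-style word shape for one token string.

    Illustrative helper only (an assumption, not part of the notebook).
    """
    shape = []
    last = ""   # the previous shape symbol
    run = 0     # how many times that symbol has repeated so far
    for char in text:
        if char.isalpha():
            shape_char = "X" if char.isupper() else "x"
        elif char.isdigit():
            shape_char = "d"
        else:
            shape_char = char  # punctuation and whitespace are kept as-is
        if shape_char == last:
            run += 1
        else:
            run = 0
            last = shape_char
        # Keep at most four consecutive copies of the same symbol
        if run < 4:
            shape.append(shape_char)
    return "".join(shape)


# Words taken from the previews in this section
for word in ["Susan", "AND", "1852", "/52", "1,600,000"]:
    print(word, "->", word_shape(word))
```

The shapes this produces for these words (Xxxxx, XXX, dddd, /dd, d,ddd,ddd) agree with the shape values visible in the tokenized_text previews later in this section.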
Create a new column containing the output of the entities function#
The entities function uses the previously loaded spaCy model to identify persons, places, organizations, etc.
# NOTE: This will take a while to run
for dataset in [a, c, s, t]:
    print(f"Identifying entities for dataset: {dataset['Campaign'][0]}")
    dataset['entities'] = dataset['Transcription'].apply(entities)
print("Done!")
Identifying entities for dataset: Susan B. Anthony Papers
Identifying entities for dataset: Carrie Chapman Catt Papers
Identifying entities for dataset: Elizabeth Cady Stanton Papers
Identifying entities for dataset: Mary Church Terrell: Advocate for African Americans and Women
Done!
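Judging from the previews in this section, each value in the entities column appears to be a list of tuples laid out as (text, start character, end character, label). Under that assumption, the sketch below shows how such tuples could be tallied by label or filtered for one label. The sample tuples are copied from several different rows of the previews; the helper names count_labels and texts_with_label are hypothetical and not part of the notebook.

```python
from collections import Counter

# Sample entity tuples copied from the previews in this section.
# Assumed layout: (text, start_char, end_char, label).
sample_entities = [
    ("Batavia", 44, 51, "GPE"),
    ("N.J.", 52, 56, "GPE"),
    ("Elizabeth Cady Stanton", 0, 22, "PERSON"),
    ("Johnstown", 0, 9, "GPE"),
]


def count_labels(entities):
    """Count how many entities carry each label."""
    return Counter(label for _text, _start, _end, label in entities)


def texts_with_label(entities, wanted):
    """Return the entity texts whose label equals `wanted`."""
    return [text for text, _start, _end, label in entities if label == wanted]


print(count_labels(sample_entities))
print(texts_with_label(sample_entities, "PERSON"))
```

This is the same positional access that the preview cells use when they read row[0] for the text and row[3] for the label.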
Optional: Preview the results of the entities function for the first row of a dataset#
Entities for first row in Anthony dataset#
pd.DataFrame([{"Text": row[0], "Entity": row[3]} for row in a['entities'].iloc[0]]).head(10)
Text | Entity | |
---|---|---|
0 | Susan B. Anthony SPEECHES | PERSON |
1 | WRITINGS FILE Delivered | ORG |
2 | first | ORDINAL |
3 | Batavia | GPE |
4 | 2 | CARDINAL |
5 | May 1852 | DATE |
6 | 1852 | DATE |
Entities for first row in Catt dataset#
pd.DataFrame([{"Text": row[0], "Entity": row[3]} for row in c['entities'].iloc[0]]).head(10)
Text | Entity | |
---|---|---|
0 | CATT | PERSON |
1 | Carrie Chapman\r\nSPEECH | PERSON |
2 | ARTICLE, BOOK FILE\r\nSpeech | LAW |
Entities for first row in Stanton dataset#
pd.DataFrame([{"Text": row[0], "Entity": row[3]} for row in s['entities'].iloc[0]]).head(10)
Text | Entity | |
---|---|---|
0 | Elizabeth Cady Stanton | PERSON |
1 | 1814 - 49 | DATE |
Entities for first row in Terrell dataset#
pd.DataFrame([{"Text": row[0], "Entity": row[3]} for row in t['entities'].iloc[0]]).head(10)
Text | Entity | |
---|---|---|
0 | Swett | PERSON |
1 | Stationery Blank Books | ORG |
2 | P Swett | PERSON |
3 | February, 1904 | DATE |
4 | 178 | CARDINAL |
5 | Monday\r\n2 Tuesday\r\n3 | DATE |
6 | Wednesday\r\n4 | DATE |
7 | Thursday \r\n5 Friday\r\n6 | DATE |
8 | Crandall Association | ORG |
9 | 7:30\r\nSpecial | TIME |
Run the separate_text function to isolate tokens by category#
The separate_text function uses labels generated by the spaCy library to sort the contents of each transcription into five categories: actual text, stop words (conjunctions, prepositions, etc.), non-alphanumeric strings (punctuation, whitespace, etc.), numbers, and ambiguous words. When a transcriber cannot make out a word or character, a ? is used for the unknown character(s); the ? then appears in the analyzed pattern of the word, which is used to remove these words from the text category.
# Run the separate_text function on each data frame
for dataset in [a, c, s, t]:
    print(f"Organizing tokens by category for: {dataset['Campaign'][0]}")
    separate_text(dataset)
print("Done!")
Organizing tokens by category for: Susan B. Anthony Papers
Organizing tokens by category for: Carrie Chapman Catt Papers
Organizing tokens by category for: Elizabeth Cady Stanton Papers
Organizing tokens by category for: Mary Church Terrell: Advocate for African Americans and Women
Done!
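The categorization described above can be sketched in plain Python. This is not the notebook's separate_text implementation, which is defined in the previous section and not shown here. It assumes each token is a tuple laid out as (text, lemma, part of speech, tag, dependency, shape, is-alphabetic flag, is-stop-word flag), which is a guess based on the previews below, and the rules and their order of precedence are likewise assumptions chosen to agree with the sample rows.

```python
from collections import defaultdict

# Assumed positions within each token tuple (inferred from the previews)
TEXT, LEMMA, POS, TAG, DEP, SHAPE, IS_ALPHA, IS_STOP = range(8)

# Parts of speech treated as non-word material in this sketch (an assumption)
NON_WORD_POS = {"PUNCT", "SPACE", "X", "SYM"}


def categorize(token):
    """Return the name of the category a token tuple would fall into."""
    if "?" in token[SHAPE]:
        return "ambigs"          # transcriber marked an unreadable character
    if token[IS_STOP]:
        return "stop_words"
    if token[POS] == "NUM":
        return "numbers"
    if token[POS] in NON_WORD_POS or not any(ch.isalnum() for ch in token[TEXT]):
        return "nonalphanums"
    return "text"


def separate(tokens):
    """Group token tuples into a dict keyed by category name."""
    buckets = defaultdict(list)
    for token in tokens:
        buckets[categorize(token)].append(token)
    return dict(buckets)


# Token tuples copied from the previews in this section
sample = [
    ("Susan", "Susan", "PROPN", "NNP", "compound", "Xxxxx", True, False),
    ("AND", "and", "CCONJ", "CC", "cc", "XXX", True, True),
    ("&", "&", "CCONJ", "CC", "cc", "&", False, False),
    ("1852", "1852", "NUM", "CD", "nummod", "dddd", False, False),
    ("?", "?", "ADJ", "JJ", "punct", "?", False, False),
    ("\r\n", "\r\n", "SPACE", "_SP", "dep", "\r\n", False, False),
]

for token in sample:
    print(repr(token[TEXT]), "->", categorize(token))
```

Checking for ambiguous tokens first keeps a word such as Wi? out of the text category even when its other attributes look like those of an ordinary word.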
Cache the result for next steps#
for dataset in [a, c, s, t]:
    write_cache(dataset, str(dataset['Campaign'][0]))
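The write_cache function is defined in the previous section, so its internals are not shown here. As a rough illustration of the idea only, the sketch below caches an object under a name using the standard library's pickle module. Every function name in it (cache_path_for, write_cache_sketch, read_cache_sketch) is hypothetical, as is the choice of pickle and the file naming scheme; the real helper may work quite differently.

```python
import pickle
import re
import tempfile
from pathlib import Path


def cache_path_for(name, cache_dir):
    """Build a file path from a cache name (hypothetical naming scheme)."""
    # Campaign names contain spaces, periods and colons, so reduce them
    # to a filesystem-friendly slug first.
    slug = re.sub(r"[^A-Za-z0-9]+", "_", name).strip("_").lower()
    return Path(cache_dir) / f"{slug}.pkl"


def write_cache_sketch(obj, name, cache_dir):
    """Pickle `obj` to a file derived from `name` and return the path."""
    path = cache_path_for(name, cache_dir)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as handle:
        pickle.dump(obj, handle)
    return path


def read_cache_sketch(name, cache_dir):
    """Load the object previously cached under `name`."""
    with cache_path_for(name, cache_dir).open("rb") as handle:
        return pickle.load(handle)


with tempfile.TemporaryDirectory() as tmp:
    name = "Mary Church Terrell: Advocate for African Americans and Women"
    saved = write_cache_sketch({"rows": 5}, name, tmp)
    print(saved.name)
    print(read_cache_sketch(name, tmp))
```

Using the campaign name as the cache key, as the loop above does, means a later step can look up each dataset by a human-readable label rather than by a variable name.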
Optional: Preview the results for the first five rows of the updated data frame#
First five rows of updated Anthony dataset#
a.iloc[0:5]
Campaign | Project | Item | ItemId | Asset | AssetId | AssetStatus | DownloadUrl | Transcription | Tags | tokenized_text | entities | text | stop_words | nonalphanums | numbers | ambigs | processed_text | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Susan B. Anthony Papers | Speeches and other writings | Susan B. Anthony Papers: Speeches and Writings... | mss11049038 | mss11049038-1 | 179295 | completed | http://tile.loc.gov/image-services/iiif/servic... | Susan B. Anthony SPEECHES AND WRITINGS FI... | May 1852 | [(Susan, Susan, PROPN, NNP, compound, Xxxxx, T... | [(Susan B. Anthony SPEECHES, 0, 30, PERSO... | [(Susan, Susan, PROPN, NNP, compound, Xxxxx, T... | [(AND, and, CCONJ, CC, cc, XXX, True, True), (... | [( , , SPACE, _SP, dep, , False, ... | [(2, 2, NUM, CD, nummod, d, False, False), (18... | [] | [susan, b., anthony, speeches, writing, file, ... |
1 | Susan B. Anthony Papers | Speeches and other writings | Susan B. Anthony Papers: Speeches and Writings... | mss11049038 | mss11049038-2 | 179296 | completed | http://tile.loc.gov/image-services/iiif/servic... | /52\r\nS.B.A-\r\n\r\nDelivered for the\r\nFirs... | NaN | [(/52, /52, PROPN, NNP, punct, /dd, False, Fal... | [(Batavia, 44, 51, GPE), (N.J., 52, 56, GPE), ... | [(/52, /52, PROPN, NNP, punct, /dd, False, Fal... | [(for, for, ADP, IN, prep, xxx, True, True), (... | [(\r\n, \r\n, SPACE, _SP, dep, \r\n, False, Fa... | [(1852, 1852, NUM, CD, nummod, dddd, False, Fa... | [] | [/52, s.b.a-, deliver, batavia, n.j., company,... |
2 | Susan B. Anthony Papers | Speeches and other writings | Susan B. Anthony Papers: Speeches and Writings... | mss11049038 | mss11049038-3 | 179297 | completed | http://tile.loc.gov/image-services/iiif/servic... | will the best & wisest of mothers continue\r\n... | temperance | [(will, will, AUX, MD, aux, xxxx, True, True),... | [(the\r\nSociety, 295, 307, ORG), (two, 324, 3... | [(best, good, ADJ, JJS, nsubj, xxxx, True, Fal... | [(will, will, AUX, MD, aux, xxxx, True, True),... | [(&, &, CCONJ, CC, cc, &, False, False), (\r\n... | [] | [] | [good, wise, mother, continue, son, fall, vict... |
3 | Susan B. Anthony Papers | Speeches and other writings | Susan B. Anthony Papers: Speeches and Writings... | mss11049038 | mss11049038-4 | 179298 | completed | http://tile.loc.gov/image-services/iiif/servic... | [Mind] the youthful mind. Of how\r\nlittle av... | temperance | [([, [, X, XX, dep, [, False, False), (Mind, m... | [(christian, 77, 86, NORP), (truth & sobernes... | [(Mind, mind, VERB, VB, dep, Xxxx, True, False... | [(the, the, DET, DT, det, xxx, True, True), (O... | [([, [, X, XX, dep, [, False, False), (], ], X... | [] | [] | [mind, youthful, mind, little, avail, untire, ... |
4 | Susan B. Anthony Papers | Speeches and other writings | Susan B. Anthony Papers: Speeches and Writings... | mss11049038 | mss11049038-5 | 179299 | completed | http://tile.loc.gov/image-services/iiif/servic... | x\r\nWhile we labor to reclaim one generation ... | temperance | [(x, x, ADP, IN, punct, x, True, False), (\r\n... | [(one, 29, 32, CARDINAL), (Legislature, 145, 1... | [(x, x, ADP, IN, punct, x, True, False), (labo... | [(While, while, SCONJ, IN, mark, Xxxxx, True, ... | [(\r\n, \r\n, SPACE, _SP, dep, \r\n, False, Fa... | [] | [] | [x, labor, reclaim, generation, drunkard, rise... |
First five rows of updated Catt dataset#
c.iloc[0:5]
Campaign | Project | Item | ItemId | Asset | AssetId | AssetStatus | DownloadUrl | Transcription | Tags | tokenized_text | entities | text | stop_words | nonalphanums | numbers | ambigs | processed_text | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Carrie Chapman Catt Papers | Speeches and articles | Carrie Chapman Catt Papers: Speech and Article... | mss154040385 | mss154040385-1 | 189284 | completed | http://tile.loc.gov/image-services/iiif/servic... | CATT, Carrie Chapman\r\nSPEECH, ARTICLE, BOOK ... | NaN | [(CATT, CATT, PROPN, NNP, ROOT, XXXX, True, Fa... | [(CATT, 0, 4, PERSON), (Carrie Chapman\r\nSPEE... | [(CATT, CATT, PROPN, NNP, ROOT, XXXX, True, Fa... | [(An, an, DET, DT, det, Xx, True, True), (For,... | [(,, ,, PUNCT, ,, punct, ,, False, False), (\r... | [] | [] | [catt, carrie, chapman, speech, article, book,... |
1 | Carrie Chapman Catt Papers | Speeches and articles | Carrie Chapman Catt Papers: Speech and Article... | mss154040385 | mss154040385-2 | 189285 | completed | http://tile.loc.gov/image-services/iiif/servic... | -2-\r\nWe appeal in the name of our foremother... | NaN | [(-2-, -2-, PUNCT, ``, punct, -d-, False, Fals... | [(-2-\r\n, 0, 5, PERSON), (American, 428, 436,... | [(appeal, appeal, VERB, VBP, ccomp, xxxx, True... | [(We, we, PRON, PRP, nsubj, Xx, True, True), (... | [(-2-, -2-, PUNCT, ``, punct, -d-, False, Fals... | [(1,600,000, 1,600,000, NUM, CD, nummod, d,ddd... | [] | [appeal, foremother, forefather, equal, courag... |
2 | Carrie Chapman Catt Papers | Speeches and articles | Carrie Chapman Catt Papers: Speech and Article... | mss154040385 | mss154040385-3 | 189286 | completed | http://tile.loc.gov/image-services/iiif/servic... | AN APPEAL FOR LIBERTY. 1915\r\n\r\nBy Carri... | NaN | [(AN, an, DET, DT, det, XX, True, True), (APPE... | [(1915, 26, 30, DATE), (Carrie Chapman Catt, 3... | [(APPEAL, APPEAL, PROPN, NNP, ROOT, XXXX, True... | [(AN, an, DET, DT, det, XX, True, True), (FOR,... | [(., ., PUNCT, ., punct, ., False, False), ( ... | [(1915, 1915, NUM, CD, ROOT, dddd, False, Fals... | [] | [appeal, liberty, carrie, chapman, catt, year,... |
3 | Carrie Chapman Catt Papers | Speeches and articles | Carrie Chapman Catt Papers: Speech and Article... | mss154040386 | mss154040386-1 | 189287 | completed | http://tile.loc.gov/image-services/iiif/servic... | CATT, Carrie Chapman\r\nSPEECH, ARTICLE, BOOK ... | NaN | [(CATT, CATT, PROPN, NNP, ROOT, XXXX, True, Fa... | [(CATT, 0, 4, PERSON), (Carrie Chapman\r\nSPEE... | [(CATT, CATT, PROPN, NNP, ROOT, XXXX, True, Fa... | [(Be, be, AUX, VB, ROOT, Xx, True, True)] | [(,, ,, PUNCT, ,, punct, ,, False, False), (\r... | [] | [] | [catt, carrie, chapman, speech, article, book,... |
4 | Carrie Chapman Catt Papers | Speeches and articles | Carrie Chapman Catt Papers: Speech and Article... | mss154040386 | mss154040386-2 | 189288 | completed | http://tile.loc.gov/image-services/iiif/servic... | The \r\nWoman Citizen\r\nA WEEKLY CHRONICLE OF... | NaN | [(The, the, DET, DT, det, Xxx, True, True), (\... | [(Carrie Chapman Catt, 156, 175, PERSON), (Con... | [(Woman, Woman, PROPN, NNP, compound, Xxxxx, T... | [(The, the, DET, DT, det, Xxx, True, True), (A... | [(\r\n, \r\n, SPACE, _SP, dep, \r\n, False, Fa... | [(21, 21, NUM, CD, nummod, dd, False, False), ... | [] | [woman, citizen, weekly, chronicle, progress, ... |
First five rows of updated Stanton dataset#
s.iloc[0:5]
Campaign | Project | Item | ItemId | Asset | AssetId | AssetStatus | DownloadUrl | Transcription | Tags | tokenized_text | entities | text | stop_words | nonalphanums | numbers | ambigs | processed_text | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Elizabeth Cady Stanton Papers | General correspondence | Elizabeth Cady Stanton Papers: General Corresp... | mss412100001 | mss412100001-1 | 179712 | completed | http://tile.loc.gov/image-services/iiif/servic... | Elizabeth Cady Stanton GENERAL CORRESPONDENCE... | NaN | [(Elizabeth, Elizabeth, PROPN, NNP, compound, ... | [(Elizabeth Cady Stanton, 0, 22, PERSON), (181... | [(Elizabeth, Elizabeth, PROPN, NNP, compound, ... | [] | [( , , SPACE, _SP, dep, , False, False), (-,... | [(1814, 1814, NUM, CD, appos, dddd, False, Fal... | [] | [elizabeth, cady, stanton, general, correspond... |
1 | Elizabeth Cady Stanton Papers | General correspondence | Elizabeth Cady Stanton Papers: General Corresp... | mss412100001 | mss412100001-2 | 179713 | completed | http://tile.loc.gov/image-services/iiif/servic... | The following four letters are \r\nfrom Daniel... | Peter Smith; Daniel Cady; Judge Cady | [(The, the, DET, DT, det, Xxx, True, True), (f... | [(four, 14, 18, CARDINAL), (Daniel Cady, 38, 4... | [(following, follow, VERB, VBG, amod, xxxx, Tr... | [(The, the, DET, DT, det, Xxx, True, True), (f... | [(\r\n, \r\n, SPACE, _SP, dep, \r\n, False, Fa... | [] | [] | [follow, letter, daniel, cady, peter, smith, j... |
2 | Elizabeth Cady Stanton Papers | General correspondence | Elizabeth Cady Stanton Papers: General Corresp... | mss412100001 | mss412100001-3 | 179714 | completed | http://tile.loc.gov/image-services/iiif/servic... | 22 ... | NaN | [(22, 22, NUM, CD, ROOT, dd, False, False), ( ... | [(22, 0, 2, CARDINAL), (2 Dec. 1814, 91, 102, ... | [(Dec., Dec., PROPN, NNP, npadvmod, Xxx., Fals... | [(It, it, PRON, PRP, nsubj, Xx, True, True), (... | [( ... | [(22, 22, NUM, CD, ROOT, dd, False, False), (2... | [] | [dec., dear, sir, true, lose, young, child, th... |
3 | Elizabeth Cady Stanton Papers | General correspondence | Elizabeth Cady Stanton Papers: General Corresp... | mss412100001 | mss412100001-4 | 179715 | completed | http://tile.loc.gov/image-services/iiif/servic... | he could to make her respectable & happy. That... | Peter Smith; Bonaparte | [(he, he, PRON, PRP, nsubj, xx, True, True), (... | [(one, 467, 470, CARDINAL), (one, 614, 617, CA... | [(respectable, respectable, ADJ, JJ, ccomp, xx... | [(he, he, PRON, PRP, nsubj, xx, True, True), (... | [(&, &, CCONJ, CC, cc, &, False, False), (., .... | [(2d, 2d, NUM, CD, nummod, dx, False, False), ... | [] | [respectable, happy, moment, flatter, soon, se... |
4 | Elizabeth Cady Stanton Papers | General correspondence | Elizabeth Cady Stanton Papers: General Corresp... | mss412100001 | mss412100001-5 | 179716 | completed | http://tile.loc.gov/image-services/iiif/servic... | Johnstown 2 D Paid 10\r\n\r\n\r\nPeter Smi... | Peter Smith | [(Johnstown, Johnstown, PROPN, NNP, nmod, Xxxx... | [(Johnstown, 0, 9, GPE), (10, 23, 25, CARDINAL... | [(Johnstown, Johnstown, PROPN, NNP, nmod, Xxxx... | [] | [( , , SPACE, _SP, dep, , False, Fa... | [(2, 2, NUM, CD, nummod, d, False, False), (10... | [] | [johnstown, d, paid, peter, smith, esquire, pe... |
First five rows of updated Terrell dataset#
t.iloc[0:5]
Campaign | Project | Item | ItemId | Asset | AssetId | AssetStatus | DownloadUrl | Transcription | Tags | tokenized_text | entities | text | stop_words | nonalphanums | numbers | ambigs | processed_text | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Mary Church Terrell: Advocate for African Amer... | Address and appointment books | Mary Church Terrell Papers: Appointment Calend... | mss425490014 | mss425490014-1 | 7580 | completed | http://tile.loc.gov/image-services/iiif/servic... | Office Supplies typewriter ribbons fountain pe... | Mrs Ella Wheeler Wilcox; Woman Suffrage Conven... | [(Office, office, NOUN, NN, compound, Xxxxx, T... | [(Swett, 101, 106, PERSON), (Stationery Blank ... | [(Office, office, NOUN, NN, compound, Xxxxx, T... | [(’s, ’s, PART, POS, case, ’x, False, True), (... | [(\r\n, \r\n, SPACE, _SP, dep, \r\n, False, Fa... | [(603, 603, NUM, CD, nummod, ddd, False, False... | [] | [office, supply, typewriter, ribbon, fountain,... |
1 | Mary Church Terrell: Advocate for African Amer... | Address and appointment books | Mary Church Terrell Papers: Appointment Calend... | mss425490014 | mss425490014-2 | 7581 | completed | http://tile.loc.gov/image-services/iiif/servic... | March 16, Wednesday,1904 - Dr. Booker Washingt... | Cruger; Calloway; VanRensselaer; Booker; Washi... | [(March, March, PROPN, NNP, npadvmod, Xxxxx, T... | [(March 16, 0, 8, DATE), (Booker, 31, 37, PERS... | [(March, March, PROPN, NNP, npadvmod, Xxxxx, T... | [(as, as, ADP, IN, prep, xx, True, True), (our... | [(,, ,, PUNCT, ,, punct, ,, False, False), (-,... | [(16, 16, NUM, CD, nummod, dd, False, False), ... | [] | [march, wednesday,1904, dr., booker, washingto... |
2 | Mary Church Terrell: Advocate for African Amer... | Address and appointment books | Mary Church Terrell Papers: Appointment Calend... | mss425490014 | mss425490014-3 | 7582 | completed | http://tile.loc.gov/image-services/iiif/servic... | Fountain Pens Repaired\r\nTablets\r\nTypewrite... | Pennsylvania; committee; Washington Post | [(Fountain, Fountain, PROPN, NNP, compound, Xx... | [(Fountain Pens Repaired\r\nTablets\r\nTypewri... | [(Fountain, Fountain, PROPN, NNP, compound, Xx... | [('s, 's, PART, POS, case, 'x, False, True), (... | [(\r\n, \r\n, SPACE, _SP, dep, \r\n, False, Fa... | [(603, 603, NUM, CD, nummod, ddd, False, False... | [(?, ?, ADJ, JJ, punct, ?, False, False), (Wi?... | [fountain, pens, repaired, tablet, typewriter,... |
3 | Mary Church Terrell: Advocate for African Amer... | Address and appointment books | Mary Church Terrell Papers: Appointment Calend... | mss425490014 | mss425490014-4 | 7583 | completed | http://tile.loc.gov/image-services/iiif/servic... | May, 1904\r\n\r\n1 SUNDAY Received invitation ... | NaN | [(May, May, PROPN, NNP, nmod, Xxx, True, True)... | [(May, 1904, 0, 9, DATE), (1, 13, 14, CARDINAL... | [(SUNDAY, SUNDAY, PROPN, NNP, appos, XXXX, Tru... | [(May, May, PROPN, NNP, nmod, Xxx, True, True)... | [(,, ,, PUNCT, ,, punct, ,, False, False), (\r... | [(1904, 1904, NUM, CD, nummod, dddd, False, Fa... | [] | [sunday, receive, invitation, fran, olga, mr, ... |
4 | Mary Church Terrell: Advocate for African Amer... | Address and appointment books | Mary Church Terrell Papers: Appointment Calend... | mss425490014 | mss425490014-5 | 7584 | completed | http://tile.loc.gov/image-services/iiif/servic... | June, 1904\r\n\r\n7 TUESDAY Reached Bremer Hav... | Berlin; Congress morning; June 1904; Paris | [(June, June, PROPN, NNP, npadvmod, Xxxx, True... | [(June, 1904, 0, 10, DATE), (7, 14, 15, CARDIN... | [(June, June, PROPN, NNP, npadvmod, Xxxx, True... | [(in, in, ADP, IN, prep, xx, True, True), (at,... | [(,, ,, PUNCT, ,, punct, ,, False, False), (\r... | [(1904, 1904, NUM, CD, nummod, dddd, False, Fa... | [] | [june, tuesday, reach, bremer, haven, morning,... |
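Comparing the text and processed_text columns in the previews above suggests that processed_text holds the lowercased lemma of each token in the text category (for example, best becomes good and SPEECHES becomes speeches). The sketch below shows that transformation under this assumption; the function name processed_from_text is hypothetical, and only the first sample tuple is copied from the preview, while the other two use made-up attribute values for illustration.

```python
def processed_from_text(text_tokens):
    """Return the lowercased lemma (second tuple field) of each token."""
    return [token[1].lower() for token in text_tokens]


text_tokens = [
    # Copied from the Anthony preview
    ("best", "good", "ADJ", "JJS", "nsubj", "xxxx", True, False),
    # Hypothetical attribute values, for illustration only
    ("wisest", "wise", "ADJ", "JJS", "conj", "xxxx", True, False),
    ("mothers", "mother", "NOUN", "NNS", "pobj", "xxxx", True, False),
]

print(processed_from_text(text_tokens))
```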