whtranscripts helps you fetch and parse the American Presidency Project's press-briefing and presidential-news-conference transcripts.
whtranscripts is a Python library. To install it, run:
pip install whtranscripts
To download the HTML of all news-conference transcripts:
mkdir ~/Downloads/conference-transcripts
python -m "whtranscripts.download" -t conference --dest ~/Downloads/conference-transcripts/
For press briefings:
mkdir ~/Downloads/another-dir
python -m "whtranscripts.download" -t briefing --dest ~/Downloads/another-dir/
You can also limit downloads to a particular range of years, e.g., from 2001 through 2008:
python -m "whtranscripts.download" -t conference --dest ~/Downloads/conference-transcripts/ --start 2001 --end 2008
You can load single transcripts from a file, a URL, or the HTML itself. From a file:
import whtranscripts
transcript = whtranscripts.Conference.from_path("test/pages/conferences/99975.html")
Alternatively, for a briefing:
import whtranscripts
transcript = whtranscripts.Briefing.from_path("test/pages/briefings/47646.html")
From a URL:
import whtranscripts
url = "http://www.presidency.ucsb.edu/ws/index.php?pid=99975"
transcript = whtranscripts.Conference.from_url(url)
Directly from American Presidency Project HTML:
import whtranscripts
import requests
url = "http://www.presidency.ucsb.edu/ws/index.php?pid=99975"
html = requests.get(url).content
transcript = whtranscripts.Conference(html)
You can also load multiple transcripts at once, from a directory:
import whtranscripts
transcripts = whtranscripts.Conference.from_dir("test/pages/conferences")
Note: The files you want to parse from a directory must have names ending in .html.
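Before calling from_dir, it can help to check which files in a directory actually satisfy the .html requirement noted above. The following is a minimal sketch using only the standard library; html_files is a hypothetical helper, not part of whtranscripts, and the file names are made up for the demonstration. How the library itself matches file names (for example, whether it is case sensitive) is not stated here, so treat this as an approximation.

```python
import os
import tempfile


def html_files(directory):
    """Return the sorted paths in `directory` whose names end in .html.

    Hypothetical helper for illustration; not part of whtranscripts.
    """
    return sorted(
        os.path.join(directory, name)
        for name in os.listdir(directory)
        if name.endswith(".html")
    )


# Demonstrate on a throwaway directory with made-up file names.
demo_dir = tempfile.mkdtemp()
for name in ("99975.html", "47646.htm", "notes.txt"):
    open(os.path.join(demo_dir, name), "w").close()

# Only the file whose name ends in .html is kept.
eligible = [os.path.basename(path) for path in html_files(demo_dir)]
print(eligible)
```

Files with other extensions, such as .htm, would need to be renamed before they match this check.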
Each Conference and Briefing has the following attributes:
doc_id: The document ID assigned to it by the American Presidency Project.
date: The date the conference or briefing took place.
president: The U.S. president at the time of the conference or briefing.
passages: A list of Passage objects.
Each Passage object has the following attributes:
speaker: The person who spoke the passage.
is_question: False if the speaker was a government official or guest, True if they were someone from the audience.
text: What was said.
transcript: A pointer back to the parent transcript in which this passage can be found.
tokens: All of the tokens in the passage (produced by NLTK's word_tokenize function). Requires NLTK to be installed.
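As an illustration of how these attributes might be used, the sketch below separates questions from answers with the is_question flag. To keep it runnable without downloading anything, it uses a stand-in Passage namedtuple and made-up sample passages; real Passage objects come from a parsed transcript's passages list and carry the additional attributes listed above.

```python
from collections import namedtuple

# Stand-in for illustration only; real Passage objects come from whtranscripts.
Passage = namedtuple("Passage", ["speaker", "is_question", "text"])

# Made-up sample data, not taken from any real transcript.
passages = [
    Passage("Q.", True, "Will you sign the bill?"),
    Passage("The President", False, "We are still reviewing it."),
    Passage("Q.", True, "When will you decide?"),
]

# Split the passages by who was speaking.
questions = [p for p in passages if p.is_question]
answers = [p for p in passages if not p.is_question]

print(len(questions), len(answers))
```

With a real transcript, the same comprehensions should work on transcript.passages, assuming the attributes behave as described above.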
Each Passage object also has the following methods:
get_word_count: Returns the total word count of the passage, found by splitting on spaces.
count_occurrences: Returns the total number of occurrences of a string. Note: This method also matches strings inside of words, so "go" will match twice in "I wish I could go somewhere a long time ago." (once in "go" and once in "ago"). By default, the search is not case sensitive. Pass case_sensitive=True to make it case sensitive.
count_token_occurrences: Similar to count_occurrences, but uses the tokens generated by NLTK. Raises an error if NLTK is not installed.
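The difference between substring counting and token counting can be sketched in plain Python. This is not the library's implementation: the function names below are hypothetical, a simple regular expression stands in for NLTK's tokenizer, and the case-insensitive token matching is an assumption.

```python
import re

sentence = "I wish I could go somewhere a long time ago."


def word_count(text):
    """Count words by splitting on spaces."""
    return len(text.split(" "))


def count_substring(text, needle, case_sensitive=False):
    """Count occurrences of `needle` anywhere in `text`, even inside words."""
    if not case_sensitive:
        text, needle = text.lower(), needle.lower()
    return text.count(needle)


def count_token(text, needle):
    """Count whole-word matches; the regex is a stand-in for NLTK tokens."""
    tokens = re.findall(r"\w+", text.lower())
    return tokens.count(needle.lower())


print(word_count(sentence))             # 10
print(count_substring(sentence, "go"))  # 2: "go" and "ago"
print(count_token(sentence, "go"))      # 1: only the word "go"
```

Substring counting suits phrase searches, while token counting avoids false matches inside longer words.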
You can export transcripts as CSVs, using the TranscriptSet class:
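import sys
import whtranscripts
# Collect the news-conference transcript URLs for 2013.
urls = whtranscripts.download.get_urls("conference", 2013, 2013)
# Fetch and parse each transcript into a list.
transcripts = [whtranscripts.conference.Conference.from_url(url) for url in urls]
t_set = whtranscripts.TranscriptSet(transcripts)
# Write the CSV to standard output.
t_set.to_csv(sys.stdout)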