FAQ
How current is the data?
Ranges are automatically calculated from the earliest to latest datum of each type:
- Transcripts: 2000-01-24 to 2021-06-25
- Bill and sponsorship data: 2000-01-24 to 2021-06-23
- Privately sponsored travel data (House only): 2007-08-24 to 2021-06-02
Congress began publicly reporting member and staffer travel sponsored by private entities in 2007 with the Honest Leadership and Open Government Act. Unfortunately, the Senate dataset is not easily analyzed, since important fields like the sponsor and destination are only available in the scanned reimbursement affidavits.
Other travel sponsored by foreign governments is not yet required to be reported under the Mutual Educational and Cultural Exchange Act of 1961.
How does it work?
Transcripts are automatically chunked by speaker and treated to remove irrelevancies like case, punctuation, timestamps, page numbers, and procedural speech. Each chunk is split into words, or tokenized, and common "stopwords" like "a", "this", "for", and so on are removed. At this point, the tokens are copied and stemmed to remove inflections (e.g. "organization" becomes "organ", losing both "-ation" and "-ize" suffixes). Both the stemmed and the original tokens are then grouped into sliding windows of multiple lengths, or grams.
For example, here's a selection from Eisenhower's farewell address, followed by the same text with stopwords removed and split into grams of length 3, or trigrams:
In the councils of government, we must guard against the acquisition of unwarranted influence, whether sought or unsought, by the military-industrial complex. The potential for the disastrous rise of misplaced power exists and will persist.
councils government must | government must guard | must guard acquisition |
guard acquisition unwarranted | acquisition unwarranted influence | unwarranted influence whether |
influence whether sought | whether sought unsought | sought unsought military |
unsought military industrial | military industrial complex | industrial complex potential |
complex potential disastrous | potential disastrous rise | disastrous rise misplaced |
rise misplaced power | misplaced power exists | power exists persist |
Most of these trigrams are likely unique to these few sentences. Eisenhower could well have mentioned that the "government must guard" against this or that on other occasions, and it's not hard to imagine him deploying "whether sought [or] unsought" a time or two either. However, the standout gram here is the subject of the speech: the "military[-]industrial complex". The 1961 farewell address came near the end of Eisenhower's speechmaking career, but politicians today can hardly discuss the imbrication of the military with mining and extraction, manufacturing, tech, and other interests without calling it the "military-industrial complex", over and over.
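As a rough illustration of the steps above, the following Python sketch tokenizes the Eisenhower passage, drops stopwords, and builds sliding-window grams. The stopword set is an assumption chosen to cover this passage only, the function names are made up for the example, and stemming is left out because it would need a stemming library. It is a sketch of the general technique, not the Concordance's actual code.

```python
import re

# Illustrative stopword set: just enough to cover the passage below.
# The Concordance's real list is not given in this FAQ.
STOPWORDS = {"in", "the", "of", "we", "against", "or", "by", "for", "and", "will"}


def tokenize(text):
    """Lowercase, treat punctuation as a separator, and drop stopwords."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if w not in STOPWORDS]


def grams(tokens, n):
    """Slide a window of length n across the tokens, one step at a time."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


passage = (
    "In the councils of government, we must guard against the acquisition "
    "of unwarranted influence, whether sought or unsought, by the "
    "military-industrial complex. The potential for the disastrous rise "
    "of misplaced power exists and will persist."
)

tokens = tokenize(passage)
trigrams = grams(tokens, 3)
print(" | ".join(trigrams))
```

Calling grams(tokens, n) with several values of n yields the windows of multiple lengths described earlier.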
The Lexington Concordance's fundamental assumption is that frequency correlates with focus. Grams which appear more often indicate topics of importance to the speaker (and the odd rhetorical flourish), and trends of gram appearance and disappearance over time show incompletely but in detail where speakers are concentrating their attention and efforts.
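To make the frequency assumption concrete, here is a minimal Python sketch that tallies grams per speaker and year, then reads off the most frequent grams and a year-by-year trend. The records, speaker names, and function names are all invented for illustration; the real database schema is not described in this FAQ.

```python
from collections import Counter, defaultdict

# Hypothetical input: (speaker, year, gram) records as they might come out
# of the gramming step. Names and values are invented for illustration.
records = [
    ("Speaker A", 2019, "military industrial complex"),
    ("Speaker A", 2019, "military industrial complex"),
    ("Speaker A", 2019, "clean drinking water"),
    ("Speaker A", 2020, "clean drinking water"),
    ("Speaker B", 2020, "military industrial complex"),
]

# counts[speaker][year] is a Counter mapping gram -> number of occurrences.
counts = defaultdict(lambda: defaultdict(Counter))
for speaker, year, gram in records:
    counts[speaker][year][gram] += 1


def top_grams(speaker, year, k=10):
    """Most frequent grams for one speaker in one year."""
    return counts[speaker][year].most_common(k)


def trend(speaker, gram):
    """Year-by-year counts of one gram for one speaker, skipping empty years."""
    return {
        year: c[gram]
        for year, c in sorted(counts[speaker].items())
        if c[gram]
    }


print(top_grams("Speaker A", 2019))
print(trend("Speaker A", "clean drinking water"))
```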
Where do the transcripts and speaker data come from?
The Congressional Record is an official government publication, with archives at https://www.govinfo.gov/content/pkg/CREC-yyyy-mm-dd.zip for each day Congress is in session. From there it's a fairly simple matter to download and extract the relevant files:
- mods.xml lists metadata for all pages in the archive.
- html/CREC-YYYY-mm-dd-pt1-PgXYZA.htm contains the content with a minimum of easily discarded markup. Page identifiers consist of a letter (we're interested in S for Senate, H for House, and E for Extensions of Remarks) followed by digits.
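The URL pattern and page-file naming above can be sketched in Python as follows. The regular expression is an assumption based only on the filename pattern shown here (any part number, one capital letter, then digits, with anything after the digits ignored), and the sample filenames in the usage lines are hypothetical. No download is performed.

```python
import re
from datetime import date

# URL pattern as given above, with the date placeholders filled in.
ARCHIVE_URL = "https://www.govinfo.gov/content/pkg/CREC-{day}.zip"

# Assumption: a page id is "Pg", one capital letter, then digits.
# Anything after the digits is ignored by this pattern.
PAGE_RE = re.compile(r"html/CREC-(\d{4}-\d{2}-\d{2})-pt\d+-Pg([A-Z])(\d+)")

# The three sections of interest named above.
SECTIONS = {"S": "Senate", "H": "House", "E": "Extensions of Remarks"}


def archive_url(day):
    """Build the archive URL for one day of the Congressional Record."""
    return ARCHIVE_URL.format(day=day.isoformat())


def classify(path):
    """Return (date, section, page number), or None for files we skip."""
    m = PAGE_RE.search(path)
    if m is None or m.group(2) not in SECTIONS:
        return None
    return m.group(1), SECTIONS[m.group(2)], int(m.group(3))


print(archive_url(date(2021, 6, 25)))
print(classify("html/CREC-2021-06-25-pt1-PgS4501.htm"))  # hypothetical filename
print(classify("mods.xml"))  # metadata file, not a page
```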
Speaker info is sourced from the @unitedstates project's congressional directory; bill metadata comes from ProPublica; and travel information is available from the office of the Clerk of the House.
Is this a complete concordance?
It is not! The first criterion: grammed or stemmed phrases must appear at least twice to be presented. Everything's counted in the database, so when a phrase is first repeated, both occurrences will automatically become visible. If you want to see whether a speaker ever mentioned something, for now you'll need to search the Congressional Record directly.
Second: there's a lot of procedural speech in Congress. Members are constantly asking each other for unanimous consent or the absence of a quorum, yielding or reserving time, announcing committee meetings and others' absences, and so on. That gets filtered out up front, since otherwise it'd overwhelm the counts for actually relevant grams. The syntactic diversity of the English language makes this difficult, and with hundreds of speakers it's nearly inevitable that some individual formulations will make it through and be counted. Typos and errors in the source text, not to mention errors and omissions in the Concordance's programming itself, can also affect the completeness of the dataset.
The line between the procedural and the individual is drawn more or less precisely between "please join me" and "in [recognizing, congratulating, honoring] so-and-so". Everybody asks other members of Congress to join them, for multiple purposes. But while most congresspeople also commemorate constituents, organizations, and businesses in their districts or states with some frequency, the forms and subjects of those tributes are matters of individual choice and expression.
Third: "personal explanations" for absences and misvotes are also skipped at present, since for every digression on water pollution there are a thousand family commitments and missed flights pushing the overall signal-to-noise ratio too low.
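The first two criteria above can be sketched in Python as a pattern-based procedural filter plus a minimum-occurrence threshold. The patterns below are hypothetical stand-ins; the Concordance's real filter rules are not listed in this FAQ and would need to cover far more formulations.

```python
import re
from collections import Counter

# Hypothetical patterns for procedural speech, for illustration only.
PROCEDURAL = [
    re.compile(r"\bunanimous consent\b"),
    re.compile(r"\babsence of a quorum\b"),
    re.compile(r"\b(yield|reserve) (back )?(the balance of )?(my|the) time\b"),
]

MIN_OCCURRENCES = 2  # the "at least twice" criterion


def is_procedural(sentence):
    """True if the sentence matches any of the procedural patterns."""
    lowered = sentence.lower()
    return any(p.search(lowered) for p in PROCEDURAL)


def visible(gram_counts):
    """Keep only grams counted at least MIN_OCCURRENCES times."""
    return {g: n for g, n in gram_counts.items() if n >= MIN_OCCURRENCES}


sentences = [
    "I suggest the absence of a quorum.",
    "We must guard against unwarranted influence.",
    "I yield back the balance of my time.",
]
kept = [s for s in sentences if not is_procedural(s)]
print(kept)
print(visible(Counter({"military industrial complex": 2, "power exists persist": 1})))
```

Fixed patterns like these will always miss some phrasings, which is the leakage described above.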
Is it true that extensions of remarks are never actually spoken on the floor?
Yes. However, floor time is limited (some members of Congress take to the floor only a few times in a year!) and extensions are explicitly intended to be treated as speech, so those inclusions in the Record are the next best thing for getting a more complete picture.
This does result in some duplication, as extensions are often revisions of previous speech on the floor. But since the act of revision itself emphasizes the content for which the extra effort is being made, the question of whether such double-counted grams are pollution in the dataset or an accurate reflection of meta-textual considerations is not one with a clear answer.
Occasionally it results in a lot of duplication, as with Ed Perlmutter and Sam Graves. They're given to firing off carbon-copy congratulations as extensions, absolutely wrecking their top-10 charts. Addressing this more problematic case is not a priority at the moment due to its low incidence.
Why?
The Obama quote about accountability on the home page is one reason, but more than that, the Lexington Concordance is an attempt to make incumbency harder.
Incumbents are naturally favored to win elections: after all, they've already won at least one, and have the name recognition and financial head start that come with holding office. It's hard enough for challengers to overcome those built-in advantages, and American institutions like gerrymandering and first-past-the-post voting make for even grimmer prospects.
By now, many American legislators have clung to their seats for decades. Such protracted incumbencies tend to inspire complacency and comfort with a status quo which is increasingly untenable outside the halls of the Capitol, with the result that a great number of our aging career politicians now govern in a world and a political era for which they are not equipped and to which they as yet see no reason to adapt. The two-party system represents and works for fewer and fewer Americans: "neither" consistently outpolls either "Democrat" or "Republican", but the same structural forces that benefit incumbents also make it next to impossible for a third party to gain traction.
At the same time, discourse, association, and movement are all more and more subject to observation and analysis both by the government and by private interests -- which latter surveillance is directly enabled by legislators quicker to heed corporate PAC donations and lobbying than their constituents' rights and needs. Just going out in public nowadays is enough to get your picture taken and matched against private facial recognition databases without your knowledge, consent, or control over how your likeness might be used in the future. Websites track your every move and build detailed profiles which are sold and combined in order to advertise to you more effectively (this site is ad-free and uses Plausible to collect anonymous traffic and visit stats). If we're all supposed to take technological panopticism as a fact of life now, we might as well see what we can get out of subjecting its enablers' discourse to the same algorithmic and statistical analysis.
The longer someone serves in Congress, the more they talk. What they talk about, together with their legislative and voting records, can help us better judge their priorities, their effectiveness, and above all, whether their continued service is useful to our own political goals. The Concordance is ultimately a research tool for prospective challengers looking to make the case to voters that they can do better.