Congressional Research Service Syndication Feed


One of the hidden gems of the Library of Congress is the Congressional Research Service (CRS). With a staff of about 600 researchers, analysts, and writers, the CRS provides "policy and legal analysis to committees and Members of both the House and Senate, regardless of party affiliation." It is kind of like a "think tank" for the members of Congress. An extensive selection of their reports is available from the CRS homepage, and, as government publications, the reports are not subject to copyright; any CRS Report may be reproduced and distributed without permission. And they publish a lot of reports. (Read more on the CRS frequently-asked-questions page.)

I remember learning about the CRS in library school, but what got me interested in them again was a post on Mastodon about an Introduction to Cryptocurrency report that they produced. At just 2 pages long, it was a concise yet thorough review of the topic, ranging from how they work to questions of regulation. Useful stuff! And that wasn't the only useful report I (re-)discovered on the site.

An Automated Syndication Feed

The problem is that no automated RSS/Atom feed of CRS reports exists. Use your favorite search engine to look for "Congressional Research Service RSS or Atom"; you'll find a few attempts to gather selected reports, and some comprehensive archives that stopped functioning years ago. And that is a real shame, because these reports are good, taxpayer-funded work that should be more widely known. So I created a syndication feed in Atom:

https://feeds.dltj.org/crs.xml

You can subscribe to that in your feed reader to get updates. I'm also working on a Mastodon bot account that you can follow, as well as automated saving of the report PDFs in the Internet Archive Wayback Machine.
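
The feed generation side is the easy part. Here is a minimal sketch using the Python feedgen library; the library choice and the entry values below are illustrative placeholders for what the real script fills in from the CRS search results:

from feedgen.feed import FeedGenerator

# Build an Atom feed of CRS reports; the values below are hypothetical
# placeholders for what the script parses out of the CRS search results.
fg = FeedGenerator()
fg.id("https://feeds.dltj.org/crs.xml")
fg.title("Congressional Research Service reports")
fg.link(href="https://feeds.dltj.org/crs.xml", rel="self")

fe = fg.add_entry()
fe.id("https://crsreports.congress.gov/product/pdf/IF/IF00000")  # hypothetical
fe.title("Example Report Title")
fe.link(href="https://crsreports.congress.gov/product/pdf/IF/IF00000")
fe.updated("2023-06-01T08:00:00-04:00")  # placeholder timestamp

# Write out the static XML document that feed readers poll
fg.atom_file("crs.xml")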

Some Important Caveats

The CRS website is very resistant to scraping, so I'm having to run this on my home machine (read more below). I'm also querying it for new reports just twice a day (8am and 8pm Eastern U.S. time) to avoid being conspicuous and tripping the bot detectors. The feed is a static XML document updated at those times; no matter how many people subscribe, the CRS won't see increased traffic on their search site. So while I hope to keep it updated, you'll understand if it misses a batch run here or there.
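
For the curious, the twice-a-day schedule is nothing fancier than a cron job on that home machine. A sketch of the crontab entry, where the script path is a made-up placeholder and CRON_TZ assumes a cron implementation (such as cronie) that honors it:

# Build and publish the feed at 8am and 8pm Eastern U.S. time
CRON_TZ=America/New_York
0 8,20 * * * /usr/bin/python3 /home/me/crs-feed/build_feed.py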

Also, hopefully, looking at the website's list of reports only twice a day won't raise flags with them and get my home IP address banned from the service. If the feed stops being updated over an extended time, that is probably why.

There is no tracking embedded in the Atom syndication feed or the links to the CRS reports. I have no way of knowing the number of people subscribing to the feed, nor do I see which reports you click on to read. (I suppose I could set up stats on the AWS CloudFront distribution hosting the feed XML file, but really...what's the point?)
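
(Publishing, for what it's worth, is just an upload of the static file plus a cache invalidation. A rough boto3 sketch, assuming the CloudFront distribution fronts an S3 bucket; the bucket name and distribution ID are hypothetical placeholders:)

import time

import boto3

# Upload the freshly built feed with the right media type...
s3 = boto3.client("s3")
s3.upload_file("crs.xml", "feeds.dltj.org", "crs.xml",
               ExtraArgs={"ContentType": "application/atom+xml"})

# ...and tell CloudFront to stop serving the cached copy
cloudfront = boto3.client("cloudfront")
cloudfront.create_invalidation(
    DistributionId="E1EXAMPLE",  # hypothetical distribution ID
    InvalidationBatch={
        "Paths": {"Quantity": 1, "Items": ["/crs.xml"]},
        "CallerReference": str(time.time()),
    },
)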

How It's Built

If you are not interested in the technology behind how the feed was built, you can stop reading now. If you want to hear more about techniques for overcoming hostile (or poorly implemented) websites, read on. You can also see the source code on GitHub.

Obstacle #1: Browser Detection

The CRS website is a dynamic JavaScript application that goes back and forth with the server to build the contents of its web pages. The server sends nicely formatted JSON documents to the JavaScript running in the browser based on your search parameters. That should make this easy, right? Just bypass the JavaScript front end and parse the JSON output directly.
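
In Python, that naive bypass is only a few lines; here is a hypothetical sketch with the Requests library (spoiler: it does not work, as described next):

import requests

# The direct approach: ask for the search JSON with no browser involved
url = ("https://crsreports.congress.gov/search/results"
       "?term=&r=2203112&orderBy=Date&isFullText=true&")
resp = requests.get(url, headers={"Accept": "application/json"}, timeout=30)
print(resp.status_code)  # 403 -- blocked, rather than the JSON payload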

In fact, you can do this yourself. Go to https://crsreports.congress.gov/search/results?term=&r=2203112&orderBy=Date&isFullText=true& in your browser and see the 15 most recent reports. Try to reach that URL with a program, though, and you'll get back an HTTP 403 error. (In my case, I was using the Python Requests library, as in the sketch above.) I tried everything I could think of. I even tried replaying the curl command line, with all the headers the browser was using, copied from the Firefox web developer tools:

curl -v 'https://crsreports.congress.gov/search/results?term=&r=2203112&orderBy=Date&isFullText=true&' \
  -H 'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:109.0) Gecko/20100101 Firefox/113.0' \
  -H 'Accept: application/json, text/plain, */*' \
  -H 'Accept-Language: en-US,en;q=0.5' \
  -H 'Accept-Encoding: gzip, deflate, br' \
  -H 'Connection: keep-alive' \
  -H 'Referer: https://crsreports.congress.gov/search/' \
  -H 'Sec-Fetch-Dest: empty' \
  -H 'Sec-Fetch-Mode: cors' \
  -H 'Sec-Fetch-Site: same-origin' \
  -H 'Pragma: no-cache' \
  -H 'Cache-Control: no-cache' \
  -H 'TE: trailers'

...and still got denied. So I gave up and used Selenium to run a headless browser to get the JSON content.

And that worked.
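
The heart of that approach is only a few lines. A minimal sketch, assuming headless Chrome (my actual script has more error handling); one quirk is that when Chrome loads a bare JSON URL it wraps the response in an HTML page, so the payload gets pulled back out of a <pre> element:

import json

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

url = ("https://crsreports.congress.gov/search/results"
       "?term=&r=2203112&orderBy=Date&isFullText=true&")

# A real (if invisible) browser satisfies the site's JavaScript checks
options = Options()
options.add_argument("--headless=new")  # assumes a recent Chrome
driver = webdriver.Chrome(options=options)
try:
    driver.get(url)
    # Chrome renders the raw JSON response inside a <pre> element
    results = json.loads(driver.find_element(By.TAG_NAME, "pre").text)
finally:
    driver.quit()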

Obstacle #2: Cloudflare Bot Detection

So with the headless browser, I got this working on my local machine. That isn't really convenient, though...even though my computer is on during most working hours, something like this should run on a server in the cloud, and AWS Lambda is ideal for it. So I took a detour to learn about the Headless Chrome AWS Lambda Layer (for Python). This is a technique for running Chrome on a server, just like I was doing on my local machine.
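
The Lambda variant differs mainly in pointing Selenium at the binaries the layer provides. A sketch of the handler, assuming the layer ships Chrome at /opt/headless-chromium and the driver at /opt/chromedriver (those paths are assumptions and vary by layer version):

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service

def handler(event, context):
    # The /opt paths come from the Lambda layer; adjust for your layer
    options = Options()
    options.binary_location = "/opt/headless-chromium"
    options.add_argument("--headless")
    options.add_argument("--no-sandbox")            # Lambda provides no sandbox
    options.add_argument("--single-process")
    options.add_argument("--disable-dev-shm-usage") # /dev/shm is tiny on Lambda
    driver = webdriver.Chrome(service=Service("/opt/chromedriver"),
                              options=options)
    try:
        ...  # same fetch-and-parse as the local version above
    finally:
        driver.quit()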

So I got the code working on AWS Lambda. It was a nice bit of work...I was pleased to learn about a new AWS skill (Layers for Lambda). But I hit another wall...this time at Cloudflare, a content distribution network that sits in front of the CRS website with protections to stop bots like mine from doing what I was trying to do. Instead of the JSON response, I got Cloudflare's HTML page asking me to solve a captcha to prove my bot's humanness. And look...I love y'all, but I won't be answering captcha challenges twice a day to get the report syndication feed published.

So after all of that, I decided to just run the code locally. If you know of something I missed that could bypass obstacles 1 and 2 (and won't get the FBI knocking at my door), please let me know.