Congressional Research Service Syndication Feed

4 minute read

One of the hidden gems of the Library of Congress is the Congressional Research Service (CRS). With a staff of about 600 researchers, analysts, and writers, the CRS provides “policy and legal analysis to committees and Members of both the House and Senate, regardless of party affiliation.” It is kind of like a “think tank” for the members of Congress. And an extensive selection of their reports is available from the CRS homepage; as government publications, the reports are not subject to copyright, so any CRS Report may be reproduced and distributed without permission. And they publish a lot of reports. (Read more on the CRS frequently-asked-questions page.)

I remember learning about the CRS in library school, but what got me interested in them again was a post on Mastodon about an Introduction to Cryptocurrency report that they produced. At just 2 pages long, it was a concise yet thorough review of the topic, ranging from how cryptocurrencies work to questions of regulation. Useful stuff! And that wasn’t the only useful report I (re-)discovered on the site.

An Automated Syndication Feed

The problem is that no automated RSS/Atom feed of CRS reports exists. Use your favorite search engine to look for “Congressional Research Service RSS or Atom”; you’ll find a few attempts to gather selected reports or comprehensive archives that stopped functioning years ago. And that is a real shame because these reports are good, taxpayer-funded work that should be more widely known. So I created a syndication feed in Atom:

https://feeds.dltj.org/crs.xml

You can subscribe to that in your feed reader to get updates. I’m also working on a Mastodon bot account that you can follow, as well as automated saving of report PDFs to the Internet Archive Wayback Machine.

Some Important Caveats

The CRS website is very resistant to scraping, so I’m having to run this on my home machine (read more below). I’m also querying it for new reports just twice a day (8am and 8pm Eastern U.S. time) to avoid being conspicuous and tripping the bot detectors. The feed is a static XML document updated at those times; no matter how many people subscribe, the CRS won’t see increased traffic on their search site. So while I hope to keep it updated, you’ll understand if it misses a batch run here or there.

Hopefully, checking the website’s list of reports only twice a day won’t raise flags with them and get my home IP address banned from the service. If the feed stops being updated over an extended time, that is probably why.

There is no tracking embedded in the Atom syndication feed or the links to the CRS reports. I have no way of knowing the number of people subscribing to the feed, nor do I see which reports you click on to read. (I suppose I could set up stats on the AWS CloudFront distribution hosting the feed XML file, but really…what’s the point?)

How It’s Built

If you are not interested in the technology behind how the feed was built, you can stop reading now. If you want to hear more about techniques for overcoming hostile (or poorly implemented) websites, read on. You can also see the source code on GitHub.

Obstacle #1: Browser Detection

The CRS website is a dynamic JavaScript application that goes back and forth with the server to build the contents of its pages. The server sends nicely formatted JSON documents to the JavaScript running in the browser based on your search parameters. That should make this easy, right? Just bypass the JavaScript front end and parse the JSON output directly.
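A minimal sketch of that direct approach, using the Python Requests library against the site’s search endpoint, shows the problem:

import requests

# The same URL the site's JavaScript calls to fetch the newest reports
SEARCH_URL = (
    "https://crsreports.congress.gov/search/results"
    "?term=&r=2203112&orderBy=Date&isFullText=true&"
)

response = requests.get(SEARCH_URL)
print(response.status_code)  # 403 -- the server turns away non-browser clients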

In fact, you can do this yourself. Go to https://crsreports.congress.gov/search/results?term=&r=2203112&orderBy=Date&isFullText=true& in your browser and see the 15 most recent reports. Try to reach that URL with a program like the sketch above, though, and you’ll get back an HTTP 403 error. And I tried everything I could think of. I even tried copying the curl command line, with the headers the browser was using, from the Firefox web developer tools:

curl -v 'https://crsreports.congress.gov/search/results?term=&r=2203112&orderBy=Date&isFullText=true&' \
  -H 'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:109.0) Gecko/20100101 Firefox/113.0' \
  -H 'Accept: application/json, text/plain, */*' \
  -H 'Accept-Language: en-US,en;q=0.5' \
  -H 'Accept-Encoding: gzip, deflate, br' \
  -H 'Connection: keep-alive' \
  -H 'Referer: https://crsreports.congress.gov/search/' \
  -H 'Sec-Fetch-Dest: empty' \
  -H 'Sec-Fetch-Mode: cors' \
  -H 'Sec-Fetch-Site: same-origin' \
  -H 'Pragma: no-cache' \
  -H 'Cache-Control: no-cache' \
  -H 'TE: trailers'

…and still got denied. So I gave up and used Selenium to run a headless browser to get the JSON content.
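In outline, the headless-browser fetch looks something like this minimal sketch (I’m assuming Chrome here; the real code is in the GitHub repository):

import json

from selenium import webdriver
from selenium.webdriver.common.by import By

SEARCH_URL = (
    "https://crsreports.congress.gov/search/results"
    "?term=&r=2203112&orderBy=Date&isFullText=true&"
)

options = webdriver.ChromeOptions()
options.add_argument("--headless=new")

driver = webdriver.Chrome(options=options)
try:
    # A real browser passes the server's checks; Chrome renders the raw
    # JSON response as the page's body text
    driver.get(SEARCH_URL)
    results = json.loads(driver.find_element(By.TAG_NAME, "body").text)
finally:
    driver.quit()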

And that worked.
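From there, turning the parsed JSON into the Atom document is the easy part. Here is a minimal sketch using the feedgen library; the JSON field names below are illustrative guesses at the schema, not documented names:

from datetime import datetime, timezone

from feedgen.feed import FeedGenerator

FEED_URL = "https://feeds.dltj.org/crs.xml"

fg = FeedGenerator()
fg.id(FEED_URL)
fg.title("Congressional Research Service Reports")
fg.link(href=FEED_URL, rel="self")
fg.updated(datetime.now(timezone.utc))

# "results" is the parsed JSON from the headless-browser fetch above;
# "SearchResults", "Title", and "URL" are stand-ins for the real fields
for report in results["SearchResults"]:
    entry = fg.add_entry()
    entry.id(report["URL"])
    entry.title(report["Title"])
    entry.link(href=report["URL"])
    entry.updated(datetime.now(timezone.utc))  # ideally the report's own date

fg.atom_file("crs.xml")  # the static XML document that gets served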

Obstacle #2: Cloudflare Bot Detection

So with the headless browser, I got this working on my local machine. That isn’t really convenient, though…even though my computer is on during most working hours, something like this should run on a server in the cloud. Something like AWS Lambda is ideal. So I took a detour to learn about the Headless Chrome AWS Lambda Layer (for Python). This is a technique for running Chrome on a server, just like I was doing on my local machine.
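To give a feel for the shape of it, a Lambda handler wired to such a layer looks roughly like this (a sketch only; the binary paths and Chrome flags depend on the particular layer, and the values here are illustrative guesses):

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

SEARCH_URL = (
    "https://crsreports.congress.gov/search/results"
    "?term=&r=2203112&orderBy=Date&isFullText=true&"
)

def handler(event, context):
    options = webdriver.ChromeOptions()
    options.binary_location = "/opt/headless-chromium"  # unpacked by the layer; path varies
    options.add_argument("--headless")
    options.add_argument("--no-sandbox")             # Chrome's own sandbox won't start on Lambda
    options.add_argument("--single-process")
    options.add_argument("--disable-dev-shm-usage")  # /dev/shm is tiny on Lambda
    driver = webdriver.Chrome(
        service=Service(executable_path="/opt/chromedriver"),  # also provided by the layer
        options=options,
    )
    try:
        driver.get(SEARCH_URL)
        return driver.find_element(By.TAG_NAME, "body").text
    finally:
        driver.quit()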

So I got the code working on AWS Lambda. It was a nice bit of work…I was pleased to pick up a new AWS skill (Layers for Lambda). But I hit another wall…this time at Cloudflare, the content delivery network that sits in front of the CRS website with protections against bots like mine. Instead of the JSON response, I got Cloudflare’s HTML page asking me to solve a captcha to prove my bot’s humanness. And look…I love y’all, but I won’t be answering captcha challenges twice a day to get the report syndication feed published.

So after all of that, I decided to just run the code locally. If you know of something I missed that could bypass obstacles 1 and 2 (and won’t get the FBI knocking at my door), please let me know.