Thursday Threads: RDA Revolt, Google Book Search Algorithm, Google Helps Improve Web Servers, Google’s Internet Traffic Hugeness

4 minute read

This article was imported from this blog's previous content management system (WordPress), and may have errors in formatting and functionality. If you find these errors are a significant barrier to understanding the article, please let me know.


This week is a mostly Google edition of DLTJ Thursday Threads. Below is a high-level overview of Google's Book Search algorithm, how Google is helping web servers improve the speed at which content loads, and how Google's internet traffic is growing as a percentage of all internet traffic. But first, there is an uprising on the RDA test records in the WorldCat database.

If you find these interesting and useful, you might want to add the Thursday Threads RSS Feed to your feed reader or subscribe to e-mail delivery using the form to the right. If you would like a more raw and immediate version of these types of stories, watch my FriendFeed stream (or subscribe to its feed in your feed reader). Comments and tips, as always, are welcome.

Memorandum Against RDA Test

We have found ourselves in an unenviable position of opposing the work that supposedly has been authorized by agencies representing our interests. I might compare it to a military coup d’état. I mean here the RDA “test” and its implications on the cataloging world at large. After extensive discussions on the PCC, OCLC cataloging e-mail lists with opinions from the British Library, Australia and North America, we can safely conclude that there is a broad consensus against principles of RDA and the way RDA “test” has been imposed on the cataloging world.

The original post on the OCLC-CAT list by Wojciech Siemaszkiewicz of the New York Public Library is behind a must-subscribe-and-authenticate form, but it has been copied to an open website by Becky Yoose (thanks, Becky!). The subsequent discussion resulted in a Petition against the RDA Test by Jacqueline Byrd at Indiana University. The link to the petition has been posted to the open AUTOCAT list, and there has been subsequent discussion there. (Hat tip to Kirsten Davis.)

Inside the Google Books Algorithm

Rich Results is the latest in a series of smaller front-end tweaks that have been matched by backend improvements. Now, the book search algorithm takes into account more than 100 "signals," individual data categories that Google statistically integrates to rank your results. When you search for a book, Google Books doesn't just look at word frequency or how closely your query matches the title of a book. They now take into account web search frequency, recent book sales, the number of libraries that hold the title, and how often an older book has been reprinted.

Alexis Madrigal's article draws a comparison between the techniques and algorithms used for web search and those used for book materials. The need for relevant search results is the same, but books don't have the inter-page linking hints that drive the PageRank algorithm for web search. The use of anonymized circulation data in creating clustered bibliographic descriptions was mentioned at the ALA Midwinter ALCTS Forum on Mashups of Bibliographic Data, and apparently it is also used in the relevance ranking of Google Books search results. (Hat tip to Ron Murray.)
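To make the "more than 100 signals" idea concrete, here is a minimal sketch of multi-signal relevance ranking. The signal names, weights, and linear combination are invented for illustration; Google has not published its actual signals or how it weights them.

```python
# Hypothetical multi-signal ranking sketch. Each signal is assumed to be
# normalized to the 0..1 range; the weights here are made up for the example.

def score(book, weights):
    """Combine a book's normalized signals into a single relevance score."""
    return sum(weights[name] * book.get(name, 0.0) for name in weights)

weights = {
    "query_match": 0.4,        # how closely the query matches the book's text
    "web_search_freq": 0.2,    # how often the title is searched on the web
    "recent_sales": 0.15,      # recent book sales
    "library_holdings": 0.15,  # number of libraries holding the title
    "reprint_count": 0.1,      # how often an older book has been reprinted
}

books = [
    {"title": "A", "query_match": 0.9, "web_search_freq": 0.2,
     "recent_sales": 0.1, "library_holdings": 0.3, "reprint_count": 0.0},
    {"title": "B", "query_match": 0.7, "web_search_freq": 0.8,
     "recent_sales": 0.6, "library_holdings": 0.9, "reprint_count": 0.5},
]

# Book B wins despite a weaker textual match, because the non-textual
# signals (sales, holdings, reprints) pull it up -- the point of the article.
ranked = sorted(books, key=lambda b: score(b, weights), reverse=True)
print([b["title"] for b in ranked])
```

The interesting consequence, as the article notes, is that a book with a weaker textual match can still outrank a closer match if its other signals are strong.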

Google Releases mod_pagespeed

mod_pagespeed is an open-source Apache module that automatically optimizes web pages and resources on them. It does this by rewriting the resources using filters that implement web performance best practices. Webmasters and web developers can use mod_pagespeed to improve the performance of their web pages when serving content with the Apache HTTP Server. mod_pagespeed includes several filters that optimize JavaScript, HTML and CSS stylesheets. It also includes filters for optimizing JPEG and PNG images. The filters are based on a set of best practices known to enhance web page performance. Webmasters who set up mod_pagespeed in addition to configuring proper caching and compression on their Apache distribution should expect to see an improvement in the loading time of the pages on their websites.

Google has promoted best practices for improving the rate at which web pages load for a number of years. This week they introduced mod_pagespeed: an Apache web server module that brings these practices to bear by rewriting HTML, JavaScript, and Cascading Style Sheets on-the-fly. Since Google now includes the speed at which pages are rendered in a browser as a factor in ranking search results, this would seem to be a good module to explore for anyone running an Apache web server with public content. (Hat tip to Ed Summers.)
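For anyone curious what "exploring the module" looks like, a minimal Apache configuration sketch follows. The module path and the particular filters chosen are illustrative assumptions; check the mod_pagespeed documentation for the filter names and defaults in your version.

```apache
# Load the module (path varies by distribution -- an assumption here).
LoadModule pagespeed_module modules/mod_pagespeed.so

# Turn rewriting on and enable a few of the filters described above.
ModPagespeed on
ModPagespeedEnableFilters combine_css,rewrite_css
ModPagespeedEnableFilters rewrite_javascript
ModPagespeedEnableFilters rewrite_images
```

After restarting Apache, the module rewrites HTML, CSS, JavaScript, and images on the fly as pages are served.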

[Figure: Google as a percentage of all internet traffic — a graph showing a rising percentage from roughly one percent in June 2007 to six percent in October 2010]

Google Sets Internet Traffic Record

Google now represents an average 6.4% of all Internet traffic around the world. This number grows even larger (to as much as 8-12%) if I include estimates of traffic offloaded by the increasingly common Google Global Cache (GGC) deployments and error in our data due to the extremely high degree of Google edge peering with consumer networks. Keep in mind that these numbers represent increased market share — Google is growing considerably faster than overall Internet volumes which are already increasing 40-45% each year.

Craig Labovitz of Arbor Networks notes that if Google were an internet service provider, it would now be "the second largest carrier on the planet." Wow! That is a lot of data sloshing around on its own internal network!

The text was modified to update a link from to on August 22nd, 2012.

The text was modified to update a link from to on September 26th, 2013.