Thursday Threads: RDA Revolt, Google Book Search Algorithm, Google Helps Improve Web Servers, Google's Internet Traffic Hugeness
This week is a mostly Google edition of DLTJ Thursday Threads. Below are a high-level overview of Google's Book Search algorithm, a look at how Google is helping web servers speed up the delivery of content, and a note on how Google's internet traffic is growing as a percentage of all internet traffic. But first, there is an uprising over the RDA test records in the WorldCat database.
Feel free to send this newsletter to others you think might be interested in the topics. If you are not already subscribed to DLTJ's Thursday Threads, visit the sign-up page. If you would like a more raw and immediate version of these types of stories, follow me on Mastodon where I post the bookmarks I save. Comments and tips, as always, are welcome.
Memorandum Against RDA Test
We have found ourselves in the unenviable position of opposing work that has supposedly been authorized by agencies representing our interests. I might compare it to a military coup d’état. I mean here the RDA “test” and its implications for the cataloging world at large. After extensive discussions on the PCC and OCLC cataloging e-mail lists, with opinions from the British Library, Australia, and North America, we can safely conclude that there is a broad consensus against the principles of RDA and the way the RDA “test” has been imposed on the cataloging world.
The original post on the OCLC-CAT list by Wojciech Siemaszkiewicz of the New York Public Library is behind a must-subscribe-and-authenticate form, but it has been copied out to an open website by Becky Yoose (thanks, Becky!). The subsequent discussion resulted in a Petition against the RDA Test by Jacqueline Byrd at Indiana University. The link to the petition has been posted to the open AUTOCAT list, and there has been subsequent discussion there. (Hat tip to Kirsten Davis.)
Inside the Google Books Algorithm
Rich Results is the latest in a series of smaller front-end tweaks that have been matched by backend improvements. Now, the book search algorithm takes into account more than 100 "signals," individual data categories that Google statistically integrates to rank your results. When you search for a book, Google Books doesn't just look at word frequency or how closely your query matches the title of a book. They now take into account web search frequency, recent book sales, the number of libraries that hold the title, and how often an older book has been reprinted.
Alexis Madrigal's article in TheAtlantic.com draws a comparison between the techniques and algorithms used for web search and those used for book materials. The need for relevant search results is the same, but books don't have the same inter-page linking hints that drive the PageRank algorithm for web search. The use of anonymized circulation data in creating clustered bibliographic descriptions was mentioned at the ALA Midwinter ALCTS Forum on Mashups of Bibliographic Data, and apparently it is also used in the relevance ranking of Google Books search results. (Hat tip to Ron Murray.)
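Google has not published how those hundred-plus signals are combined, but a weighted combination of normalized scores is the usual mental model for this kind of ranking. The sketch below is purely illustrative: the signal names, values, and weights are invented here and are not taken from Google Books.

```python
# Illustrative only: Google has not published its Book Search ranking
# formula. This toy function assumes the simplest approach, a weighted
# sum of normalized signals. Signal names, values, and weights are invented.

def relevance_score(signals, weights):
    """Combine normalized (0.0-1.0) signal values into one ranking score."""
    return sum(weights.get(name, 0.0) * value for name, value in signals.items())

book_signals = {
    "text_match": 0.82,            # how well the query matches the book's text
    "web_search_frequency": 0.40,  # how often people search the web for it
    "recent_sales": 0.15,
    "library_holdings": 0.67,      # scaled count of libraries holding the title
    "reprint_count": 0.30,         # how often an older book has been reprinted
}

weights = {
    "text_match": 0.50,
    "web_search_frequency": 0.20,
    "recent_sales": 0.10,
    "library_holdings": 0.15,
    "reprint_count": 0.05,
}

print(f"{relevance_score(book_signals, weights):.2f}")  # 0.62 for this example
```

Documents are then sorted by this score; the interesting (and undisclosed) part is how Google tunes the weights and normalizes signals as different as sales figures and library holdings.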
Google Releases mod_pagespeed
mod_pagespeed is an open-source Apache module that automatically optimizes web pages and resources on them. It does this by rewriting the resources using filters that implement web performance best practices. Webmasters and web developers can use mod_pagespeed to improve the performance of their web pages when serving content with the Apache HTTP Server. mod_pagespeed includes several filters that optimize JavaScript, HTML and CSS stylesheets. It also includes filters for optimizing JPEG and PNG images. The filters are based on a set of best practices known to enhance web page performance. Webmasters who set up mod_pagespeed in addition to configuring proper caching and compression on their Apache distribution should expect to see an improvement in the loading time of the pages on their websites.
For a number of years, Google has promoted best practices for improving the rate at which web pages load. This week it introduced mod_pagespeed: an Apache web server module that brings these practices to bear by rewriting HTML, JavaScript, and Cascading Style Sheets on the fly. Since Google now includes the speed at which pages are rendered in a browser as a factor in ranking search results, this would seem to be a good module to explore for anyone running an Apache web server with public content. (Hat tip to Ed Summers.)
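As a concrete starting point, a minimal Apache configuration for the module might look like the sketch below. The module path, cache directory, and filter selection are examples rather than recommendations, so check them against the mod_pagespeed documentation for your platform.

```apache
# Minimal mod_pagespeed configuration sketch. The module path, cache
# directory, and filter list are examples and may differ on your system;
# consult the mod_pagespeed documentation before relying on them.
LoadModule pagespeed_module /usr/lib/apache2/modules/mod_pagespeed.so

<IfModule pagespeed_module>
    ModPagespeed on
    # Where rewritten resources are cached on disk
    ModPagespeedFileCachePath "/var/cache/mod_pagespeed/"
    # A few of the filters that implement the best practices described above
    ModPagespeedEnableFilters combine_css,collapse_whitespace,rewrite_images
</IfModule>
```

Because the rewriting happens at serve time, no changes to the site's source files are required, which is what makes it attractive for existing library and archive sites running on Apache.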
Google Sets Internet Traffic Record
Google now represents an average 6.4% of all Internet traffic around the world. This number grows even larger (to as much as 8-12%) if I include estimates of traffic offloaded by the increasingly common Google Global Cache (GGC) deployments and error in our data due to the extremely high degree of Google edge peering with consumer networks. Keep in mind that these numbers represent increased market share — Google is growing considerably faster than overall Internet volumes which are already increasing 40-45% each year.
Craig Labovitz of Arbor Networks notes that if Google were an internet service provider, it would now be "the second largest carrier on the planet." Wow! That is a lot of data sloshing around on its own internal network!
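The point about market share is worth a quick back-of-the-envelope check: if Google's share of traffic is rising while total traffic itself grows 40-45% a year, Google's own traffic must be growing faster still. The starting share in the sketch below is an invented figure for illustration; only the 6.4% share and the 40-45% overall growth come from the Arbor Networks report.

```python
# Back-of-the-envelope: assume total internet traffic grew 40% over the year
# (the low end of the 40-45% figure) and Google's share rose to 6.4% from an
# assumed 5.0% a year earlier (the 5.0% is illustrative, not from the report).
total_growth = 0.40
share_before, share_after = 0.050, 0.064

google_growth = share_after * (1 + total_growth) / share_before - 1
print(f"Implied growth in Google's own traffic: {google_growth:.0%}")  # ~79%
```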
The text was modified to update a link from http://dltj.org/wp-content/uploads/2010/11/wa.exe?A2=ind1011a&L=oclc-cat&D=0&F=P&T=0&X=5D7D8800A8770C99F0&P=2298 to http://listserv.oclc.org/scripts/wa.exe?A2=ind1011a&L=oclc-cat&D=0&F=P&T=0&X=5D7D8800A8770C99F0&P=2298 on August 22nd, 2012.
The text was modified to update a link from http://dltj.org/wp-content/uploads/2010/11/wa.exe?A2=ind1011a&L=oclc-cat&D=0&F=P&T=0&X=5D7D8800A8770C99F0&P=2298 to http://listserv.oclc.org/scripts/wa.exe?A2=ind1011a&L=oclc-cat&D=0&F=P&T=0&X=5D7D8800A8770C99F0&P=2298 on September 26th, 2013.