Thursday Threads: History and How-To of Search, DPLA Update, Searching for Jim Gray

 · Peter Murray
Last updated: July 08, 2011

Ahhhh -- with the annual meeting of the American Library Association out of the way and two major holidays (Canada Day and U.S. Independence Day) behind us, the summer can now start. My formal vacation comes next month, and I haven't yet decided what to do with DLTJ Thursday Threads during that week. While I sort that out, take a look at this week's threads: a book chapter describing the history and how-to of web search, pointers to a textual and video update on the DPLA project, and an article that examines the efforts to rescue noted computer science professor Jim Gray.

Feel free to send this newsletter to others you think might be interested in the topics. If you are not already subscribed to DLTJ's Thursday Threads, visit the sign-up page. If you would like a more raw and immediate version of these types of stories, follow me on Mastodon where I post the bookmarks I save. Comments and tips, as always, are welcome.

Twenty-four pages on the history and how-to of web search

Abstract: "In this chapter, we describe the key indexing components of today’s web search engines. As the World Wide Web has grown, the systems and methods for indexing have changed significantly. We present the data structures used, the features extracted, the infrastructure needed, and the options available for designing a brand new search engine. We highlight techniques that improve relevance of results, discuss trade-offs to best utilize machine resources, and cover distributed processing concepts in this context. In particular, we delve into the topics of indexing phrases instead of terms, storage in memory vs. on disk, and data partitioning. We will finish with some thoughts on information organization for the newly emerging data-forms."

- Indexing the World Wide Web: The Journey So Far, Abhishek Das and Ankit Jain, Next Generation Search Engines: Advanced Models for Information Retrieval, 2011 (to appear).

If you are interested at all in how web search engines work, I highly recommend reading this article -- or at least skimming it. A rough breakdown: 3 pages on the history of web search firms and techniques; 4 pages on indexing techniques; 8 pages on storing/retrieving the index; 3 pages on scaling concerns; 2 pages on relevancy signals; and 2 pages on future topics. You'll learn about inverted indexes, the differences between term-based indexing and phrase-based indexing, and the foundational elements of how user queries are answered.
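To make the inverted-index idea concrete, here is a minimal sketch in Python. It is not code from the chapter; the document and function names are illustrative, and it shows only the simplest term-based case the chapter describes: map each term to a postings list of document IDs, then answer a multi-term query by intersecting those lists.

```python
from collections import defaultdict

def build_index(documents):
    """Build a toy inverted index: term -> set of doc IDs (a postings list).

    documents: dict mapping doc_id -> text
    """
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search(index, query):
    """Answer a conjunctive (AND) query by intersecting the postings
    lists of its terms."""
    postings = [index.get(term, set()) for term in query.lower().split()]
    if not postings:
        return set()
    return set.intersection(*postings)

# Hypothetical example documents
docs = {
    1: "the quick brown fox",
    2: "the lazy brown dog",
    3: "a quick red fox",
}
index = build_index(docs)
print(search(index, "quick fox"))  # {1, 3}
print(search(index, "brown"))      # {1, 2}
```

A real engine, as the chapter explains, layers much more on top of this: phrase- and position-aware postings, compression, partitioning the index across machines, and relevance ranking of the matching documents rather than returning a bare set.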

July 1st update on the Digital Public Library of America from John Palfrey

Update from John Palfrey (9 minutes)

I wanted to share with you an emerging sense of what the DPLA might be, based on discussions on this list, on the wiki, in various blogs, and in the couple of Steering Committee and related meetings (the notes from all of which we've now posted). I was prompted to write up this short statement (which we can work into a new "concept note" and workplan shortly) by Karen Coyle's recent blog post, which made me think about the need to keep saying what the DPLA could be and what it is not, as these things become increasingly apparent.

The answer to "what is the DPLA?" is still "a work in progress," but I think several things are coming into relief. I wanted to try out the description below to see what people thought. I've also posted a video form of roughly the same thing on YouTube for those who prefer video presentation of ideas. [Light editing to turn raw links into hyperlinks.]

- What is the DPLA?, 1-Jul-2011 posting by John Palfrey to the DPLA-Discussion mailing list

Work on the Digital Public Library of America marches on, and although one still can't be sure what will come out of it, the steering committee is planting some flags in the ground that give form to the project. John Palfrey's message says that "the DPLA will consist of five elements: code, metadata, content, tools-and-services, and community," and each element gets a paragraph outlining what it will entail.

Searching for Jim Gray: A Technical Overview

On Sunday January 28, 2007, noted computer scientist Jim Gray disappeared at sea in his sloop Tenacious. He was sailing singlehanded, with plans to scatter his mother's ashes near the Farallon Islands, some 27 miles outside San Francisco's Golden Gate. As news of Gray's disappearance spread through his social network, his friends and colleagues began discussing ways to mobilize their skills and resources to help authorities locate Tenacious and rescue Gray. That discussion evolved over days and weeks into an unprecedented civilian search-and-rescue (SAR) exercise involving satellites, private planes, automated image analysis, ocean current simulations, and crowdsourced human computing, in collaboration with the U.S. Coast Guard. The team that emerged included computer scientists, engineers, graduate students, oceanographers, astronomers, business leaders, venture capitalists, and entrepreneurs, many of whom had never met one another before. There was ample access to funds, technology, organizational skills and know-how, and a willingness to work round the clock.

Even with these advantages, the odds of finding Tenacious were never good. On February 16, 2007, in consultation with the Coast Guard and Gray's family, the team agreed to call off the search. Tenacious remains lost to this day, despite a subsequent extensive underwater search of the San Francisco coastline.

- Searching for Jim Gray: A Technical Overview, by Joseph M. Hellerstein and David L. Tennenhouse, Communications of the ACM, July 2011

The above quotation is the first two paragraphs of a behind-the-scenes view of the ad hoc, crowd-sourced, cross-disciplinary search-and-rescue effort to find Jim Gray. As the last sentence of the quotation notes, Professor Gray was not found. The article shows the depth of the spontaneous, internet-driven effort to find him, and what was learned in retrospect about such efforts.