Thursday Threads: Amazon Pressures Publishers, Academic Spam, Mechanical Turk Spam, Multispectral Imaging


With the close of the year approaching, this issue marks the 14th week of DLTJ Thursday Threads. This issue has a publisher's view of Amazon's strong-arm tactics in book pricing, research into the possibility that academic authors could game Google Scholar with spam, demonstrations of how Amazon's Mechanical Turk drives down the cost of enlisting humans to overwhelm anti-spam systems, and a story of multispectral imaging adding information in the process of digital preservation.

As the new year approaches, I wish you the best professionally and personally.

Feel free to send this newsletter to others you think might be interested in the topics. If you are not already subscribed to DLTJ's Thursday Threads, visit the sign-up page. If you would like a more raw and immediate version of these types of stories, follow me on Mastodon where I post the bookmarks I save. Comments and tips, as always, are welcome.

Books After Amazon

What happens when an industry concerned with the production of culture is beholden to a company with the sole goal of underselling competitors? Amazon is indisputably the king of books, but the issue remains, as Charlie Winton, CEO of the independent publisher Counterpoint Press puts it, “what kind of king they’re going to be.” A vital publishing industry must be able to take chances with new authors and with books that don’t have obvious mass-market appeal. When mega-retailers have all the power in the industry, consumers benefit from low prices, but the effect on the future of literature—on what books can be published successfully—is far more in doubt.
Books After Amazon, Onnesha Roychoudhuri, Boston Review, 1-Nov-2011

Onnesha Roychoudhuri publishes this view of Amazon's marketing practices in the latest issue of the Boston Review. From the publisher's perspective, the strong-arm tactics described sound horrible. But the story also points to cracks appearing -- at least for the bigger publishers. That may leave smaller, independent publishers in a big squeeze. [Via OCLC Research's Above-the-Fold]

Academic Search Engine Spam and Google Scholar's Resilience Against it

Abstract: In a previous paper we provided guidelines for scholars on optimizing research articles for academic search engines such as Google Scholar. Feedback in the academic community to these guidelines was diverse. Some were concerned researchers could use our guidelines to manipulate rankings of scientific articles and promote what we call ‘academic search engine spam’. To find out whether these concerns are justified, we conducted several tests on Google Scholar. The results show that academic search engine spam is indeed—and with little effort—possible: We increased rankings of academic articles on Google Scholar by manipulating their citation counts; Google Scholar indexed invisible text we added to some articles, making papers appear for keyword searches the articles were not relevant for; Google Scholar indexed some nonsensical articles we randomly created with the paper generator SciGen; and Google Scholar linked to manipulated versions of research papers that contained a Viagra advertisement. At the end of this paper, we discuss whether academic search engine spam could become a serious threat to Web-based academic search engines.
Academic Search Engine Spam and Google Scholar's Resilience Against it, Journal of Electronic Publishing, Dec-2010, https://doi.org/10.3998/3336451.0013.305

Joeran Beel and Bela Gipp have this article in the most recent issue of the Journal of Electronic Publishing. In addition to showing that Google Scholar can be gamed, the authors note that Microsoft Academic Search and CiteSeer (as well as their own academic search engine currently under development -- SciPlore) have the same issues. Although such manipulation is possible, we don't know whether it is actually being done -- or whether there would be any penalties in the academic community for doing so.

Mechanical Turk: Now with 40.92% spam

At this point, Amazon Mechanical Turk has reached the mainstream. Pretty much everyone knows about the concept. Post small tasks online, pay people cents, and get thousands of micro-tasks completed. Unfortunately, this resulted in some unfortunate trends. Anyone who frequents just a little bit the market will notice the tremendous number of spammy HITs. (HIT = a task posted for completion in the market; stands for Human Intelligence Task). "Test if the ads in my website work". "Create a Twitter account and follow me". "Like my YouTube video". "Download this app". "Write a positive review on Yelp". A seemingly endless amount of spam HITs come to the market, mainly with the purpose of spamming "social media" metrics. So, with Dahn Tamir and Priya Kanth (MS student at NYU), we decided to examine how big is the problem. How many spammers join the market? How many spam HITs are there?
Mechanical Turk: Now with 40.92% spam, A Computer Scientist in a Business School, 16-Dec-2010

This post from Panos Ipeirotis, Associate Professor in the IOMS Department at New York University's Stern School of Business, describes a review of activities posted to Amazon's Mechanical Turk service. Spam is everywhere, and it appears that Mechanical Turk is reducing the friction between those buying spam activity and the workers who carry it out. [Via Ron Murray]
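
To give a sense of just how little friction there is, here is a minimal sketch of posting a HIT with today's boto3 SDK against the Mechanical Turk requester sandbox. This SDK postdates the article, and the task title, question HTML, and reward amount are illustrative assumptions, not details from the post.

```python
# A minimal sketch, assuming the boto3 SDK and AWS credentials configured for
# the Mechanical Turk *sandbox*; the title, reward, and question HTML below
# are illustrative, not taken from the article.
import boto3

mturk = boto3.client(
    "mturk",
    region_name="us-east-1",
    endpoint_url="https://mturk-requester-sandbox.us-east-1.amazonaws.com",
)

# An HTMLQuestion wraps an ordinary web form that workers fill in.
question_xml = """<HTMLQuestion xmlns="http://mechanicalturk.amazonaws.com/AWSMechanicalTurkDataSchemas/2011-11-11/HTMLQuestion.xsd">
  <HTMLContent><![CDATA[
    <html><head>
      <script src="https://assets.crowd.aws/crowd-html-elements.js"></script>
    </head><body>
      <crowd-form>
        <p>Visit example.com and report whether the banner ad loads.</p>
        <input name="ad_loaded" placeholder="yes or no" required/>
      </crowd-form>
    </body></html>
  ]]></HTMLContent>
  <FrameHeight>450</FrameHeight>
</HTMLQuestion>"""

hit = mturk.create_hit(
    Title="Check whether an ad loads",
    Description="Visit a web page and report whether its ad displays",
    Reward="0.02",                     # two cents per assignment
    MaxAssignments=100,
    AssignmentDurationInSeconds=300,   # five minutes to complete each one
    LifetimeInSeconds=86400,           # HIT stays on the market for a day
    Question=question_xml,
)
print(hit["HIT"]["HITId"])
```

A few lines of code and a few cents per assignment is all it takes, which is exactly the low barrier to entry the post is worried about.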

Cutting-Edge Imaging Helps Scholar Reveal 8th-Century Manuscript

With a manuscript like the St. Chad Gospels, multispectral imaging—a series of scans, each based on a single part of the color spectrum—allows his team to create images that have the equivalent of three-dimensional detail, down to revealing the thickness of brush strokes on letters and illustrations. Cockled pages can be virtually flattened out so that all their details can be studied. Studied color band by color band, the chemical composition of ink can be determined.
21st-Century Imaging Helps Scholars Reveal Rare 8th-Century Manuscript, Chronicle of Higher Education, 5-Dec-2010

This article by Jennifer Howard at the Chronicle of Higher Education reviews the story of how 8th-century documents in England were digitized by scholars at the University of Kentucky. It caught my eye because of the mention of multispectral imaging; this is something that the JPEG2000 file format can natively store. Digitization at this level doesn't just provide alternative, online access to documents -- it actually adds new information to the process of researching those documents. [Note: the link is behind a publisher paywall. If you would like to see it, send me an e-mail and I'll forward you a short-term link from the Chronicle's website.]
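
That JPEG2000 point deserves a small illustration. Here is a hedged sketch -- assuming the glymur Python bindings for OpenJPEG and a hypothetical file name -- of how a multi-band capture can be read back from a single JPEG2000 file, since the format allows many image components rather than just three color channels:

```python
# A minimal sketch, assuming the "glymur" Python bindings for OpenJPEG and a
# hypothetical file name; JPEG2000 codestreams may carry many components, so
# every spectral band of a scan can live in the same preservation file.
import glymur

jp2 = glymur.Jp2k("chad_gospels_page.jp2")  # hypothetical multispectral scan
bands = jp2[:]          # decode to a NumPy array: rows x cols x components
print(bands.shape)      # e.g. (4000, 3000, 12) for a twelve-band capture

single_band = bands[:, :, 0]   # pull out one spectral band for study
print(single_band.min(), single_band.max())
```

Keeping every band in one file is what lets researchers go back later and study the ink "color band by color band," as the excerpt describes.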