Thursday Threads: Beyond MARC, Library-controlled DRM, Spam Study

Threads this week without commentary. (It has been a long week in which only one of four flights happened without a delay, cancellation, or redirection.) The big items are a statement from the Library of Congress on re-envisioning the way bibliographic information travels, Douglas County (Colorado) Library’s experiment with taking ownership of ebooks and applying its own digital rights management, and a study on the ecosystem of spam.

Feel free to send this to others you think might be interested in the topics. If you find these threads interesting and useful, you might want to add the Thursday Threads RSS Feed to your feed reader or subscribe to e-mail delivery using the form to the right. If you would like a more raw and immediate version of these types of stories, watch my FriendFeed stream (or subscribe to its feed in your feed reader). Comments and tips, as always, are welcome.

Transforming our Bibliographic Framework: A Statement from the Library of Congress

Spontaneous comments from participants in the US RDA Test show that a broad cross-section of the community feels budgetary pressures but nevertheless considers it necessary to replace MARC 21 in order to reap the full benefit of new and emerging content standards. The Library now seeks to evaluate how its resources for the creation and exchange of metadata are currently being used and how they should be directed in an era of diminishing budgets and heightened expectations in the broader library community.

Also see John Mark Ockerbloom’s “Open data’s role in transforming our bibliographic framework” for more details and links to other posts talking about the Bibliographic Framework Transition Initiative.

Douglas County Library to Distribute Ebooks with its own DRM

We are pleased to announce a partnership between the Colorado Independent Publishers Association (CIPA), and two Colorado libraries: Red Rocks Community College Library, and Douglas County Libraries.

Many members of CIPA have entered the world of digital publishing. By June of 2011, Red Rocks Community College Library and Douglas County Libraries will not only offer eBooks from CIPA’s authors for checkout through their library catalogs, but will also allow click-through purchases of these titles.

New e-book partnership, Douglas County Libraries

There are more details in a post on the ALA Presidential Task Force on Equitable Access to Electronic Content blog, along with an earlier post about that library’s experiments with Adobe Content Server.

Study Says Spam Can Be Cut by Blocking Card Transactions

For years, a team of computer scientists at two University of California campuses has been looking deeply into the nature of spam, the billions of unwanted e-mail messages generated by networks of zombie computers controlled by the rogue programs called botnets. They even coined a term, “spamalytics,” to describe their work.

Now they have concluded an experiment that is not for the faint of heart: for three months they set out to receive all the spam they could (no quarantines or filters need apply), then systematically made purchases from the Web sites advertised in the messages.


The text was modified to remove a link on December 10th, 2012.

Thursday Threads: Amazon Pressures Publishers, Academic Spam, Mechanical Turk Spam, Multispectral Imaging

With the close of the year approaching, this issue marks the 14th week of DLTJ Thursday Threads. This issue has a publisher’s view of Amazon’s strong-arm tactics in book pricing, research into the possibility that academic authors could game Google Scholar with spam, demonstrations of how Amazon’s Mechanical Turk drives down the cost of enlisting humans to overwhelm anti-spam systems, and a story of multispectral imaging adding information in the process of digital preservation.

As the new year approaches, I wish you the best professionally and personally.

Books After Amazon

What happens when an industry concerned with the production of culture is beholden to a company with the sole goal of underselling competitors? Amazon is indisputably the king of books, but the issue remains, as Charlie Winton, CEO of the independent publisher Counterpoint Press puts it, “what kind of king they’re going to be.” A vital publishing industry must be able take chances with new authors and with books that don’t have obvious mass-market appeal. When mega-retailers have all the power in the industry, consumers benefit from low prices, but the effect on the future of literature—on what books can be published successfully—is far more in doubt.

Onnesha Roychoudhuri publishes this view of Amazon’s marketing practices in the latest issue of the Boston Review. From the publisher’s perspective, the strong-arm tactics described sound horrible. But the story also points to cracks appearing, at least for the bigger publishers. That may leave smaller, independent publishers in a big squeeze. [Via OCLC Research’s Above-the-Fold]

Academic Search Engine Spam and Google Scholar’s Resilience Against it

Abstract: In a previous paper we provided guidelines for scholars on optimizing research articles for academic search engines such as Google Scholar. Feedback in the academic community to these guidelines was diverse. Some were concerned researchers could use our guidelines to manipulate rankings of scientific articles and promote what we call ‘academic search engine spam’. To find out whether these concerns are justified, we conducted several tests on Google Scholar. The results show that academic search engine spam is indeed—and with little effort—possible: We increased rankings of academic articles on Google Scholar by manipulating their citation counts; Google Scholar indexed invisible text we added to some articles, making papers appear for keyword searches the articles were not relevant for; Google Scholar indexed some nonsensical articles we randomly created with the paper generator SciGen; and Google Scholar linked to manipulated versions of research papers that contained a Viagra advertisement. At the end of this paper, we discuss whether academic search engine spam could become a serious threat to Web-based academic search engines.

Joeran Beel and Bela Gipp have this article in the most recent issue of Journal of Electronic Publishing. In addition to being able to game Google Scholar, the authors note that Microsoft Academic Search and CiteSeer (as well as SciPlore, their own academic search engine currently under development) have the same issues. Although it is possible, we don’t know if it is being done, or even if there would be any penalties in the academic community for doing so.

Mechanical Turk: Now with 40.92% spam

At this point, Amazon Mechanical Turk has reached the mainstream. Pretty much everyone knows about the concept. Post small tasks online, pay people cents, and get thousands of micro-tasks completed. Unfortunately, this resulted in some unfortunate trends. Anyone who frequents just a little bit the market will notice the tremendous number of spammy HITs. (HIT = a task posted for completion in the market; stands for Human Intelligence Task). “Test if the ads in my website work”. “Create a Twitter account and follow me”. “Like my YouTube video”. “Download this app”. “Write a positive review on Yelp”. A seemingly endless amount of spam HITs come to the market, mainly with the purpose of spamming “social media” metrics. So, with Dahn Tamir and Priya Kanth (MS student at NYU), we decided to examine how big is the problem. How many spammers join the market? How many spam HITs are there?

This post from Panos Ipeirotis, Associate Professor at the IOMS Department at Stern School of Business of New York University, describes a review of activities posted to Amazon’s Mechanical Turk service. Spam is everywhere, and it appears that the Mechanical Turk is reducing the friction between buyers and workers of spam activity. [Via Ron Murray]

Cutting-Edge Imaging Helps Scholar Reveal 8th-Century Manuscript

With a manuscript like the St. Chad Gospels, multispectral imaging—a series of scans, each based on a single part of the color spectrum—allows his team to create images that have the equivalent of three-dimensional detail, down to revealing the thickness of brush strokes on letters and illustrations. Cockled pages can be virtually flattened out so that all their details can be studied. Studied color band by color band, the chemical composition of ink can be determined.

This article by Jennifer Howard at the Chronicle of Higher Education reviews the story of how 8th-century documents in England were digitized by scholars at the University of Kentucky. It caught my eye because of the mention of multispectral imaging; this is something that the JPEG2000 file format can natively store. Digitization at this level doesn’t just provide alternative, online access to documents; it actually adds new information to the process of researching those documents. [Note: the link is behind a publisher paywall. If you would like to see it, send me an e-mail and I’ll forward you a short-term link from the Chronicle’s website.]

Attempting to Run Comments without reCAPTCHA

I’m trying an experiment over the next couple days/weeks. I’m turning off the reCAPTCHA requirement for blog commenters (the figure-out-these-words-and-type-them-in anti-spam scheme I turned on three and a half years ago). The only automated scheme in place now is Akismet. This change was made Friday night, and over the weekend a few spam comments got through to “approved” status while 550 were in the “spam” queue. With reCAPTCHA in place, I would typically only get 10 or so comments that would make it through reCAPTCHA only to get caught by Akismet (and none through to approved comments). I could easily go through 10 or so comments a day looking for ones that would accidentally get trapped (maybe one a month), but I’m not going through 200 or more a day. So, if you comment on DLTJ and don’t see it immediately posted, please do let me know and I’ll fetch it out of the spam queue.

On Being Fodder for Questionable Twitter Posts

Okay, I know this is starting to seem like an obsession, but I can’t figure out why someone would be constructing tweets that consist of my blog post headlines and links back to my postings. I’m wondering how widespread this problem is, so I constructed a list of URLs to blog posts based on the Planet Code4Lib Atom feed and pointed them to the Ubervu service. Ubervu has a view into the Twitter firehose, and constructs reports of Twitter mentions of URLs. For instance, I can see all of the odd headline tweets for my previous postings through this service. I can then easily scan through the list for other people who seem to be affected by this strange phenomenon.
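
For anyone who wants to try something similar, here is a minimal sketch of the "list of URLs from an Atom feed" step. It is not the script I actually ran: the sample feed, the example.org URLs, and the function name `entry_links` are all made up for illustration, and a real run would fetch the planet's feed over the network instead of using an inline string.

```python
import xml.etree.ElementTree as ET

# Standard Atom namespace; the prefix "atom" is just a local label.
ATOM_NS = {"atom": "http://www.w3.org/2005/Atom"}

# Hypothetical sample standing in for a planet-style Atom feed.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Example Planet</title>
  <entry>
    <title>First post</title>
    <link rel="alternate" href="http://example.org/first-post"/>
    <link rel="replies" href="http://example.org/first-post/comments"/>
  </entry>
  <entry>
    <title>Second post</title>
    <link href="http://example.org/second-post"/>
  </entry>
</feed>
"""


def entry_links(atom_xml):
    """Return one post URL per entry in an Atom document."""
    root = ET.fromstring(atom_xml)
    urls = []
    for entry in root.findall("atom:entry", ATOM_NS):
        for link in entry.findall("atom:link", ATOM_NS):
            # A link with no rel attribute is treated as rel="alternate".
            if link.get("rel", "alternate") == "alternate" and link.get("href"):
                urls.append(link.get("href"))
                break  # keep only the first post link for this entry
    return urls


if __name__ == "__main__":
    for url in entry_links(SAMPLE_FEED):
        print(url)
```

The resulting list is what would then be handed, one URL at a time, to whatever mention-tracking service is being used.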

Why I Need Twitter Distillation Tools

The following may not be news to those who regularly hang out in Twitter-land, but the extent of the problem recently became clear to me: there is a bunch of spam in Twitter. More specifically, there appear to be robots that do nothing but scan the web for keywords and create tweets with links back to them. There appear to be some that value this service (judging by the number of followers of these Twitter users), but for me it just adds to the general clutter I find in Twitter.

So — here is the situation. Yesterday I posted a blog message that has my upcoming ALA Midwinter meeting plans. I’ve got a WordPress plugin that injects an announcement of that post into my Twitter stream. Since I like my blog to be the definitive source of discussions surrounding my blog posts, I also run another plug-in (from the Backtype service) that takes commentary found in other social media sites and adds them as comments to my blog posting. I’ve set the latter plug-in to add such comments to my “pending” queue rather than posting them automatically.

When I looked at my pending comment queue this morning, I saw that Backtype found not only my own tweet of the post [1], but also five others from “people” I haven’t encountered before. (No links here because I don’t want to offer any Google juice if there is something nefarious going on.)

| Twitter ID | Tweet Text | Followers | Profile URL |
|---|---|---|---|
| TechnoTrendz | Midwinter Meeting Schedule (Plus News of a Free Midwinter Airport …: Next planned event is the discussion mee.. | ,113 | “You are about to discover how YOU can join the most SECRET underground mastermind group of online money makers that are making 6-7 figures per MONTH…” |
| ddaville | Midwinter Meeting Schedule (Plus News of a Free Midwinter Airport …: Next planned event is the discussion meeting… | ,445 | “Get over 1,000 new high quality followers every week, easily generate an annual income in excess of $100,000…” |
| EshaWilliams | Midwinter Meeting Schedule (Plus News of a Free Midwinter Airport … | ,631 | “Watch the Exciting Video Below to Witness What The World’s Most Powerful Marketing Software Can Do For Your Online Business!” |
| soslab | Midwinter Meeting Schedule (Plus News of a Free Midwinter Airport … | | profile URL |
| FrankyConnelly | Midwinter Meeting Schedule (Plus News of a Free Midwinter Airport …: Next planned event is the discussion mee.. | | profile URL |

A couple of things to note:

  • In all cases, the Twitter IDs are unlike the ones I typically associate with spammers, which are names followed by a string of numbers. These look like real names.
  • Three of the five accounts have over 1,000 followers — usually the mark of someone legitimate. Heck…that is more than I have by far!
  • Three accounts (TechnoTrendz, ddaville, and FrankyConnelly) also add an excerpt of text from deep inside the post: “Next planned event is the discussion mee”. Two of these three use the same short link.
  • The original post did not use a third-party URL shortener. [2] These five posts contain 3 unique short links, with three of the five using the same short link.
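
The short-link comparison in the last bullet can be sketched as a simple grouping. This is only an illustration: the account names come from the table above, but the short links (and which account used which one) are invented placeholders, since I am deliberately not reproducing the real ones. The helper names `group_by_link` and `shared_links` are likewise made up.

```python
from collections import defaultdict

# Placeholder data: real account names, HYPOTHETICAL short links.
# The assignment of links to accounts is invented for illustration.
tweets = [
    ("TechnoTrendz", "http://short.example/a"),
    ("ddaville", "http://short.example/a"),
    ("EshaWilliams", "http://short.example/a"),
    ("soslab", "http://short.example/b"),
    ("FrankyConnelly", "http://short.example/c"),
]


def group_by_link(pairs):
    """Map each short link to the accounts that tweeted it, in input order."""
    groups = defaultdict(list)
    for account, link in pairs:
        groups[link].append(account)
    return dict(groups)


def shared_links(pairs, minimum=2):
    """Keep only links used by at least `minimum` different tweets."""
    return {
        link: accounts
        for link, accounts in group_by_link(pairs).items()
        if len(accounts) >= minimum
    }


if __name__ == "__main__":
    print(len(group_by_link(tweets)), "unique short links")
    for link, accounts in shared_links(tweets).items():
        print(link, "used by", ", ".join(accounts))
```

Accounts that share a short link are the ones most likely to be driven by the same automated source, which is why that overlap caught my attention.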

All told, this looks suspicious. It is also the sort of thing that leads me to use third-party tools to distill Twitter content into something more manageable and less spam-y. Have others noticed the same thing? Do you have any coping strategies for dealing with the Twitter stream?

Evening Update

Okay, something funky is going on. This post generated seven of these title-plus-short-URL tweets from people I’ve never heard of: viral_veronica (97 followers, no profile URL); Phillips_mktgrp (6,620 followers, profile URL to a broken hosted site); ReclinIncomeRSS (1,649 followers, no profile URL); dmeyer11 (1,696 followers, no profile URL); Tweeting4Cash (7,422 followers, broken profile URL); PaulGoldman123 (10,663 followers, spammy profile URL); and glennsnews (1,264 followers, no profile URL). One other thing I’ve noticed in common with all of these is that their tweets of my blog post headline are coming from the Twitterfeed service. Twitterfeed seems to take an RSS feed and automate the process of creating tweets, Facebook updates, and posts to other social networking services. So it would seem that someone is grabbing my blog post feed, or some derivative of a ping-back service or something else, and automatically feeding tweets into Twitter.
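
To make the pattern concrete, here is a rough sketch of how a feed-to-tweet tool could produce the title-plus-excerpt-plus-short-URL shape seen in these tweets. This is a guess at the general technique, not Twitterfeed's actual code: the function name `compose_tweet`, the ".." truncation marker, and the example URL are assumptions, and the 140-character limit is the one Twitter enforced at the time.

```python
TWEET_LIMIT = 140  # Twitter's character limit at the time of this post


def compose_tweet(title, short_url, excerpt="", limit=TWEET_LIMIT):
    """Build 'title: excerpt.. url', trimming the text so the URL always fits.

    Hypothetical helper for illustration only.
    """
    text = f"{title}: {excerpt}" if excerpt else title
    room = limit - len(short_url) - 1  # reserve one space before the URL
    if room < 2:
        raise ValueError("URL leaves no room for any text")
    if len(text) > room:
        # Cut the text and mark the cut with two dots, as in the tweets above.
        text = text[: room - 2].rstrip() + ".."
    return f"{text} {short_url}"


if __name__ == "__main__":
    print(
        compose_tweet(
            "Midwinter Meeting Schedule (Plus News of a Free Midwinter Airport Shuttle)",
            "http://short.example/a",  # placeholder short link
            excerpt="Next planned event is the discussion meeting on a topic "
            "that runs well past the length limit",
        )
    )
```

A scheme like this would explain why the excerpt gets chopped mid-word ("discussion mee..") while the short link always survives intact.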

So the question would be: for what purpose? To serve as fodder to mask truly spammy tweets? Because the account owner thinks their followers might all be interested in what I’m saying? What I do know is that this practice, at least for my blog posts, has increased dramatically in the past few weeks. I don’t think this was happening earlier this month…


  1. Oddly, I didn’t get a tweet from InfoPeep, the reposting service based on the Code4Lib Planet.
  2. I’m happy I have an inherently short URL to start with, so am using yet another WordPress plugin to internally direct users from short URLs to canonical ones.