Thursday Threads: Machine-Meaningful Web Content and Successful IPv6 Test

by Peter E. Murray · 8 minutes reading time

Two threads this week: the first is an announcement from the major search engines of a way they have agreed to discover machine-processable information in web pages. The search engines want this so they can do a better job of understanding the information on web pages, but it stomps on the linked data work that has been a hot topic in libraries recently. The second is a red-letter day in the history of the internet as major services tried out a new way for machines to connect. The test was successful, and its success means a big hurdle has been crossed as the internet grows up.

Feel free to send this newsletter to others you think might be interested in the topics. If you are not already subscribed to DLTJ's Thursday Threads, visit the sign-up page. If you would like a more raw and immediate version of these types of stories, follow me on Mastodon where I post the bookmarks I save. Comments and tips, as always, are welcome.

Introducing schema.org: Search engines come together for a richer web

Many sites are generated from structured data, which is often stored in databases. When this data is formatted into HTML, it becomes very difficult to recover the original structured data. Many applications, especially search engines, can benefit greatly from direct access to this structured data. On-page markup enables search engines to understand the information on web pages and provide richer search results in order to make it easier for users to find relevant information on the web. Markup can also enable new tools and applications that make use of the structure.

- schema.org – Home

As the quote above suggests, much of the data on the web starts as structured information, but HTML by itself lacks the semantic hooks to easily bring meaning to that information. Search engine robots and indexing algorithms have to try to infer the meaning of bits of information from the surrounding context. Sometimes they can get really good at it, and sometimes not. So last week Google, Microsoft Bing, and Yahoo! announced a project to promote machine-readable markup for structured data on web pages. What does this mean? Take this example from the schema.org documentation on how to describe an event:

<div itemscope itemtype="http://schema.org/Event">
  <a itemprop="url" href="nba-miami-philidelphia-game3.html">
  NBA Eastern Conference First Round Playoff Tickets:
  Miami Heat at Philadelphia 76ers - Game 3 (Home Game 1)
  </a>
  <time itemprop="startDate" datetime="2011-04-21T20:00">
    Thu, 04/21/11
    8:00 p.m.
  </time>
  <div itemprop="location" itemscope itemtype="http://schema.org/Place">
    <a itemprop="url" href="wells-fargo-center.html">
    Wells Fargo Center
    </a>
    <div itemprop="address" itemscope itemtype="http://schema.org/PostalAddress">
      <span itemprop="addressLocality">Philadelphia</span>,
      <span itemprop="addressRegion">PA</span>
    </div>
  </div>
  <div itemprop="offers" itemscope itemtype="http://schema.org/AggregateOffer">
    Priced from: <span itemprop="lowPrice">$35</span>
    <span itemprop="offerCount">1,938</span> tickets left
  </div>
</div>

There are a number of things going on here. Line 1 of the example marks the beginning of the description of an event, and that event will have a URL for more information (line 2), a start time (line 6), a location (lines 10 through 18) with an address (lines 14 through 17), and an item for sale -- in this case, a ticket (lines 19 through 22). With this information, the search engines can more easily understand the content on web pages, for instance to put events on a map or compare prices.
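To make the idea concrete, here is a minimal sketch of how a program might recover the structured data from markup like the example above. It uses only Python's standard `html.parser` module. The names `MicrodataParser` and `extract_items` are made up for this illustration, and the value rules are deliberately simplified (it assumes well-formed HTML, takes `href` for links and `datetime` for `<time>`, and ignores things like `<meta>` and `itemref`), so it is not a full microdata processor and certainly not how any search engine actually does it.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag; skipped to keep the stack balanced.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class MicrodataParser(HTMLParser):
    """Hypothetical, simplified collector for itemscope/itemprop markup."""

    def __init__(self):
        super().__init__()
        self.items = []   # top-level items found so far
        self.stack = []   # one frame per currently open element

    def _enclosing_item(self):
        # Nearest open element that carries an itemscope, if any.
        for frame in reversed(self.stack):
            if frame["item"] is not None:
                return frame["item"]
        return None

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        attrs = dict(attrs)
        frame = {"prop": attrs.get("itemprop"), "item": None,
                 "value": None, "text": []}
        if "itemscope" in attrs:
            frame["item"] = {"type": attrs.get("itemtype"), "properties": {}}
        elif tag == "a":
            frame["value"] = attrs.get("href")
        elif tag == "time":
            frame["value"] = attrs.get("datetime")
        self.stack.append(frame)

    def handle_data(self, data):
        # Text counts toward every open element, so nested tags are included.
        for frame in self.stack:
            frame["text"].append(data)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self.stack:
            return
        frame = self.stack.pop()
        if frame["item"] is not None:
            value = frame["item"]
        elif frame["value"] is not None:
            value = frame["value"]
        else:
            value = " ".join("".join(frame["text"]).split())
        owner = self._enclosing_item()
        if frame["prop"] and owner is not None:
            owner["properties"].setdefault(frame["prop"], []).append(value)
        elif frame["item"] is not None:
            self.items.append(frame["item"])


def extract_items(html):
    """Return the top-level microdata items found in an HTML string."""
    parser = MicrodataParser()
    parser.feed(html)
    parser.close()
    return parser.items


# A shortened version of the event example above.
SAMPLE = """
<div itemscope itemtype="http://schema.org/Event">
  <a itemprop="url" href="nba-miami-philidelphia-game3.html">Game 3</a>
  <time itemprop="startDate" datetime="2011-04-21T20:00">Thu, 04/21/11</time>
  <div itemprop="location" itemscope itemtype="http://schema.org/Place">
    <a itemprop="url" href="wells-fargo-center.html">Wells Fargo Center</a>
  </div>
</div>
"""

event = extract_items(SAMPLE)[0]
print(event["type"])
print(event["properties"]["startDate"])
```

The point of the sketch is that the program never has to guess: the item type and property names are read straight from the attributes, and the nesting of `itemscope` elements gives the nesting of the data.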

This, of course, intersects with what libraries have been doing for a long time -- describing things to make them easier for library patrons to find. The creators behind 'schema.org' have thought about this as well because they have ways of describing creative works of various types and the people associated with them. If you have wondered what the Semantic Web and Linked Data were about, this is sort of an example of what people have been trying to do to bring "more intelligence" to the data encoded on our web pages.

The 'schema.org' announcement, of course, hasn't been without controversy. In going down the path that they did, Google, Bing, and Yahoo! appear to have dismissed much of the last decade of work to bring the semantic web vision to fruition, primarily the use of a standard called RDFa to embed this information in HTML pages. Several people have commented on the project, and an effort was started to unify the 'schema.org' proposal with the semantic web work. It remains to be seen, though, what the impacts of this work will be.

World IPv6 Day Happens With Few Problems

WORLD IPV6 DAY is 8 June 2011 – The Future is Forever

The nation's largest telecom carriers, content providers, hardware suppliers and software vendors will be on the edge of their seats tonight for the start of World IPv6 Day, which is the most-anticipated 24 hours the tech industry has seen since fears of the Y2K bug dominated New Year's Eve in 1999. More than 400 organizations are participating in World IPv6 Day, a large-scale experiment aimed at identifying problems associated with IPv6, an upgrade to the Internet's main communications protocol known as IPv4.

- World IPv6 Day: Tech industry's most-watched event since Y2K, by Carolyn Duffy Marsan, Network World, 7-Jun-2011

Internet administrators were 99.9999% sure that World IPv6 Day would go by without any real problems. Of course, when you’re dealing with something as big as the Internet, even six nines of up-time could mean hundreds of thousands of users with trouble. So far, though, all is well. ...

We now know that the IPv6 Internet can co-exist peacefully with the IPv4 Internet. That’s a darn good thing since we can look forward to more than a decade of the two them working side by side as the older IPv4 Internet slowly fades away.

- The World IPv6 Day report card, by Steven J. Vaughan-Nichols, ZDNet, 8-Jun-2011

As noted in a previous DLTJ Thursday Thread, the internet is running out of addresses for devices on the network. A decade and a half ago the Internet Engineering Task Force realized this and created a next-generation standard called Internet Protocol version 6, or IPv6, that would solve this problem by dramatically increasing the number of possible addresses (340,282,366,920,938,463,463,374,607,431,768,211,456 versus 4,294,967,296 with IPv4). Unfortunately, for deep technical reasons, it is not easy to convert from one to the other; they coexist peacefully, but one needs a gateway to translate between the two. And because IPv4 is embedded in lots of devices like printers and copiers and phones and set-top TV boxes and washing machines (any other strange internet-connected devices?), IPv4 is going to be with us for a while.
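The two address counts above fall out of the address widths: IPv4 addresses are 32 bits long and IPv6 addresses are 128 bits long. A quick sketch of the arithmetic in Python, cross-checked against the standard `ipaddress` module (the variable names are just for this illustration):

```python
import ipaddress

ipv4_total = 2 ** 32    # 32-bit addresses
ipv6_total = 2 ** 128   # 128-bit addresses

print(f"IPv4: {ipv4_total:,}")
print(f"IPv6: {ipv6_total:,}")

# The IPv6 space is 2**96 times the size of the IPv4 space.
ratio = ipv6_total // ipv4_total
print(ratio == 2 ** 96)

# Cross-check: the "everything" networks contain exactly that many addresses.
assert ipaddress.ip_network("0.0.0.0/0").num_addresses == ipv4_total
assert ipaddress.ip_network("::/0").num_addresses == ipv6_total
```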

The test on June 8th was to have major internet services and networks such as Google/YouTube, Facebook, and Yahoo! make their servers reachable over both IPv4 and IPv6 under the same hostname (e.g. "www.google.com"). For the most part, the test proved successful. So the march to IPv6-for-all is on. If you are a software developer, have you considered what changes you'll need to make to your code (as in "IP address-based access restrictions")?
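As one illustration of the kind of change involved, here is a minimal sketch of an IP address-based access check that accepts both address families. It assumes Python's standard `ipaddress` module. The allow-list and the `is_allowed` function are hypothetical, and the ranges shown are the reserved documentation ranges, not anyone's real network.

```python
import ipaddress

# Hypothetical allow-list using reserved documentation-only ranges.
ALLOWED_NETWORKS = [
    ipaddress.ip_network("192.0.2.0/24"),    # IPv4 documentation range
    ipaddress.ip_network("2001:db8::/32"),   # IPv6 documentation range
]


def is_allowed(client_address: str) -> bool:
    """Return True if the client address falls inside an allowed network."""
    try:
        addr = ipaddress.ip_address(client_address)
    except ValueError:
        return False  # not a valid IPv4 or IPv6 address

    # A dual-stack server may report an IPv4 client as ::ffff:a.b.c.d,
    # so unwrap IPv4-mapped addresses before comparing.
    if addr.version == 6 and addr.ipv4_mapped is not None:
        addr = addr.ipv4_mapped

    # Only compare addresses and networks of the same family.
    return any(
        addr.version == net.version and addr in net
        for net in ALLOWED_NETWORKS
    )


print(is_allowed("192.0.2.17"))         # inside the IPv4 range
print(is_allowed("2001:db8::1"))        # inside the IPv6 range
print(is_allowed("::ffff:192.0.2.17"))  # IPv4 client seen through IPv6
print(is_allowed("198.51.100.1"))       # outside both ranges
```

Code that stores addresses as dotted-quad strings, matches them with IPv4-only patterns, or keeps them in 32-bit integer columns is the sort of thing that would need a second look.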