“Archiving and Preserving the Web” from the Internet Archive perspective

In case you were wondering what some of the back-channel discussion on the #code4lib IRC channel was about on Tuesday, Ed Summers and I were watching an EDUCAUSE webcast on the Internet Archive’s Archive-It project. Archive-It is a subscription service that allows institutions to crawl and search their own web archive through a web application. The EDUCAUSE Live! webcast featured the project manager and the Senior Crawl Engineer (what a title!) from the Internet Archive, who talked about not only the service but also the open source web crawler and ARC access tools (descriptions copied from the project home page):

  • Tom Emerson’s libarc,
    “A C++ library for processing Internet Archive ARC, CDX, and DAT
    files.” This project used to reside at
    libarc home page
    but was moved here, 09/14/2004. See the
    README.
  • NutchWAX is
    Web Archive Collection Search based on
    Nutch.
  • infiniteurl is an
    infinite source of pages used for testing crawlers.
  • Hedaern, an ARC
    ‘access’ tool, puts up a WebUI that allows URL+timestamp
    lookups and full-text searching of ARCs. Hedaern is currently
    ‘alpha’ and is LGPL. It is written in python — it includes python
    ARC reader/writers — and was donated by Mark Williamson of the
    British Library. To learn more about Hedaern, start with the
    guide.
  • wera is an archive viewer
    application that gives Internet Archive Wayback Machine-like
    access to web archive collections. Wera is a php5 application based
    on (and replaces)
    the NwaToolset. Currently wera
    uses NutchWAX as its search engine
    core and the ARCRetriever webapp (included) to fetch records from
    ARCs.
  • wayback is an open-source
    version of the Internet Archive Wayback Machine. [As an aside, one of the presenters said that this was a pretty old version and that they were looking to update it soon to bring the open source version more in line with the archive.org "live" version.]
  • WAXToolbar is a Firefox
    extension for browsing Web Archives.

The archive of the EDUCAUSE Live! webcast is open for anyone to view. The presenters’ PowerPoints and instructions on how to view the webcast archive can be found at http://www.educause.edu/content.asp?SECTION_ID=201&bhcp=1. The presentation abstract is included below:

Libraries and archives have long collected information to serve scholars in understanding history, culture, and society. Today, Web pages have replaced newsletters; blogs have supplanted diaries; and many government forms and documents are more readily accessible on the Web than in paper form. As part of an effort to appropriately document and capture today’s information for tomorrow’s use, institutions must adopt a Web archiving strategy. Fortunately, Archive-It takes much of the burden out of the task. Archive-It is a Web application uniquely designed for the needs of university and government institutions interested in preserving Web content. The application allows organizations with limited infrastructure and technical staff to collect, catalog, search, and manage archived Web content through a Web interface. Built on open source components by the Internet Archive and the International Internet Preservation Consortium, Archive-It creates and stores the ARC files that are the standard format for Web archiving. In this presentation, two representatives from the Internet Archive will discuss the Archive-It project.

The text was modified to update a link from http://cvs.sourceforge.net/viewcvs.py/archive-access/archive-access/projects/libarc/LOCAL_README.txt?rev=1 to http://libarc.cvs.sourceforge.net/viewvc/libarc/libarc/README?revision=1.5.

The text was modified to update a link from http://cvs.sourceforge.net/viewcvs.py/archive-access/archive-access/projects/libarc/ to http://libarc.cvs.sourceforge.net/viewvc/libarc/libarc/.

The text was modified to update a link from http://cvs.sourceforge.net/viewcvs.py/archive-access/archive-access/projects/hedaern/ to http://archive-access.cvs.sourceforge.net/viewvc/archive-access/archive-access/projects/hedaern/.

The text was modified to update a link from http://cvs.sourceforge.net/viewcvs.py/archive-access/archive-access/projects/hedaern/docs/guide.pdf?rev=1 to http://archive-access.cvs.sourceforge.net/viewvc/archive-access/archive-access/projects/hedaern/docs/guide.pdf?revision=1.1.

The text was modified to update a link from http://www.educause.edu/content.asp?SECTION_ID=201&bhcp=1 to http://www.educause.edu/library/resources/archiving-and-preserving-web on August 22nd, 2012.

(This post was updated on 22-Aug-2012.)