"Archiving and Preserving the Web" from the Internet Archive perspective
In case you were wondering what some of the back-channel discussion on the #code4lib IRC channel was about on Tuesday, Ed Summers and I were watching an EDUCAUSE webcast on the Internet Archive's Archive-It project. Archive-It is a subscription service that allows institutions to crawl and search their own web archive through a web application. Tuesday's EDUCAUSE Live! webcast featured the project manager and the Senior Crawl Engineer (what a title!) from the Internet Archive, who talked about not only the service but also the open source web crawler and ARC access tools (descriptions copied from the project home page):
- Tom Emerson's libarc, "A C++ library for processing Internet Archive ARC, CDX, and DAT files." This project used to reside at libarc home page but was moved here, 09/14/2004. See the README.
- NutchWAX is Web Archive Collection Search based on Nutch.
- infiniteurl is an infinite source of pages used for testing crawlers.
- Hedaern, an ARC 'access' tool, puts up a WebUI that allows URL+timestamp lookups and full-text searching of ARCs. Hedaern is currently 'alpha' and is LGPL. It is written in Python -- it includes Python ARC reader/writers -- and was donated by Mark Williamson of the British Library. To learn more about Hedaern, start with the guide.
- wera is an archive viewer application that gives an Internet Archive Wayback Machine-like access to web archive collections. Wera is a php5 application based on -- and replaces -- the NwaToolset. Currently wera uses NutchWAX as its search engine core and the ARCRetriever webapp (included) for fetching records from ARCs.
- wayback is an open-source version of the Internet Archive Wayback Machine. [As an aside, one of the presenters said that this was a pretty old version and that they were looking to update it soon to bring the open source version more in line with the archive.org "live" version.]
- WAXToolbar is a Firefox extension for browsing Web Archives.
The archive of the EDUCAUSE Live! webcast is open for anyone to view. The presenters' PowerPoint slides and instructions on how to view the webcast archive can be found at http://www.educause.edu/content.asp?SECTION_ID=201&bhcp=1. The presentation abstract is included below:
Libraries and archives have long collected information to serve scholars in understanding history, culture, and society. Today, Web pages have replaced newsletters; blogs have supplanted diaries; and many government forms and documents are more readily accessible on the Web than in paper form. As part of an effort to appropriately document and capture today's information for tomorrow's use, institutions must adopt a Web archiving strategy. Fortunately, Archive-It takes much of the burden out of the task. Archive-It is a Web application uniquely designed for the needs of university and government institutions interested in preserving Web content. The application allows organizations with limited infrastructure and technical staff to collect, catalog, search, and manage archived Web content through a Web interface. Built on open source components by the Internet Archive and the International Internet Preservation Consortium, Archive-It creates and stores the ARC files that are the standard format for Web archiving. In this presentation, two representatives from the Internet Archive will discuss the Archive-It project.
The text was modified to update a link from http://cvs.sourceforge.net/viewcvs.py/archive-access/archive-access/projects/libarc/LOCAL_README.txt?rev=1 to http://libarc.cvs.sourceforge.net/viewvc/libarc/libarc/README?revision=1.5.
The text was modified to update a link from http://cvs.sourceforge.net/viewcvs.py/archive-access/archive-access/projects/libarc/ to http://libarc.cvs.sourceforge.net/viewvc/libarc/libarc/.
The text was modified to update a link from http://cvs.sourceforge.net/viewcvs.py/archive-access/archive-access/projects/hedaern/ to http://archive-access.cvs.sourceforge.net/viewvc/archive-access/archive-access/projects/hedaern/.
The text was modified to update a link from http://cvs.sourceforge.net/viewcvs.py/archive-access/archive-access/projects/hedaern/docs/guide.pdf?rev=1 to http://archive-access.cvs.sourceforge.net/viewvc/archive-access/archive-access/projects/hedaern/docs/guide.pdf?revision=1.1.
The text was modified to remove a link to http://nwa.nb.no/ on December 31st, 2010.
The text was modified to remove a link to http://nutch.org on December 31st, 2010.
The text was modified to update a link from http://www.educause.edu/content.asp?SECTION_ID=201&bhcp=1 to http://www.educause.edu/library/resources/archiving-and-preserving-web on August 22nd, 2012.