Anatomy of the Zotero Library to RSS Feed Pipe

Note! A new feature on the Zotero website does away with the need to use this Yahoo! Pipe. RSS feeds are now generated by the Zotero website itself. Read more about it on the Zotero blog.

Last week I posted about a Yahoo Pipes construct that turns a Zotero website library into an RSS feed. As Dan Cohen noted in a twitter response to the posting, the Zotero team is planning to add an RSS capability in a future release of the website, so this pipe will ultimately be usurped by that capability, but in the meantime it is a handy tool. It was my first full-scale foray into creating a Yahoo Pipes construct from scratch, so I thought it would be useful to document how it works (in case I need to do something similar again). You might find this useful, too; especially the part about how to put a pubDate element into the RSS feed.

[caption id="attachment_788" align="alignnone" width="927" caption="Graphic representation of the Yahoo! Pipes construct to turn a Zotero library web page into an RSS feed."]

A couple of things to note:

The basic structure is to iterate over a list of items. The modules on the right side (above) work to construct a URL to the Zotero Library page while the modules on the left actually get that page, find the list of items (by defining where the list is located in the HTML structure), and massage them into an RSS feed. In the case of this example, the Pipe is iterating over a list of MARKDOWN_HASHc5a022f415caf33831dd2bd16bb78acdMARKDOWN_HASH elements in the table that displays the library items.
Strings with elements separated by periods are XPATH-like constructs. For instance, the "Path to Item List" in the upper right module is MARKDOWN_HASHde59e576fea258ba884e4a675dc71b1aMARKDOWN_HASH. The corresponding XPATH is MARKDOWN_HASHa35f832ec98dad177813cb1ff5a014e6MARKDOWN_HASH. Note a few things: the leading element (MARKDOWN_HASHf8157d1ab1b65c3518e92989802d8ab2MARKDOWN_HASH) is omitted, slashes are replaced by periods, and ordinal numbers are separated from their base elements with periods and are subtracted by 1. Later on there are examples where attribute names ("MARKDOWN_HASHbd25526c088b87e9cb6d3660bd3b1c5fMARKDOWN_HASH") and references to text nodes ("MARKDOWN_HASH17c174cbe0e5faa4f2895a15e0c70c5dMARKDOWN_HASH") are also included in these path strings.

[/caption]

The process starts with the two Private Text Input modules at the top right — one each for the Zotero Username and the Zotero User Number. The defaults are set to my values, and they are marked private — meaning that if someone clones this pipe, these values are not carried along.

Directly below is a URL Builder module. The base URL is http://www.zotero.org, and there are three path elements: the first is a connection from the Zotero Username Private Text Input, the second is a connection from the Zotero User Number Private Text Input, and the third is the literal "items". This builds a URL that looks like http://www.zotero.org/dltj/683/items and that corresponds to the Zotero user's library items page.

Starting at the upper left of the diagram, the output of the URL Builder is connected to the URL field of a Fetch Data module. The "Path to Item List" parameter is set to body.div.1.div.div.1.table.tbody.tr and that is a pointer to the portion of the XHTML document that contains the library items. Because the Zotero website is outputting XML (as XHTML), we can use the Fetch Data module and parse the page as if it was an XML document. The Path to Item List is an XPATH-like structure that points into the document structure (see note above). The result of this module is a list of items — the table rows in this case — that are processed by the remaining modules.

The next module down is a Rename module, where the value of the XPATH-like path item.td.0.a.content is copied to the item title element. The XPATH, from the root of the "Path to Item List" in the module above, is td/a; note here the added item at the front and the content at the end. Specifying the zeroth td element isn't needed, but it brings symmetry with subsequent modules. content corresponds to the text node under the a element when viewing this as an XML document.

What follows is a series of Loop modules that act on different parts of the items in the list. The first builds the link element of the item using the String Builder module. The href in the XML is a relative path, so the String Builder adds the literal "http://www.zotero.org" to the value found in item.td.0.a.href (the href attribute of the anchor element of the first td element). The resulting string is assigned to the link element of each item in the list.

The second Loop module encapsulates a Date Builder module, and this is the inspiration for writing this post. It took a very long time to figure out how to get the pubDate element into each item of the resulting RSS feed. As it turns out, one cannot simple assign the pubDate element like we did the title element above. Instead, one sets the timestamp to the y:published element and Yahoo Pipes takes it from there. And it isn't enough to assign a text string to that element; it has to be a Date type, constructed using the Date Builder module. The Date Builder module is very flexible in what it accepts, and it creates a canonical timestamp form that can be used by other modules. In this case, the Date Builder module takes as a source input the string value found at item.td.2.a.content. Believe me -- it took a long time to figure this out, and it was only done by piecing together various suggestions and examples; there doesn't seem to be any clear documentation about this.

The third and fourth Loop modules go together. The third takes the value found in item.td.0.a.span.class and applies a String Regex module to it. The value of that class attribute contains the type of item in the Zotero library, and it takes the form of "img-book" or "img-journalArticle" or "img-conferencePaper". There are two regular expressions defined: the first removes the "img-" prefix from each value and the second replaces all instances of an upper case letter with a space plus the lower case version of the letter. The latter rule turns "journalArticle" into "journal article" (note that there is a space in this field prior to the \L part). The result is assigned to the item.itemType element. This is used in the final Loop module to build a item.description element to create the string:

This journal article was saved to my Zotero library.

That's all there is to it. Yahoo Pipes applies all of these modules to each of the items in the list retrieved from the Zotero library page and generates the corresponding RSS feed.