intertwingly

It’s just data

Filtering Feeds with XSLT


XML.com is a group blog, and if you look at its feed using an aggregator like Bloglines or Google Reader you will see the author of each entry identified immediately below the title.  Other aggregators may display this information in different places.  With Venus, your template could put this information any place you like.

But not all feeds are this straightforward.  Following is a discussion of how to do additional normalization on three feeds.

Tim Bray

If you take a look at Tim Bray’s comment feed using Bloglines or Google Reader you will see the name of each comment’s author.  Twice.  The reason?  Tim puts the name in the author element and redundantly repeats it in the content.

Is it valid for Tim to structure his feeds this way?  Yes.  Is it ideal?  That’s certainly open for debate.  It certainly causes no harm, but if you do have a tool which understands atom:author (or dc:creator or can parse the name out of the comments that often accompany an RSS 2.0 author element), it makes this feed a bit different than all of the others.  Not as bad as some of the feeds described below, but this is the easiest one to fix, so I’m starting with this one.

Stripping the redundant content is easy using XSLT, particularly as Tim has marked such paragraphs with class="from" attributes.  Example:

<content type="xhtml"><div xmlns="http://www.w3.org/1999/xhtml"><div class="comment"><p class="from">From: <a href="http://tbray.org/ongoing/">Tim Bray</a></p>

First you start with an identity transform.  Such a transform matches all nodes and all attributes, copies what it matches, and then applies templates on all nested content.
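For readers who have not written one before, here is a minimal sketch of such an identity transform in XSLT 1.0.  The atom and xhtml prefix bindings are an assumption on my part: they use the standard Atom and XHTML namespace URIs, and are declared here only so that more specific rules added to the same stylesheet can refer to elements in the feed.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:atom="http://www.w3.org/2005/Atom"
    xmlns:xhtml="http://www.w3.org/1999/xhtml">

  <!-- Identity transform: match every attribute and every node,
       copy what was matched, then apply templates to its
       attributes and children. -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
```

Because the pattern @*|node() has the lowest default priority, any more specific template placed in the same stylesheet wins for the nodes it matches, and everything else passes through unchanged.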

Then you define a more specific rule to match the text you want to strip.  Something like this will do just fine:

<xsl:template match="xhtml:p[@class='from']"></xsl:template>

This matches all such paragraphs and replaces them with nothing.  One can make the match more specific and make use of the empty tag syntax to come up with the following final form:

<xsl:template match="atom:content/xhtml:div/xhtml:div/xhtml:p[@class='from']"/>

Radar O’Reilly

Radar O’Reilly is a group blog by the same company that produces the xml.com feed, but this feed when viewed through Bloglines and Google Reader also exhibits the same issue of showing the name twice.

This time the issue is a bit obscured by the content being stored as double escaped HTML, and without a marker class attribute to look for.  Example:

<author>
  <name>Tim O'Reilly</name>
  <uri>http://tim.oreilly.com/opensource/paradigmshift_0504.html</uri>
</author>
<content type="html" xml:lang="en" xml:base="http://radar.oreilly.com/">
        &lt;p&gt;By Tim O'Reilly&lt;/p&gt;

The normalization that Venus performs helps a bit, as what is actually passed to the filter is:

<content type="xhtml" xml:lang="en"><div xmlns="http://www.w3.org/1999/xhtml"><p>By Tim O'Reilly</p>

That’s something we can work with.  But again, there is the issue of the unmarked paragraph.  We certainly don’t want to delete all paragraphs.  Perhaps we could get away with deleting the first paragraph, but that wouldn’t be good if the layout changed.

What we can do, however, is delete the first paragraph, but only if it equals the string By followed by the author’s name.  This turns out to be fairly straightforward too with XPath expressions, thus:

<xsl:template match="atom:content/xhtml:div/xhtml:p[1][. =
  concat('By ', ../../../atom:author/atom:name)]"/>
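The ../../../ steps climb from the paragraph to the div, then to the content element, then to the entry.  As a hedged variation (not the filter as published), the same test can be written with the ancestor axis so that it does not depend on counting levels, and with normalize-space() so that stray whitespace around either string does not defeat the comparison.  The namespace URIs below are the standard Atom and XHTML ones, and I am assuming the entry carries a single author.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:atom="http://www.w3.org/2005/Atom"
    xmlns:xhtml="http://www.w3.org/1999/xhtml">

  <!-- Drop the first paragraph of the content, but only when its text
       is exactly "By " followed by the entry's author name
       (compared after whitespace normalization). -->
  <xsl:template match="atom:content/xhtml:div/xhtml:p[1][normalize-space(.) =
    concat('By ', normalize-space(ancestor::atom:entry/atom:author/atom:name))]"/>

  <!-- Identity transform: copy everything else unchanged. -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
```

Whether the extra tolerance is wanted is a judgment call: the stricter original will simply stop matching if the feed's layout changes, which is the safer failure mode.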

Google Blog

Now, let’s take a case that is a bit more complicated.  The Google Blog is a group blog, where the content is posted by a smaller group of people who administer the weblog.  To see what I mean, take a look at the feed through Bloglines or Google Reader.

What you will see is “by Karen” or “by Molly Graham” followed by “Posted by” and another name.  My feeling is that this is exactly backwards.  The content was clearly written by the latter-named person, and posted by the former.  Let’s take a look at how this information appears in the feed itself:

<author><name>Karen</name></author>
<content type='html'>&lt;span class="byline-author"&gt;Posted by Adam Bosworth, Vice President&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;

The content is then normalized by Venus into:

<content type="xhtml"><div xmlns="http://www.w3.org/1999/xhtml"><span class="byline-author">Posted by Adam Bosworth, Vice President</span><br/><br/>

With this as the starting point, we want to do three things.  First, we want to replace the author name with the byline author:

<xsl:template match="atom:entry/atom:author[../atom:content/xhtml:div/xhtml:span[@class='byline-author' and substring(.,1,10)='Posted by ']]">
  <xsl:copy>
    <atom:name>
     <xsl:value-of select="substring(../atom:content/xhtml:div/xhtml:span[@class='byline-author'],11)"/>
    </atom:name>
    <xsl:apply-templates select="*[name()!='name']"/>
  </xsl:copy>
</xsl:template>

This looks daunting, but it actually is quite straightforward.  The first line matches the author element, but only if it has a sibling content element which contains a child div element which contains a child span element which, in turn, has a class attribute with the value byline-author and text that starts with the string Posted by .  All in all, the XPath expression almost seems more readable than the English prose which attempts to explain it.

If there is a match, the author element itself is copied, then an atom:name child element is added, with the value of the byline author, starting at the eleventh character (i.e., after the Posted by string which we previously verified).

Then templates are applied for all child elements which do not have an element name of name.
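One caveat about that last step: name() returns the qualified name as it was written in the source, so an entry that spelled the element as atom:name with a prefix would not be filtered out.  Venus's normalized output may well never do that, in which case the test above is sufficient.  A sketch of a namespace-aware alternative, assuming the standard Atom and XHTML namespace URIs, tests the element itself rather than its spelling:

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:atom="http://www.w3.org/2005/Atom"
    xmlns:xhtml="http://www.w3.org/1999/xhtml">

  <!-- Replace the author name with the byline author. -->
  <xsl:template match="atom:entry/atom:author[../atom:content/xhtml:div/xhtml:span[@class='byline-author' and substring(.,1,10)='Posted by ']]">
    <xsl:copy>
      <atom:name>
        <xsl:value-of select="substring(../atom:content/xhtml:div/xhtml:span[@class='byline-author'],11)"/>
      </atom:name>
      <!-- Keep every child element except the Atom name element,
           regardless of which prefix (if any) the source used. -->
      <xsl:apply-templates select="*[not(self::atom:name)]"/>
    </xsl:copy>
  </xsl:template>

  <!-- Identity transform: copy everything else unchanged. -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
```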

The next thing we do is remove the (now redundant) byline author, thus:

<xsl:template match="xhtml:div/xhtml:span[@class='byline-author' and substring(.,1,10)='Posted by ']"/>

Finally, we remove the two line breaks, but only if they follow the byline author:

<xsl:template match="xhtml:br[preceding-sibling::*[1][@class='byline-author' and substring(.,1,10)='Posted by ']]"/>
<xsl:template match="xhtml:br[preceding-sibling::*[2][@class='byline-author' and substring(.,1,10)='Posted by ']]"/>
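The two line break rules differ only in the position they test, so they can be folded into one.  The following is a sketch of that alternative, not the published filter; the namespace URIs are the standard Atom and XHTML ones.  On the preceding-sibling axis, position() counts backwards from the br, so position() &lt;= 2 selects its two nearest preceding sibling elements.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:atom="http://www.w3.org/2005/Atom"
    xmlns:xhtml="http://www.w3.org/1999/xhtml">

  <!-- Remove a line break when either of the two element siblings
       immediately before it is the "Posted by" byline span. -->
  <xsl:template match="xhtml:br[preceding-sibling::*[position() &lt;= 2][@class='byline-author' and substring(.,1,10)='Posted by ']]"/>

  <!-- Identity transform: copy everything else unchanged. -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
```

The single rule matches exactly the same line breaks as the pair above; keeping two separate rules is arguably easier to read, so this is a matter of taste.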

Conclusion

These three filters can be found at http://www.intertwingly.net/code/venus/filters/delDupName/.  The improvement they make to the feeds is actually fairly modest, but they do show some of the potential for using XSLT as filters.