Simon Fell:
What I'd like is for the aggregator to normalize the source
data (as this is something it's already worked out), so that the
plug-in doesn't have to cope with the wide range of rss versions,
modules and extensions floating around.
If you are going to normalize, my two cents would be to do
things like remove script, meta, embed, and object tags and the like;
resolve all relative links; make sure that the result is well formed; etc.
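A minimal sketch of those three steps, using only the Python standard library. Nothing here comes from any aggregator discussed in this thread: the class name `Normalizer`, the helper `normalize`, the tag sets, and the example base URL are all assumptions made for illustration.

```python
from html import escape
from html.parser import HTMLParser
from urllib.parse import urljoin

# Assumed policy for this sketch: these tags are dropped together with their content.
DROP = {"script", "meta", "embed", "object"}
# Elements that never take an end tag; emitted in self-closed form.
VOID = {"br", "hr", "img", "input", "link", "meta", "embed"}
# Attributes treated as URLs and resolved against the base.
URL_ATTRS = {"href", "src"}


class Normalizer(HTMLParser):
    def __init__(self, base_url):
        super().__init__(convert_charrefs=True)
        self.base_url = base_url
        self.out = []
        self.open_tags = []   # stack of tags we have emitted but not yet closed
        self.skip_depth = 0   # > 0 while inside a dropped element

    def handle_starttag(self, tag, attrs):
        if tag in DROP:
            if tag not in VOID:
                self.skip_depth += 1
            return
        if self.skip_depth:
            return
        parts = [tag]
        for name, value in attrs:
            value = "" if value is None else value
            if name in URL_ATTRS:
                value = urljoin(self.base_url, value)  # resolve relative links
            parts.append(f'{name}="{escape(value, quote=True)}"')
        if tag in VOID:
            self.out.append("<" + " ".join(parts) + " />")
        else:
            self.out.append("<" + " ".join(parts) + ">")
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in DROP:
            if tag not in VOID and self.skip_depth:
                self.skip_depth -= 1
            return
        if self.skip_depth or tag in VOID or tag not in self.open_tags:
            return
        # Close anything left open inside this element so nesting stays well formed.
        while self.open_tags:
            top = self.open_tags.pop()
            self.out.append(f"</{top}>")
            if top == tag:
                break

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(escape(data, quote=False))

    def result(self):
        self.close()
        while self.open_tags:  # close whatever the source never closed
            self.out.append(f"</{self.open_tags.pop()}>")
        return "".join(self.out)


def normalize(html, base_url):
    parser = Normalizer(base_url)
    parser.feed(html)
    return parser.result()


messy = '<p>Hi <a href="/x">link<script>alert(1)</script><br><b>bold</p>'
clean = normalize(messy, "http://example.org/blog/")
```

A real aggregator would likely want an allow-list of tags and attributes rather than this drop-list, and a tool such as HTML Tidy handles far more malformed input than this sketch does.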
Presumably you would have to normalize onto the most descriptive and verbose possibility, which would be RSS 1.0. All your other suggestions are good too.
xhtml:body is more descriptive, yet less verbose than content:encoded.
The only changes that Simon would have to make to turn this into RSS 1.0 would be to add an rdf:about attribute and to change the namespace for the item element.
I'm doing this in Ideagraph - JTidy (port of HTML Tidy) makes the data well-formed XML, then an XSLT stylesheet converts other formats to RSS 1.0 (if the feed is RSS 1.0 to begin with, then obviously the XSLT isn't needed). This means I can maximise the app's support for modules.
A bonus is that I've tried the same approach to scrape non-RSS data, and it seems to work pretty well (though I've not coded this into the app yet).
xhtml:body as a standalone XML element has no meaning in the RSS 1.0 model (has it any in the RSS 2.0 spec?), so it is completely undescriptive!
Danny, SharpReader is an example of the value of mining the content for outbound links.
The purpose of <xhtml:body> is to try to bring such semantics within the reach of more tools. See million dollar markup for related discussion.
From this perspective, it is the <content:encoded> tag that is completely undescriptive - i.e., it renders the meat of the entry as a blob.
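To illustrate the difference, here is a rough sketch of pulling outbound links from an item that carries its content as <xhtml:body>, using a plain XML parser and no HTML-specific code. The sample item, its URLs, and the function name `outbound_links` are invented for the example; this is not how SharpReader does it.

```python
import xml.etree.ElementTree as ET

XHTML = "http://www.w3.org/1999/xhtml"

# Hypothetical item, made up for this sketch.
ITEM = """<item xmlns:xhtml="http://www.w3.org/1999/xhtml">
  <title>Example entry</title>
  <xhtml:body>
    <xhtml:p>See <xhtml:a href="http://example.org/one">one</xhtml:a> and
    <xhtml:a href="http://example.com/two">two</xhtml:a>.</xhtml:p>
  </xhtml:body>
</item>"""


def outbound_links(item_xml):
    """Return the href of every XHTML anchor inside the item's xhtml:body."""
    item = ET.fromstring(item_xml)
    body = item.find(f"{{{XHTML}}}body")
    if body is None:
        return []
    return [a.get("href") for a in body.iter(f"{{{XHTML}}}a") if a.get("href")]


links = outbound_links(ITEM)
```

With content:encoded, the same tool would first have to unescape the blob and run it through an HTML parser before any of this became visible.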
Given this as the goal (to more fully capture and expose the semantics of the item itself), what would be the appropriate RDF serialization for this data?
Personally I suspect this draws the line on the content-metadata divide - if you need to get at the meat of the content, then it's probably not appropriate to mark it up within a (metadata) feed.
The RDF feed (or similar) provides explicit metadata, including the URL of the item; if you need implicit metadata, HTTP GET it.
Another angle on this might be to say that the content agent knows best - so for the xhtml namespace, let an xhtml-specific tool do the work (outbound links, wordcount, display, whatever).
I've not tried SharpReader yet, thanks for the ref.
The Content module does cover usage of unencoded XML, but does so in an overengineered way typical of early RSS/RDF development.
The simpler and now more popular content:encoded property came later.
But there's still a glimmer of solution deep in the content module: rdf:parseType="Literal". The key is that one is looking for a "property" to hang the XML off of. xhtml:body isn't a "property" so much as it just "is" html. To make it more RDF descriptive, just make up a reasonable property name and make sure it parses as not-RDF.
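As a sketch of what that might look like, the snippet below builds an RSS 1.0 item whose XHTML content hangs off a made-up property marked rdf:parseType="Literal". The property name `ex:body`, its namespace `http://example.org/ns#`, and the item URL are assumptions invented for illustration; they are not part of any module.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RSS = "http://purl.org/rss/1.0/"
XHTML = "http://www.w3.org/1999/xhtml"
EX = "http://example.org/ns#"  # hypothetical namespace for the made-up property

# Prefixes are cosmetic; they only make the serialized output readable.
for prefix, uri in (("rdf", RDF), ("rss", RSS), ("xhtml", XHTML), ("ex", EX)):
    ET.register_namespace(prefix, uri)

item = ET.Element(f"{{{RSS}}}item", {f"{{{RDF}}}about": "http://example.org/entry/1"})

# The property to "hang the XML off of"; parseType="Literal" tells an RDF
# parser to treat the children as an XML literal rather than as RDF.
prop = ET.SubElement(item, f"{{{EX}}}body", {f"{{{RDF}}}parseType": "Literal"})

p = ET.SubElement(prop, f"{{{XHTML}}}p")
p.text = "Hello, "
em = ET.SubElement(p, f"{{{XHTML}}}em")
em.text = "world"

xml_text = ET.tostring(item, encoding="unicode")
```

The XHTML stays unescaped and reachable by XML tools, while RDF tools see a single property whose value is a literal.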
Cool, it's gaining some interest. There's pushback on the use of XML, which I can understand. I was trying to stay with XML to run with something similar to what Sam is thinking about. How does this grab you? interface IWeblog { string Name() ; Uri...