It’s just data

Normalize this?

Simon Fell: What I'd like is for the aggregator to normalize the source data (as this is something it's already worked out), so that the plug-in doesn't have to cope with the wide range of rss versions, modules and extensions floating around.

If you are going to normalize, my two cents would be to do things like remove scripts, meta, embed, object tags and the like; resolve all relative links; make sure that the result is well formed, etc.
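As a rough illustration of the cleanup steps listed above, here is a minimal sketch using only the Python standard library. It assumes the item content is already an XHTML fragment; the function name `normalize_fragment`, the `BLOCKED` tag list, and the `example.org` URLs are made up for this example and are not taken from any aggregator. Note that it only rejects input that is not well formed; repairing tag soup would need something like HTML Tidy first.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

# Hypothetical choices for this sketch: which tags to drop, which attributes hold URLs.
BLOCKED = {"script", "meta", "embed", "object"}
URL_ATTRS = ("href", "src")


def local_name(tag):
    """Strip a '{namespace}' prefix, if any, from an ElementTree tag."""
    return tag.rsplit("}", 1)[-1]


def normalize_fragment(xhtml, base_url):
    # ET.fromstring raises ET.ParseError when the input is not well formed.
    root = ET.fromstring(xhtml)

    # Remove unwanted elements, keeping any text that followed them.
    for parent in list(root.iter()):
        previous = None
        for child in list(parent):
            if local_name(child.tag) in BLOCKED:
                if child.tail:
                    if previous is None:
                        parent.text = (parent.text or "") + child.tail
                    else:
                        previous.tail = (previous.tail or "") + child.tail
                parent.remove(child)
            else:
                previous = child

    # Resolve relative links against the base URL of the feed or entry.
    for element in root.iter():
        for attr in URL_ATTRS:
            if attr in element.attrib:
                element.set(attr, urljoin(base_url, element.get(attr)))

    return ET.tostring(root, encoding="unicode")


cleaned = normalize_fragment(
    '<div><script>x()</script>Hi <a href="/a">a</a><img src="b.png"/></div>',
    "http://example.org/blog/",
)
print(cleaned)
```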


Presumably you would have to normalize onto the most descriptive and verbose possibility, which would be RSS 1.0.  All your other suggestions are good too.

Posted by Mark at

descriptive and verbose?

xhtml:body is more descriptive, yet less verbose than content:encoded.

The only changes that Simon would have to make to make this rss 1.0 would be to add an rdf:about attribute and to change the namespace for the item element.

Posted by Sam Ruby at

I'm doing this in Ideagraph - JTidy (port of HTML Tidy) makes the data well-formed XML, then an XSLT stylesheet converts other formats to RSS 1.0 (if the feed is RSS 1.0 to begin with, then obviously the XSLT isn't needed). This means I can maximise the app's support for modules.

A bonus is that I've tried the same approach to scrape non-RSS data, and it seems to work pretty well (though I've not coded this into the app yet).

xhtml:body as a standalone XML element has no meaning in the RSS 1.0 model (has it any in the RSS 2.0 spec?), so it is completely undescriptive!

Posted by Danny at

Danny, SharpReader is an example of the value of mining the content for outbound links.

The purpose of <xhtml:body> is to try to bring such semantics within the reach of more tools.  See million dollar markup for related discussion.

From this perspective, it is the <content:encoded> tag that is completely undescriptive - i.e., it renders the meat of the entry as a blob.

Given this as the goal (to more fully capture and expose the semantics of the item itself), what would be the appropriate RDF serialization for this data?

Posted by Sam Ruby at

Personally I suspect this draws the line on the content-metadata divide - if you need to get at the meat of the content, then it's probably not appropriate to mark it up within a (metadata) feed.

The RDF feed (or similar) provides explicit metadata, including the URL of the item; if you need implicit metadata, HTTP GET it.

Another angle on this might be to say that the content agent knows best - so for the xhtml namespace, let an xhtml-specific tool do the work (outbound links, word count, display, whatever).

I've not tried SharpReader yet, thanks for the ref.

Posted by Danny at

ps. check out the first <description> element in this feed:

http://www.neward.net/ted/weblog/rss.jsp

Posted by Danny at

That's a very fuzzy line.

In my mind, the association between my item and the other items it references is perhaps the most important piece of metadata that I can imagine.

The beauty of XML (and much of the promise of RDF) is to enable one to eliminate the need for domain specific tools.

Writing a full XHTML parser is a big endeavor.  Picking out <xhtml:a> elements from well formed XML is a considerably easier task.

Again, if you have any suggestions on how to make something akin to <xhtml:body> more RDF descriptive, I will comply.

Posted by Sam Ruby at
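
To make the point about picking out <xhtml:a> elements concrete, here is a small sketch with a generic XML parser from the Python standard library. The sample item and the function name `outbound_links` are invented for illustration; the only assumption about the feed is that the entry body is well formed and uses the standard XHTML namespace.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"


def outbound_links(item_xml):
    """Return the href of every xhtml:a element found anywhere in the item."""
    root = ET.fromstring(item_xml)
    return [
        a.get("href")
        for a in root.iter(f"{{{XHTML_NS}}}a")
        if a.get("href")
    ]


# Hypothetical item, made up for this example.
sample = (
    '<item xmlns:xhtml="http://www.w3.org/1999/xhtml">'
    "<title>t</title>"
    "<xhtml:body><xhtml:p>See "
    '<xhtml:a href="http://example.org/one">one</xhtml:a> and '
    '<xhtml:a href="http://example.org/two">two</xhtml:a>.'
    "</xhtml:p></xhtml:body>"
    "</item>"
)
print(outbound_links(sample))
```

No XHTML-specific knowledge is needed beyond the namespace URI and the element name, which is the contrast with an escaped blob that would first have to be unescaped and run through an HTML parser.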

The Content module does cover usage of unencoded XML, but does so in an overengineered way typical of early RSS/RDF development.

The simpler and now more popular content:encoded property came later.

But there's still a glimmer of solution deep in the content module: rdf:parseType="Literal".  The key is that one is looking for a "property" to hang the XML off of.  xhtml:body isn't a "property" so much as it just "is" html.  To make it more RDF descriptive, just make up a reasonable property name and make sure it parses as not-RDF.

Maybe <content:unencoded rdf:parseType="Literal">then <xhtml:em>some</xhtml:em> html</content:unencoded>.

'unencoded' isn't a good name, but I'm not good at names.  Maybe a convention on whether the content should be enclosed in a span or div.

Posted by Ken MacLeod at
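
For readers who want to see Ken's suggestion in context, the sketch below embeds it in an RSS 1.0 item and reads it back with the Python standard library. The property name `content:unencoded` is Ken's placeholder, not part of the Content module; the item URL and the helper `literal_properties` are likewise made up for this example. A real RDF parser would treat the property value as an XML literal; this sketch only checks the `rdf:parseType` attribute and flattens the markup to text.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
CONTENT = "http://purl.org/rss/1.0/modules/content/"
XHTML = "http://www.w3.org/1999/xhtml"

# Hypothetical item: 'content:unencoded' is a placeholder name, not a defined property.
ITEM = f"""<item xmlns="http://purl.org/rss/1.0/"
      xmlns:rdf="{RDF}" xmlns:content="{CONTENT}" xmlns:xhtml="{XHTML}"
      rdf:about="http://example.org/entry/1">
  <title>Example</title>
  <content:unencoded rdf:parseType="Literal">then <xhtml:em>some</xhtml:em> html</content:unencoded>
</item>"""


def literal_properties(item_xml):
    """Yield (property tag, plain text) for each child marked rdf:parseType="Literal"."""
    root = ET.fromstring(item_xml)
    for prop in root:
        if prop.get(f"{{{RDF}}}parseType") == "Literal":
            yield prop.tag, "".join(prop.itertext())


for tag, text in literal_properties(ITEM):
    print(tag, "->", text)
```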

BlogThis

Cool, it's gaining some interest. There's pushback on the use of XML, which I can understand; I was trying to stay with XML to run with something similar to what Sam is thinking about. How does this grab you? interface IWeblog { string Name() ; Uri...

Excerpt from Simon Fell at
