It’s just data

Directory Of Feed Parsers

Bob Aman: Or more accurately, a list of things which claim to parse feeds, but generally do a bloody terrible job of it

Or more accurately a bloody terrible job of rating a bunch of products without any explanation of why he rated any one the way he did (except the general comment about not parsing Atom feeds--but one has no way of knowing from the list which products that comment actually applies to). Maybe after someone has posted a question asking individually why he rated each one as he did, and he has responded to each (and errors in his analysis have been pointed out and corrected), the list will be useful.

Posted by Antone Roundy at

Update: If you read the comments, you’ll now get a pretty good idea of what his rating criteria were.

Posted by Antone Roundy at

Please note that Antone Roundy is the author of CaRP.  Which I rated as “worthless.”

Posted by Bob Aman at

Anyways, now that I’ve figured out how to do table row spanning in Textile, I’m putting up my rationale behind the ratings, just so other feed parser authors don’t get overly upset.

Posted by Bob Aman at

Please note that Antone Roundy is the author of CaRP.  Which I rated as “worthless.”

I foresee a significantly improved version of CaRP released in the near future.  Am I right?  ;-)

Posted by Sam Ruby at

I certainly hope so, as that is kind of the whole point.

Posted by Bob Aman at

Bob Aman: "it’s not like this is some giant competition between parsers here..."

Please note, Bob Aman has reacted similarly in the past ;)

Posted by Robert Sayre at

I foresee a significantly improved version of CaRP released in the near future.  Am I right?  ;-)

Yes, it’s in the works and will indeed be vastly improved in many ways, including Atom 1.0 and 0.3 support; a proper HTTP client with support for compression, ETags, etc.; API access to the data; greater modularity to make it easier to integrate into other systems; etc. May the “near” future be as near as possible.

BTW, Bob, sorry if the comment sounded like sour grapes. Based on your criteria, maybe the current version of CaRP is “worthless”, but I’ve got plenty of users who absolutely love it, and I get a lot of good use out of it on my own site--so to some people, it’s certainly a lot better than “worthless”. If you’d rated CaRP highly, I probably would have thought “cool” and not given the list a second thought. But since you didn’t, I gave it a second thought and realized that your post didn’t originally give much of an idea of what your rating criteria were. And not knowing your criteria, the list wasn’t very useful to a reader.

The choice of words ("bloody terrible job of rating...") was more of an artistic flourish, using your own words, than an expression of the feeling behind my comment. As I said on your site, thanks for the clarification.

Posted by Antone Roundy at

Ha! Robert, Touché!

But yes, to a certain extent, I am the pot calling the kettle black.  Freely admitted.  All software sucks at some point during its evolution.  I know how it feels to have your software criticized, as Tim Bray has criticized mine.  But I also know that when I was still using PHP, I used CaRP (the free version) and I ended up switching very quickly to Magpie, which was the only decent alternative at the time.  And I can’t see any obvious improvement in the state of CaRP from what I remember of it, so, for the time being, I stand by my rating of the free version of it.  Anyone is absolutely welcome to try CaRP out and disagree with me.  (And I can see why someone might if they’re not actually a programmer, and all they’re looking for is a drop-in way of reading RSS.  But that wasn’t my goal, so it’s not reflected in the post.)

I admit I haven’t actually used all of the parsers listed, which is why I invited others to argue with my ratings.  Originally, I had simply placed a ringnaldaesque check mark next to each entry, but I found that being completely subjective and biased was more useful for my purposes.  YMMV.  The only reason the page turned into a blog post at all was so that I wouldn’t miss any parsers that had slipped through my Google searching, and sure enough, there were a couple.

I still don’t consider this to be a “giant competition between parsers”, or at least, it probably shouldn’t be if somehow it has become one.  (If it were a competition, why would I be trying to get the bad parsers to catch up to the rest?)  My original reasons for making the list (which started its life simply as a private Backpack page for my own use) were stated in the comments, and those reasons have absolutely nothing to do with which parsers are better than others, but rather a subjective evaluation of how many of them play nicely with Atom, since I’m contemplating the removal of all of my RSS feeds.

I’ll be the first to point out that, despite being the more useful format, Atom is really, really tough to parse.  So it’s not surprising that there are so few parsers that do it right.

Anyways, there’s no point to defending either the list or the parsers.  Just make everything better.

Posted by Bob Aman at

Atom is really, really tough to parse.

In which ways is that?

Posted by Aristotle Pagaltzis at

In which ways is that?

I’ll answer that: Atom actually defines the expected results.  Complete with conformance tests.

Posted by Sam Ruby at

One uh... wiki-word: XmlNamespaceConformanceTests!

Namely the third test.  If, like most parsers out there, you’re trying to expose the xhtml content as a string, you lose the context of the xml namespaces, so somehow you have to move the namespace declarations around and get them to land in the right places.  Without accidentally running afoul of html sanitization or tidy if you’re doing that kind of thing.  Which I am.

Atom links can be fun as well, especially if you’re trying to be liberal.  Given an item with 5 different link elements, which one do you pick as the One True Alternate Link if you allow for the fact that people can be stupid and may not actually tell you which one is the correct one?  For example:

<feed xmlns="http://www.w3.org/2005/Atom">
  <entry xml:base="http://example.com/articles/">
    <title>Pain And Suffering</title>
    <link href="1.html" type="text/plain" />
    <link href="./2.html" type="application/xml" rel="alternate" />
    <link href="../3.html" type="text/html" rel="alternate" />
    <link href="../4.html" />
    <link href="./5.html" type="application/xhtml+xml" />
    <link href="6.css" type="text/css" rel="stylesheet" />
    <content type="text">
      What does your parser come up with for the main link?
      What's the right value?
    </content>
  </entry>
</feed>

FeedTools, in this case, comes up with “http://example.com/3.html”.
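
For illustration, here’s a toy sketch of one way to pick among those links.  The preference order is my own invention for this example, not FeedTools’ actual algorithm, which weighs more factors than this:

# Toy illustration only; not FeedTools' actual algorithm.
# A link with no rel attribute defaults to rel="alternate" per the
# Atom spec, and HTML-ish media types win over everything else.
# (Resolving the chosen href against xml:base is a separate step.)
TYPE_PREFERENCE = ["text/html", "application/xhtml+xml", nil]

def best_alternate(links)
  candidates = links.select { |l| (l[:rel] || "alternate") == "alternate" }
  candidates.min_by { |l| TYPE_PREFERENCE.index(l[:type]) || TYPE_PREFERENCE.size }
end

links = [
  { href: "1.html",    type: "text/plain" },
  { href: "./2.html",  type: "application/xml",       rel: "alternate" },
  { href: "../3.html", type: "text/html",             rel: "alternate" },
  { href: "../4.html" },
  { href: "./5.html",  type: "application/xhtml+xml" },
  { href: "6.css",     type: "text/css",              rel: "stylesheet" }
]
best_alternate(links)[:href]  # => "../3.html"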

How about this one?

<feed xml:base="http://example.com/articles/index.html">
  <link href="./" rel="self" type="text/html" />
  <link href="../xml/feed.atom" rel="alternate" type="application/atom+xml" />
</feed>

I actually didn’t know what FeedTools would do with that (the algorithm is kinda complicated), so I had to check.  It comes up with “http://example.com/articles/”.  I think that’s a reasonable value to come up with since it should be clear that the author of this hypothetical feed is a complete moron.

Here’s a fun one:

<feed xml:base="http://example.com/articles/../foobar/index.html">
  <link href="../../../../wackypath.html" rel="alternate" type="application/xhtml+xml" />
</feed>

FeedTools gives “http://example.com/wackypath.html” for this.
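
Both of those results are just plain RFC 3986 relative reference resolution, by the way.  If you want to check them without FeedTools, Ruby’s standard URI library reproduces them:

require 'uri'

# xml:base resolution is ordinary RFC 3986 reference resolution.
base = URI("http://example.com/articles/index.html")
(base + "./").to_s
# => "http://example.com/articles/"

base = URI("http://example.com/articles/../foobar/index.html")
(base + "../../../../wackypath.html").to_s
# => "http://example.com/wackypath.html"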

Now, admittedly, some of those aren’t valid Atom files, and even if they are valid, they should probably at least produce warnings in the validator, though I haven’t actually checked.  But Atom gives you a lot more ways in which you can horribly break your feed.  And for a parser that attempts to make allowance for the average person’s idiocy, that can get pretty insane, pretty fast.

Posted by Bob Aman at

Bob, if each of those tests were converted into RSS 1.0 or RSS 2.0, how exactly would they be any easier to parse?

Posted by Sam Ruby at

Just so we’re clear:  The increased difficulty of parsing Atom comes with an incredible advantage as well.  Atom is simply much more expressive, so there’s more you can do with it.  Especially if your tools are cooperative.

In RSS 1.0/2.0, the first test wouldn’t have multiple links.  Simple as that.  You get one link and that’s it, so there’s no ambiguity about which link to direct the user to as the alternate link.

In the second test, again, same thing.  You get one link and that’s it.  Self links might be expressed using extensions in some cases, but in general, a self link isn’t going to be there.  Less expressive, but easier to parse.

In the third test, technically, I don’t think the spec for RSS 2.0 takes xml:base into account at all, IIRC.  So again, easier to parse, but you’re pretty much stuck with absolute URLs, not relative ones.  Of course, FeedTools doesn’t care what the rules say or don’t say (such a rebel!  yikes!) and still uses xml:base if it’s present.

All I can tell you is that, ballpark, I’d estimate there’s about two to three times as much Atom-related code in FeedTools as there is code for any of the other feed types.  It’s kinda hard to tell for sure, because everything is all mashed together (I don’t parse RSS 1.0/RSS 2.0/Atom separately), but definitely, there’s more code involved for handling Atom correctly.  But you are also probably right about the conformance tests.  If RSS had conformance tests of its own, the code for it might start ballooning pretty quick as well, as new and crazy ways of formatting RSS surfaced.  Hard to say.

Posted by Bob Aman at

“Namely the third test.”

The third test is BS. I don’t think that’s “suitable for handling as XHTML”. Well, actually, it is, but it claims not to be. It would be awesome if conformance test writers would read the spec.

Posted by Robert Sayre at

“Namely the third test.”

The third test is bogus. I don’t think that’s “suitable for handling as XHTML”. Well, actually, it is, but it claims not to be. It would be awesome if conformance test writers would read the spec.

Posted by Robert Sayre at

The third test is bogus.

I could buy that, I suppose.  I’m planning on adding an option to FeedTools that will cause it to strip out non-xhtml stuff, and on making that the default instead, because even after I go to all the trouble of setting up all the namespaces correctly, wouldn’t you know it, most of the browsers still end up treating the FooML as an xhtml list item even though it has a non-xhtml namespace attached.  What a pain.

Posted by Bob Aman at

In RSS 1.0/2.0, the first test wouldn’t have multiple links.  Simple as that.  You get one link and that’s it, so there’s no ambiguity about which link to direct the user to as the alternate link.

Unless of course the item also includes a [guid]-as-permalink.  The New York Times RSS feeds used to have this problem, although the feeds now include an isPermaLink=false, which solves the problem.  But if an RSS 2.0 item contains both a [link] and a [guid]-as-permalink that are different, the RSS 2.0 spec does not tell consumers which one to use as the alternate link.  Nor does it tell producers not to do this.

In the third test, technically, I don’t think the spec for RSS 2.0 takes xml:base into account at all, IIRC.

This statement is correct as far as it goes; the RSS 2.0 spec does not define any method for resolving relative links.  Nor does it warn producers not to use relative links.  If an RSS 2.0 item contains relative links, it is impossible to handle it “correctly”, because there is no definition of “correctness”.  The spec does not say that using relative links is an error on the part of the producer, nor does it tell the consumer how to handle relative links if they occur.  As a result, every producer makes different assumptions about how consumers will handle them, every consumer handles them differently, mismatched assumptions frequently produce broken links and broken images, and no one can fix the problem because no one is definitively “wrong”.

From this, you seem to draw the conclusion that Atom is “harder” because the spec actually tells both producers and consumers how to handle the issue.  This conclusion makes no sense to me.  The solution that the Atom spec describes may be difficult to implement correctly, but it is possible.  How can “possible” be harder than “impossible”?

Posted by Mark at

First off, I refuse to be an Atom zealot.  Dirty data is a fact of life.  You clean it up, curse at the author, and move on.  All that matters is the information, not the container it was placed in.  It’s nice when Atom satisfies your inner perfectionist, and it’s annoying when RSS ruins your day, but the ultimate goal (or at least mine, anyways) is to make the whole thing transparent to everyone, to have everyone doing their work at a layer of abstraction beyond either Atom or RSS, so in the end, it only just barely matters what format was used.

The guid-as-permalink thing is a good point.  I’d almost forgotten about it, however, in FeedTools’ case, it ends up working out to like 3 lines of code: if there’s no link found, then check the guid; if it’s a permalink and the url is valid and starts with “http”, fine, use the guid as the link instead of nil.  It may not be defined in the spec what to do with it, and that’s a pain in the neck, but a reasonably plausible course of action can still be taken.  But more importantly, you could have ignored the stupidity that is guid-as-permalink entirely and left the value as nil.  You get less information, but the behavior is still correct according to the spec since the spec didn’t specify a behavior.  And it’s easier.
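
In sketch form, the heuristic looks something like this (a paraphrase with made-up Item and Guid structures, not FeedTools’ real code):

# Paraphrase of the fallback described above, with hypothetical
# Item and Guid structures; not FeedTools' real code.
Guid = Struct.new(:value, :permalink)
Item = Struct.new(:link, :guid)

def alternate_link(item)
  return item.link if item.link
  guid = item.guid
  if guid && guid.permalink && guid.value.to_s.start_with?("http")
    guid.value  # fall back to guid-as-permalink
  end           # otherwise nil: no link found
end

item = Item.new(nil, Guid.new("http://example.com/1.html", true))
alternate_link(item)  # => "http://example.com/1.html"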

How can “possible” be harder than “impossible”?

I suppose it all depends on your mindset.  I choose to think of it a little differently.  The spec might not tell me what to do with the necessary precision, but I can make a guess and my guess will make sense to the users most of the time.  From what I’ve seen, your parser tends to behave similarly.  So it’s not really “impossible”, assuming your goal was “get the information” as opposed to “get the information correctly” and “get all of the information at any cost.”

Like I said before, harder does not mean worse.  *At all.*  And I’m certainly not complaining about it being harder to parse.  I’m just pointing out that it is.  There is no doubt at all in my mind that Atom is the better format.  But RSS also has fewer elements and attributes (useful ones anyways) and a spec that’s really easy to read (i.e., imprecise).  It’s also more generally consistent (perhaps because nobody tries to use RSS the way Tim Bray uses Atom), so you can easily “parse” it with all kinds of terrible, terrible methods such as regular expressions that don’t take context into account whatsoever.  There’s a couple of “parsers” on my list that try to “parse” RSS by grabbing items with something like <item(| .*?)>(.*?)</item>.  Oodles of fun if CDATAs get involved somewhere.  With RSS, that trick might actually work for a significant enough percentage of feeds for the author to fail to realize that he’s a complete moron.  Not so at all with Atom.
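
To make the CDATA failure mode concrete, here’s a contrived item run through a regex of that same general shape:

rss = <<~XML
  <item>
    <title>CDATA fun</title>
    <description><![CDATA[Quoting markup: </item><item> oops]]></description>
  </item>
XML

# The naive "parser": grab everything between <item> tags.
items = rss.scan(%r{<item(?:| .*?)>(.*?)</item>}m)
# The lazy match stops at the "</item>" inside the CDATA section,
# truncating the real item and inventing a bogus second one.
items.size  # => 2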

In a way, what I’m really trying to say is that “RSS is broken” and “RSS is easy” are really very closely related statements.  But there is also nothing stopping you from producing RSS that’s unambiguous and unbroken, that can be parsed correctly by anyone (who isn’t a complete moron).

Posted by Bob Aman at

Dirty data is a fact of life.

Not to get all “do you know who I am”, but... do you know who I am?  I’m the guy who wrote the 3000 test cases for my feed parser that you used to create your feed parser.  You’re preaching to the choir about dirty data, content normalization, and all the rest.

if there’s no link found, then check the guid, if it’s a permalink and the url is valid and starts with “http”, fine, use the guid as the link instead of nil.

A fine heuristic which will get no argument from me, but it strikes me as unnecessarily complex.  Also, certain people might disagree with you...

But RSS also has fewer elements and attributes

I’ve heard this repeated many times, and it’s always struck me as odd, because it’s so obviously and verifiably false.  Not counting “channel” and “item” themselves, the RSS 2.0 specification defines 25 elements: 4 are common to both channel and item, 15 are specific to channel, and 6 are specific to item (and that’s being kind and ignoring the sub-elements of image, textInput, skipHours, and skipDays).  The Atom specification defines 16 elements: 8 are common to feed, entry, and source; 4 are common to feed and source; 4 are specific to entry (ignoring sub-elements of author).

Regardless, a straight element count (or attribute count, which I didn’t bother with) is a poor metric of complexity.  Any database analyst will tell you that the number of tables or columns in a database is irrelevant; it’s the number of “and then”s that matters.  If you need to do 5 inner joins to get the information you want, that’s not a complex query.  If you need to do a single inner join, and then do some special case if BOZOFIELD="X" and then conditionally join in a different table if BOZOFIELD2 != BOZOFIELD3 and then the customer name is either in BOZOFIELD4 or BOZOFIELD5 depending on the value of BOZOFIELD6... that’s complexity.  RSS is complex; Atom, with its common constructs, is designed to reduce this kind of complexity.

Posted by Mark at

There are some things I don’t like about parsing Atom. There are also some things I don’t like about parsing RSS2. There are many, many things I don’t like about parsing RSS+Extension muck. But, you know, most of the harder problems are aspects of core XML specifications.

That said, I think you’re being a little dramatic. For example, I have no idea what you mean by “you’re trying to expose the xhtml content as a string, you lose the context of the xml namespaces, so somehow you have to move the namespace declarations around and get them to land in the right places.” I place all XHTML in the default namespace (where it belongs per XHTML 1.0), and if it needs to pop out as Atom again, a hardcoded div with an xmlns declaration works great.

What are you using to parse the XHTML?

Posted by Robert Sayre at

Namely the third test.  If, like most parsers out there, you’re trying to expose the xhtml content as a string

Well, that’s the problem right there! SAX events or tree fragments would be more appropriate.

The third test is BS. I don’t think that’s “suitable for handling as XHTML”. Well, actually, it is, but it claims not to be. It would be awesome if conformance test writers would read the spec.

If an XHTML UA sees unknown elements, it must process the element’s content—that is, CSS absent, just render the text content.

Even if your view is that the content in the test case is not “suitable for handling as XHTML”, it is no excuse to treat FooML elements as XHTML elements of the same local name or not to treat prefixed XHTML elements as XHTML elements.

Would you consider XHTML+MathML as not “suitable for handling as XHTML”?

Sometimes to see if UAs are bogus, you have to feed them carefully crafted BS.

Posted by Henri Sivonen at

But there is also nothing stopping you from producing RSS that’s unambiguous and unbroken, that can be parsed correctly by anyone (who isn’t a complete moron).

Speak for yourself, and the other people who have no need for a less-than character in titles.

Posted by Phil Ringnalda at

How would the non-complete-moron format the word détente in an RSS 2.0 feed (and others containing diacritical marks) so that they display properly?

Posted by Rogers Cadenhead at

Henri, skipping a “ul” element in the default namespace is probably never what the user wants. I’m going to have to think about your MathML point, though. I think the “suitable for handling” test should be decided by the HTML renderer, and Gecko will do MathML.

I quit believing in XML Namespaces at some point and XML people think I’m nuts when I try to ignore them.

Posted by Robert Sayre at

Dirty data is a fact of life.

Not to get all “do you know who I am”, but... do you know who I am?

Heh.  I know who you are, but this is a public conversation of sorts, and I’m not writing only for you to read.  “Dirty data is a fact of life” was a statement meant for everyone’s eyes, not just yours.  It’s not like that was a private email.  I know I’m preaching to the choir with you, but the rest of the world hasn’t quite caught on apparently, judging by the state of most of the parsers on my list.

Also, certain people might disagree with you...

That is entirely inevitable.  That’s what the find_node and find_all_nodes methods are for (run arbitrary xpath, play with the node however you like).  And the links method (exposes all links within an array of simple link classes).  And the id method (grab the guid separately).  If someone disagrees with my output, they can reparse that element themselves.  It’ll probably only be a line or two extra.  And there’s nothing stopping them from redefining the link method if they so choose.  It’s not really recommended since other stuff in my code calls it, but the option’s there with Ruby.  Yay for monkey patching.

But RSS also has fewer elements and attributes (useful ones anyways)

But RSS also has fewer elements and attributes

You left off “(useful ones anyways)”.  RSS keeps around pointless cruft like textInput or cloud that no one ever retrieves, hence the parenthetical statement that you omitted.  And don’t forget that Atom is a lot more attribute friendly than RSS.  That is important, and I very intentionally included the word “attributes” in my statement there.  And we’re still not even taking into account the fact that many of Atom’s attributes require some level of context.  Parsing something within a context is always more difficult than if you can parse it context-free.

You have a good point about the “and thens”.  But how many of them are there really in RSS?  There’s a couple that are glaringly obvious such as guid-as-permalink, but I really don’t think the list is all that long, unless you try to include Dave Winer’s “intent” as part of the RSS spec.  Atom’s common constructs lead to some problems though.  Like, for instance, titles.  In Atom, titles are a text construct, which means they can contain HTML (unlike RSS).  But if you happen to be using, say, a GUI list box to display a list of entries, and that listbox is incapable of displaying HTML, then you’re forced to strip out any HTML.  Which turns out to be more difficult than it looks, especially in Ruby where we’re stuck with stupid as dirt handling of numeric entities.  The point being that the ability to stick a huge table in your title is not necessarily a feature you want to have.  And then Atom’s embracing of namespaces (a good thing, remember) introduces its own set of challenges.
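
For what it’s worth, the flattening itself can be sketched in a few lines.  This is a rough illustration; in current Ruby, CGI.unescapeHTML handles named and numeric references, which was a much less pleasant job back then:

require 'cgi'

# Rough sketch: drop tags first, then decode entities, so escaped
# markup in the text survives as literal text.
def plain_text_title(html_title)
  CGI.unescapeHTML(html_title.gsub(/<[^>]*>/, "")).strip
end

plain_text_title("<b>D&#233;tente</b> &amp; other <i>niceties</i>")
# => "Détente & other niceties"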

We could go on and on about which is more complex, what sucks and what doesn’t, what complexity actually is and so on.  What I stand by, though, is that the code for handling Atom in my parser took a significantly longer period of time to write, and makes up a noticeably larger portion of the total code than the part for RSS.  I had a reasonable parser for RSS 1.0 and 2.0 and bits and pieces of Atom up and running within a matter of a few weeks, back last June, and much of that time was spent on things which affected Atom as well, like say, just dealing with HTTP.  Probably at least 80% of the time I’ve spent on it since then has been on things directly related to Atom and not RSS.  Now, mind you, some of my perceptions may be skewed here, because Atom 1.0 was released during that time period, and that required a bunch of recoding to fix some flawed assumptions, among other things, but 80% is still a really, really big chunk of time when you consider that I haven’t spent all that much time on code that isn’t FeedTools and we’re quickly approaching my parser’s first birthday.

Besides any of that, consider this:  If Atom was really as simple as you say it is, why is Snarfer the only Atom consumer that passes all of the Atom conformance tests?  Admittedly, that can’t be a fair comparison at all, because RSS doesn’t actually have conformance tests, but I think it’s still an interesting question.

Posted by Bob Aman at

“But if you happen to be using, say, a GUI list box to display a list of entries, and that listbox is incapable of displaying HTML, then you’re forced to strip out any HTML.”

Bah! Nonsense. If you have to display an RSS item that doesn’t have a title in one of those, you need to excerpt the description and strip the HTML. For example, Dave Winer’s description elements often begin with little thumbnails of internet conference personalities such as Sam Ruby and Marc Canter. I fixed the bug in Thunderbird, which didn’t even mention Atom.

Posted by Robert Sayre at

Rogers:

<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0">
  <channel>
    <title>D&#233;tente?</title>
    <description>What's so difficult about utf-8?</description>
    <link>http://nowhere.com/</link>
    <item>
      <title>Maybe I'm missing something (very possible)...</title>
      <description>
        But I don't really see why you need to put &amp;eacute; in your title
        when the utf-8 character or a numeric entity will suffice.
      </description>
      <guid isPermaLink="false">
        urn:uuid:76845e80-a982-11da-993c-00080286226b
      </guid>
    </item>
  </channel>
</rss>


Posted by Bob Aman at

Bob: try subscribing to Phil.

Posted by Sam Ruby at

Man, if I reply to all of my unexpected detractors, I’m going to run afoul of Sam’s spam throttle again.  And never accomplish anything else with my life.

Henri:

Namely the third test.  If, like most parsers out there, you’re trying to expose the xhtml content as a string

Well, that’s the problem right there! SAX events or tree fragments would be more appropriate.

Not gonna argue this one, but I couldn’t disagree more.  The user of the parser shouldn’t have to care whether someone made their content available as embedded xhtml, encoded html, plain text, or base64 something or other.

Phil:

But there is also nothing stopping you from producing RSS that’s unambiguous and unbroken, that can be parsed correctly by anyone (who isn’t a complete moron).

Speak for yourself, and the other people who have no need for a less-than character in titles.

Says the guy who uses them purely to break people’s feed readers.  :-)

Fair enough.  The spec doesn’t say one way or the other on titles, so I suppose my statement was overreaching a bit.  Perhaps a lot.  But my understanding was that it was not Dave Winer’s intention for the title element to contain HTML.  As such, I chose to have FeedTools treat title elements in RSS as plain text, just like Atom does by default.  Of course, as always, I could be horribly mistaken.  It happens.

Posted by Bob Aman at

At first glance, that’s good advice, Bob. Your test feed works in both Internet Explorer 7 and Bloglines. Everyone using entity codes in elements to represent characters ought to use numeric codes instead.

Somebody ought to put that in the, uh, spec.

Posted by Rogers Cadenhead at

Sam:

I do subscribe to Phil and have for um, ages.  He breaks NetNewsWire quite handily.

FeedTools currently normalizes titles to html (despite my reservations about html in titles, I leave it up to others to strip it out if they need to), and as such, I believe that

feed = FeedTools::Feed.open("http://weblog.philringnalda.com/feed/")
feed.entries[0].title  # => "&lt;Unwelcome&gt;"
feed.entries[1].title  # => "&lt;Boy, this nest sure is roomy&gt;"

this is correct.

Posted by Bob Aman at

Bob: Bloglines and Yahoo disagree with you, both in different ways.

Posted by Mark at

But my understanding was that it was not Dave Winer’s intention for the title element to contain HTML.  As such, I chose to have FeedTools treat title elements in RSS as plain text, just like Atom does by default.  Of course, as always, I could be horribly mistaken.  It happens.

Exhibit A.
Exhibit B.

But the real question is can you tell these two apart?

Posted by Sam Ruby at

Man, if I reply to all of my unexpected detractors, I’m going to ... never accomplish anything else with my life.

You must be new here.

Posted by Mark at

Mark:

Sounds to me like that makes Bloglines and Yahoo broken.  But like I said in the feed description, you could always just use straight up utf-8.

You must be new here.

I’d laugh if it wasn’t so painfully true.

Rogers:

If it works, it works because I didn’t double-escape, in order to avoid the ambiguity of whether or not titles can contain HTML.  The numeric entity was only because it’s now hanging out at the XML level instead of the HTML level.

Sam:

Ahhh yes, Dave contradicting himself again.  I love it when he does that.  I’m still inclined to go with RSS titles being plain text, if for no other reason than that it matches the behavior of Atom when the type attribute is omitted.  Perhaps also because Dave seemed more insistent in Exhibit A.  :-)

Posted by Bob Aman at

you could always just use straight up utf-8.

Been there, done that.

Posted by Sam Ruby at

Bob:

Says the guy who uses them purely to break people’s feed readers.  :-)

Well, except when he’s trying to make feeds for the Mozilla Bugzilla. (Of course, you know that.)

Posted by Aristotle Pagaltzis at

Bob Aman:

even after I go to all the trouble of setting up all the namespaces correctly, wouldn’t you know it, most of the browsers still end up treating the FooML as an xhtml list item even though it has a non-xhtml namespace attached.  What a pain.

Which “most browsers”? Firefox, Opera and Safari get the standalone XHTML version of the test (which is what an Atom reader should pass to a browser engine) right. Let me guess: you are using the text/html code path instead of the XML code path.

The user of the parser shouldn’t have to care whether someone made their content available as embedded xhtml, encoded html, plain text, or base64 something or other.

I agree. Therefore, the Atom library should give XHTML using a proper XML API (SAX events, pull parser stub or tree fragment) to the app. If you pick only one format you pass on, it makes no sense to pick HTML over XHTML, because converting e.g. XHTML+MathML to HTML leads to data loss.

Robert Sayre:

Henri, skipping a “ul” element in the default namespace is probably never what the user wants.

I thought Atom was all about specs (including normative references) and not about presenting conjectures about what the user supposedly wants.

I’m going to have to think about your MathML point, though. I think the “suitable for handling” test should be decided by the HTML renderer, and Gecko will do MathML.

And those who use WebKit should consider embedded SVG.

I quit believing in XML Namespaces at some point and XML people think I’m nuts when I try to ignore them.

I don’t like them, either, but we are now stuck with them.

Posted by Henri Sivonen at

“I thought Atom was all about specs”

It is. The test violates a SHOULD. Clients will do many things similar to what I described, and they’ll be able to point at the SHOULD as their excuse. That’s why SHOULD sucks. :)

Posted by Robert Sayre at

Aristotle:

Of course, you know that.

Of course.  Hence the emoticon to indicate humor.  :-)  Like I said, I’ve been subscribed to Phil for ages.

Henri:

Let me guess: You are using the text/html code path instead of the XML code path.

Yes and no.  I’ve only written one program using FeedTools, thus far, that does the whole web feed reading thing, and that was the tutorial code for how to set up FeedTools with Rails.  It used application/xhtml+xml by default, and obviously it works in most everything but Internet Explorer.  However, I also wrote a relay of sorts for cleaning up feeds, and if it is pointed at your test feed, it supplies this as the content:

<content type="html">
  &lt;ul&gt;
    &lt;li&gt;This is an XHTML list item. If it is not rendered as a list item, the namespace support of the client app is broken.&lt;/li&gt;
  &lt;/ul&gt;

  &lt;ul xmlns='http://hsivonen.iki.fi/FooML'&gt;
    &lt;li&gt;This is not an XHTML list item. If it is rendered as a list item, the namespace support of the client app is broken.&lt;/li&gt;
  &lt;/ul&gt;
</content>

If you put that in various feed readers, they will all handle it differently, to no one’s surprise.  And sometimes they just pass it on through unchanged, in which case the browsers then get to handle it all differently, depending on whether application/xhtml+xml was the content type.  I have no way of demanding that they render it as application/xhtml+xml and, as I’m sure you’re well aware, judging by the fact that you wrote those test cases, there are more clients that break on xhtml embedded in Atom than on encoded html, which is why, for the moment, I tend to prefer encoding it.

I do, however, have a curiosity question, and honestly, I don’t know the answer.  What is the correct, expected rendering behavior for unknown markup embedded within XHTML?  Should a browser just ignore it and its content entirely?  Or display the contents, but as if they were not within any markup at all (that’s what Firefox appears to be doing with your stand-alone test), or...?  Because I’m considering forcing a prefix onto unknown markup so that the not-so-excellent consumers will be less likely to confuse things like FooML for XHTML, or alternately (with a config option), stripping all unknown markup entirely.

Posted by Bob Aman at

<grumble>

NetNewsWire handles Sam’s comments feed wrong and displays those &lt;/&gt;'s in my previous comment incorrectly.  Looks like it unescapes them for some reason, but leaves the surrounding <content type="html"> element alone.

Posted by Bob Aman at

Robert Sayre:

The test violates a SHOULD. Clients will do many things similar to what I described, and they’ll be able to point at the SHOULD as their excuse.

No they cannot. Even if the test case was non-conforming, it does not give UAs a license to do whatever they please. In particular, they are not allowed to violate Namespaces in XML and XHTML 1.0. (Namespaces in XML does not allow a UA to guess the namespace or change the namespace, and XHTML 1.0 says the UA must process the content (i.e. render descendant character data) of unknown elements.)

Bob Aman:

However, I also wrote a relay of sorts for cleaning up feeds, and if it is pointed at your test feed, it supplies this as the content

Since dataloss is inevitable, the more correct way would be to ignore the unknown elements at the conversion level. Like this:

<content type="html">
  &lt;ul&gt;
    &lt;li&gt;This is an XHTML list item. If it is not rendered as a list item, the namespace support of the client app is broken.&lt;/li&gt;
  &lt;/ul&gt;

  
    This is not an XHTML list item. If it is rendered as a list item, the namespace support of the client app is broken.
  
</content>

What is the correct, expected rendering behavior for unknown markup embedded within XHTML? 

The XHTML 1.0 spec says: “If a user agent encounters an element it does not recognize, it must process the element’s content.”

Should a browser just ignore it and its content entirely?

No.

Or display the contents, but as if they were not within any markup at all (that’s what Firefox appears to be doing with your stand-alone test), or...?

Yes. The unknown markup is parsed into the DOM, but since there are no applicable UA style sheet rules, the default display: inline; applies.

Because I’m considering forcing a prefix onto unknown markup so that the not-so-excellent consumers will be less likely to confuse things like FooML for XHTML, or alternately (with a config option), stripping all unknown markup entirely.

I think stripping it when converting to HTML is the best thing to do.

Posted by Henri Sivonen at

“XHTML 1.0 says the UA must process”

Actually, RFC4287 references XHTML Modularization, and the User-Agent conformance rules apply only to XHTML Family Document types, which Atom isn’t. Thanks W3C HTML Activity! What is the WHAT-WG behavior here? I’ll follow whatever they say to do.

Posted by Robert Sayre at

Well! I am shocked to report that a similar XHTML test case works in IE and Mozilla!

Posted by Robert Sayre at

My head hurts.

Henri:

Despite what you say, I have a hard time believing that stripping the tags from the content is the best course of action.  Especially in my case, because internally, it’s often highly ambiguous whether I’m dealing with XHTML or HTML or HTML that’s been tidied into XHTML.  Plus I’m loath to just lose information unless there’s a really good justification for it.

Posted by Bob Aman at

Robert Sayre:

Actually, RFC4287 references XHTML Modularization, and the User-Agent conformance rules apply only to XHTML Family Document types, which Atom isn’t. Thanks W3C HTML Activity!

The W3C HTML WG likes to talk about things like “Strictly Conforming” documents implying that there is non-strict conformance (presumably DOCTYPEless mixing of namespaces), but they don’t define what non-strict conformance is.

What is the WHAT-WG behavior here? I’ll follow whatever they say to do.

“The rules for parsing XHTML documents into DOM trees are covered by the XML and Namespaces in XML specifications, and are out of scope of this specification.” (Source) Rendering is left to CSS, which means that the WHAT WG behavior is what I described earlier: The unknown markup is parsed into the DOM, but since there are no applicable UA style sheet rules, the default display: inline; applies. (And most other properties inherit by default.)

Bob Aman:

Despite what you say, I have a hard time believing that stripping the tags from the content is the best course of action.  Especially in my case, because internally, it’s often highly ambiguous whether I’m dealing with XHTML or HTML or HTML that’s been tidied into XHTML.  Plus I’m loath to just lose information unless there’s a really good justification for it.

HTML has no notion of namespaces. If you copy an element from a non-XHTML namespace into HTML output without a prefix and the local name overlaps with an HTML element name, for both theoretical and practical purposes you’ve changed the meaning of the document in a way not warranted by any spec.

If you copy prefixed stuff in there, you are putting stuff in HTML that is not conforming there. Surely conversion to a target language should not include stuff that is not conforming in the target language.

If you drop unknowns replacing them with their children, there is dataloss, but the dataloss is necessary for XHTML to HTML conversion.

If you want to avoid dataloss and want to expose only one format, it makes sense to convert HTML to XHTML instead (using a proper text/html parser—not using source level replacements).
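
In Ruby, a sketch of that drop-the-unknowns conversion might look like this (one possible approach with a namespace-aware parser; not what any particular consumer actually does):

require 'nokogiri'

XHTML_NS = 'http://www.w3.org/1999/xhtml'

# Replace every element outside the XHTML namespace with its own
# children, so only descendant text and real XHTML markup survive.
def strip_unknown_elements(xml)
  doc = Nokogiri::XML(xml)
  # Document order guarantees parents are handled before children.
  doc.xpath('//*').each do |el|
    href = el.namespace && el.namespace.href
    el.replace(el.children) unless href == XHTML_NS
  end
  doc.to_xml
end

puts strip_unknown_elements(<<~XML)
  <div xmlns="http://www.w3.org/1999/xhtml">
    <ul><li>An XHTML list item.</li></ul>
    <ul xmlns="http://hsivonen.iki.fi/FooML">
      <li>Not an XHTML list item.</li>
    </ul>
  </div>
XML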

Posted by Henri Sivonen at

Henri:

Under the circumstances, I think the only viable option for me is to allow the programmer to choose with a configuration option.  I write all my web apps with XHTML, but someone else might not have Tidy installed (Tidy support is optional) and might display stuff with HTML 4.01, and I’ve got no way of knowing for sure.  So the only good solution is to let the user pick and choose, and default to the most preferable method, which is XHTML everything.

Posted by Bob Aman at
