It’s just data

Escaped Markup Considered Harmful

Norman Walsh: There is clear evidence that the escaped markup design will spread if it isn't checked. If it spreads far enough before it's caught, it will become legacy. Some vendors will be forced to continue to support this abomination by simple economics. And it won't be their fault, it'll be ours for not killing the virus before it could spread.

While I am certainly sympathetic to that view, my current leanings are simply that any such escaping needs to be clearly identified as such.


<created> is also a user-entered field.  Maybe we should allow a <created mode="escaped"> so producers can signal that they didn't perform input checking on that field, too.

Posted by Ken MacLeod at

RE: Escaped Markup Considered Harmful

It's easy for folks like Norm to sit in their ivory tower and complain about escaped markup in content but just like the typical ivory tower rant  he complains with proposing a solution.

Escaped markup in content works. I dislike the fact that it has to be done but given the paucity of tools that support generating valid XHTML either we say people shouldn't put markup in content or we suck it up and learn to deal with it.

PS: Testing the "Sent Items" folder on RSS Bandit. Wish me luck.

Message from Dare Obasanjo at

I'm curious. In the various arguments against escaped markup, what have the various people against it proposed as an alternative?

PS: http://www.25hoursaday.com/rssbandit_sent.gif - Yeah Baby, it worked.

Posted by Dare Obasanjo at

Dare, I presume you meant "withOUT proposing a solution".

It looks to me like he DID propose a solution: base64 encoding.

My guess is that Norm is just expressing a frustration not unlike the one that you recently expressed in a slightly different, but related problem domain.

As I said, my leanings aren't as radical as Norman's appear to be.  My biggest issue with escaped content is that it needs a way to be marked as such.

What is the proper way to express a title of <title>?

Posted by Sam Ruby at
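
To make the title question concrete, here is a small Python sketch using only the standard library. The element names and sample strings are illustrative assumptions, not taken from any feed discussed here. Once an XML parser has unescaped the text, a literal title of <title> and a title carrying escaped markup look the same to the consumer unless something like a mode attribute says which was meant.

```python
import xml.etree.ElementTree as ET

# Two hypothetical entries: one whose title is the literal text "<title>",
# and one whose title carries escaped HTML markup.
literal = ET.fromstring("<entry><title>&lt;title&gt;</title></entry>")
markup = ET.fromstring("<entry><title>&lt;b&gt;News&lt;/b&gt;</title></entry>")

literal_text = literal.find("title").text
markup_text = markup.find("title").text

# The parser hands back plain strings in both cases; nothing in the
# parsed result says whether the angle brackets are text or markup.
print(repr(literal_text))
print(repr(markup_text))
```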

Do we need more than one freely available, cross-platform tool (HTML Tidy) to do input markup validation?

Posted by Ken MacLeod at

RE: Escaped Markup Considered Harmful

Sam,
The main problem with Norm's complaint is that Norm's suggestion asks the producers and consumers to do additional work (encoding and decoding HTML content to and from base64) for zero gain. He complains that sticking escaped markup in an element subverts the intention of the schema author who expected you to put text in there.

Guess what? To most RSS aggregators the content of description/dc:description/content:encoded and the like is just a string that is passed to the Web browser for rendering.

Message from Dare Obasanjo at

Can anyone explain to me how he can be against escaped markup, but not against base64 encoding?  I really have no idea what the benefit of one over the other could possibly be.  Both take a string that could look a lot like XML, well-formed or otherwise, and turn it into a form that will slip through parsers without harm.

Posted by Patrick Lioi at
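
The comparison in the comment above can be sketched in a few lines of Python (standard library only). The fragment and the <content> wrapper are hypothetical; the sketch only shows that both approaches carry a non-well-formed string through an XML parser and return it unchanged, which is the equivalence being asked about.

```python
import base64
import html
import xml.etree.ElementTree as ET

# Hypothetical user-entered HTML that is not well-formed XML.
fragment = "<p>Tom & Jerry<br>unclosed"

# Option 1: escape the markup characters and embed the result as text.
escaped_doc = "<content>" + html.escape(fragment, quote=False) + "</content>"

# Option 2: base64-encode the UTF-8 bytes and embed the result as text.
b64 = base64.b64encode(fragment.encode("utf-8")).decode("ascii")
base64_doc = "<content>" + b64 + "</content>"

# Both documents are well-formed, and both give the fragment back unchanged.
from_escaped = ET.fromstring(escaped_doc).text
from_base64 = base64.b64decode(ET.fromstring(base64_doc).text).decode("utf-8")
```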

Everything considered harmful

Perhaps the things we consider harmful say more about us than they do about them.... [more]

Trackback from dive into mark

at

Dare, I don't know about most aggregators, but NewsGator, SharpReader, and NetNewsWire at a minimum attempt to resolve relative URLs.

How long will it be before the practice of simply passing a string to the web browser is recognized as a Trustworthy Computing issue?

Posted by Sam Ruby at

What's this "no benefits" stuff?  If you guarantee that your feed is well-formed, then when someone provides opinions or information in a non-Latin language, they will have a chance of getting through un-squashed by incompetent or bigoted software, and a good chance of being rendered correctly by modern browsers.

If not, not.

That may not seem like a benefit to you, but it does to me.

Posted by Tim Bray at

What's this "no benefits" stuff?  If you guarantee that your feed is well-formed, then when someone provides opinions or information in a non-Latin language, they will have a chance of getting through un-squashed by incompetent or bigoted software, and a good chance of being rendered correctly by modern browsers.

Feeds with escaped markup are well-formed.
<blockquote>
I don't know about most aggregators, but NewsGator, SharpReader, and NetNewsWire at a minimum attempt to resolve relative URLs.

How long will it be before the practice of simply passing a string to the web browser is recognized as a Trustworthy Computing issue?
</blockquote>
It is definitely easier to process embedded markup if it is XML than if it is tag soup. In RSS Bandit, I always convert embedded markup in feeds to XML so I can resolve relative URLs, strip potentially malicious tags and compile lists of outgoing links. However I am not under the assumption that it is easy for content producers to emit well-formed markup especially when this markup may have been entered by users.

Posted by Dare Obasanjo at
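
As a rough illustration of the kind of processing described above (this is not RSS Bandit's code, and it uses Python's tolerant HTML parser rather than a conversion to XML), the sketch below collects links from tag soup and resolves them against a base URL. The class name, sample content, and base URL are all assumptions made for the example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect href targets from tag soup, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))


# Hypothetical escaped content after the XML parser has unescaped it.
content = '<p>See <a href="/archives/1">this</a> and <A HREF=notes.html>that'
collector = LinkCollector("http://example.org/blog/")
collector.feed(content)
collector.close()
links = collector.links
```

Stripping unwanted tags would follow the same pattern, with the handler deciding which start and end tags to re-emit.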

"Feeds with escaped markup are well-formed. "

Yeah, except that they don't have the MIME headers that they came with  originally so you don't know the character encoding so god help the browser when you hand it text containing non-ASCII and can't tell it what the encoding is.  If the content is mode="xml" you don't have this problem.

Mind you, I'm not arguing for removing the mode="escaped" option, I just want to make it clear that it's a poor second-rate alternative.  And I am arguing for ruthlessly tossing non-WF feeds on the floor.

Posted by Tim Bray at

Tim: Uh, but base64-encoded markup does have the mime-headers that the markup came with originally?

Anyway, so just add a mimetype attribute which is required in mode="escaped".. everybody happy now?

Posted by Tomas at

Sam - Did you mean to refer to Norm as "Normal" Walsh?  I think he goes by Norman.  Or is that a pun on normalized?

Posted by Simon St.Laurent at


Yeah, except that they don't have the MIME headers that they came with  originally so you don't know the character encoding so god help the browser when you hand it text containing non-ASCII and can't tell it what the encoding is.  If the content is mode="xml" you don't have this problem.

Not a fat load of good that does me given that I can't have an XML document with different encodings. So either way I either

a.) Use one encoding in my document and feed that to the browser

or

b.) Have some attribute on the content that describes the encoding then feed that to the browser.

Neither of these is any different whether content is escaped markup or XML.

Posted by Dare Obasanjo at

Simon: Just a typo.  Corrected.  Thanks!

Posted by Sam Ruby at

Obviously I'm not making this simple enough.  To quote Larry Wall, "an XML document knows what encoding it's in."  That is to say, either (a) the document is broken or (b) you know unambiguously exactly what Unicode code-point each character is.  Weirdly enough, i18n doesn't appear in the list of design goals for XML, but turns out to have been one of the most important practical contributions.

Ask someone who lives every day in Japanese or Russian or Greek how often they have to manually tell a browser what encoding a page is in (regularly) and how often it still doesn't get it right (less often, but too often).

There are a lot of steps in the chain between initial input and eventual rendering for a human, particularly when there's syndication involved.  For each and every one of those steps which is XML, you can be confident that non-ASCII characters are not going to get mashed.  Otherwise you're going to have to be careful to pass along just the right metadata and still be sure that things will go wrong too often. Of course, that will only bother those foreigners in unimportant parts of the world, so who cares.

It just seems wrong not to acknowledge that if your data is unescaped XML, it's a better citizen of the world, and that people who persist in escaping stuff to dodge well-formedness checking are with a high degree of likelihood arranging some severe i18n pain for themselves in the future. 

Further, parsers being "liberal" (i.e. trying to guess what broken pseudo-XML is trying to say) means that once they get outside the domain of ASCII they are actually quite likely being "wrong".

Posted by Tim Bray at
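
The point that an XML document carries its own encoding can be checked with a short Python sketch (standard library only; the sample title is an assumption). The same text serialized in two encodings parses to identical characters, while bytes that contradict the default encoding are rejected instead of being silently misread.

```python
import xml.etree.ElementTree as ET

# The same hypothetical title serialized two ways, each with a declaration
# that matches the bytes actually used.
latin1_doc = b'<?xml version="1.0" encoding="ISO-8859-1"?><title>caf\xe9</title>'
utf8_doc = b'<?xml version="1.0" encoding="UTF-8"?><title>caf\xc3\xa9</title>'

latin1_text = ET.fromstring(latin1_doc).text
utf8_text = ET.fromstring(utf8_doc).text

# Latin-1 bytes with no declaration are read as UTF-8 by default and are
# rejected as not well-formed.
try:
    ET.fromstring(b"<title>caf\xe9</title>")
    undeclared_ok = True
except ET.ParseError:
    undeclared_ok = False
```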

Dare Obasanjo

Tim,
I'm not sure what post you are replying to but I assume it isn't mine since as I pointed out there is no difference between escaped markup and XML when it comes to passing the correct encoding of the content to the web browser.

I'd be very interested to see you describe a scenario where XML in content provides encoding information in a better way than could be done with escaped markup in content.

I posit that whatever mechanism you use would work the same for both scenarios.

Message from Dare Obasanjo at

Norm's article = "if you're going to do bad, sweep it under the carpet where I can't see it"

Just because it's hard to ensure user-entered markup is valid XML, that doesn't mean we're too stupid to encode it properly, thank you very much. They're separate issues. Escaped content in a feed is UTF-8 with a few of its characters escaped. Alright, you need to convert everything to UTF-8 first, but you have to do that with inline content too. Like Dare said, what's the difference? (also base64 content will have no declared or inherited character encoding so how is it an improvement?)

I also think the security risk with RSS is minimal. People choose which feeds to subscribe to, but they'll click any link from Google, so how is it more risky? Obviously public sites that put RSS content into their pages have to be careful, but for private consumption it's no problem.

Posted by Graham Parks at

OK, thinking out loud here.  I just grabbed a page from a large commercial European site (fnac.com), it turns out to be encoded in windows-1252.  Thus the apostrophes are encoded as 0x92, 0222.  Of course U+0092 is the C1 escape "PU2" whatever the hell that is, nothing like an apostrophe.

So... when someone jams this into <content mode="escaped">, what happens?

Posted by Tim Bray at

They should either convert to Unicode, as well-formed XML dictates, or declare they're using the Windows character set in the XML processing directive.

(Pedantry: As UTF-8, 0x92 isn't U+0092, as 0x80 to 0xBF represent a continuing byte. So it's just a disconnected 6 bits from somewhere in the middle of a multibyte character. There'd have to somehow be a 0b110xxxxx preceding it for parsing not to fail outright)

Posted by Graham Parks at
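
The byte-level claims in the last two comments can be checked directly. This is a minimal Python sketch using the standard codecs; only the single byte 0x92 comes from the discussion, the variable names are illustrative.

```python
raw = b"\x92"  # the apostrophe byte as found in a windows-1252 page

# Decoded with the right code page, the byte is a right single quotation mark.
as_cp1252 = raw.decode("cp1252")

# Treated as Latin-1, the same byte becomes the C1 control U+0092.
as_latin1 = raw.decode("latin-1")

# Treated as UTF-8, a lone continuation byte cannot be decoded at all.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Converting properly yields a three-byte UTF-8 sequence.
as_utf8_bytes = as_cp1252.encode("utf-8")
```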

Right.

If they just jam the 0x92 byte into the <content mode='escaped'> it will fail because as Graham points out it's not UTF-8, unless they're smart enough to put the code page in the XML declaration and the receiving parser happens to handle that and all their other non-ASCII is in that code page, or unless someone uses a "liberal" parser in which case when they hand it to a browser, it won't render correctly.

Alternatively, they could base64 it, in which case when they decode it and hand it to a browser, it won't render correctly.

And if they're going to go to the trouble of actually figuring out what the right Unicode character is, then they might as bloody well just buckle down and do the rest of the XML work, which is generally easier than this Unicode stuff. 

Put another way, if there's to be any hope of reliable internationalized operation, neither mode="escaped" nor mode="base64" are going to help you.

And anyone who in A.D. 2003 ships code that is not internationalized is on morally shaky ground as well as being a lousy businessperson.

Posted by Tim Bray at

Graham said


They should either convert to Unicode, as well-formed XML dictates, or declare they're using the Windows character set in the XML processing directive.


which I agree with. Tim said


Put another way, if there's to be any hope of reliable internationalized operation, neither mode="escaped" nor mode="base64" are going to help you.


but neither does mode="xml". You still end up having to do what Graham said anyway. I'm still waiting for you to show how XML is any better than escaped content. All you've shown is that escaped content and XML face the same set of problems in this case.

Posted by Dare Obasanjo at

How is fishing through HTML tag soup to create something that resembles XML easier than doing a clearly defined transform? Or in PHP:

???? vs echo utf8_encode(htmlspecialchars($htmlcontent));

The tag soup part belongs in HTML renderers. I think insisting on XHTML would be the quickest way to kill the format, as almost everyone will be using escaped mode.

Posted by Graham Parks at
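
For readers who do not use PHP, here is a rough Python analogue of the one-line transform above. It assumes, as PHP's utf8_encode does, that the input bytes are ISO-8859-1; the function name and sample input are hypothetical, and html.escape is only an approximate stand-in for htmlspecialchars (it also escapes single quotes).

```python
import html


def escape_for_feed(raw_bytes, source_encoding="iso-8859-1"):
    """Decode user-entered HTML, escape markup characters, re-encode as UTF-8."""
    text = raw_bytes.decode(source_encoding)
    return html.escape(text, quote=True).encode("utf-8")


# Hypothetical user-entered tag soup in ISO-8859-1.
html_content = b'<p class="x">caf\xe9 & cr\xe8me'
feed_bytes = escape_for_feed(html_content)
```

If the source bytes are in some other code page, such as windows-1252, the source_encoding argument has to say so, which is exactly the metadata problem raised earlier in the thread.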

Dare, my gut feeling is that a feed-producing tool which can't guarantee that it is creating well-formed XML (which is essentially the mode="escaped" use case, isn't it?) can't be trusted to convert some Windows codepage to UTF-8 either.

That gut feeling also tells me that if you're using XML-aware tools to produce your feed -- whether that be building a DOM and serializing it, or spitting SAX events into a Cocoon pipeline, or using XSLT to transform some application-specific XML into the Feed format -- you're building your feed producer on a base of libraries that already get Unicode right.

That's why I would expect mode="encoded" to be more likely to contain character encoding errors than mode="xml". This is only true if, as I have assumed, people use encoded mode when they are doing ad-hoc print statements to produce the feed XML, and use inline when they know they're using reliable tools that get character encoding and XML right.

Posted by Adam Fitzpatrick at

That gut-feeling doesn't make sense. The non-XML here is HTML entered by end users and has nothing to do with the tool or its writers. I don't think there's a valid argument that tool vendors who prefer not to attempt to create XHTML from user-entered markup must therefore know nothing about XML.

Posted by Graham Parks at

That's not what I'm saying.

If you (as a tool developer) treat entry content as "just a string", the burden is generally on you to get the character encoding issues right. If you do ensure that your content is XML (whether that be by using something like tidy or rudely barking an error message when the user has the temerity to hit "Submit" without making sure it's well-formed XML first), you can make encoding issues largely Someone Else's Problem by letting your XML libraries deal with it.

Again using Cocoon as an example, suppose you're generating a P/E/A/W feed by transforming some internal XML representation of a collection of entries. Your transformer reads from and sends out char arrays. You don't care what encoding the entries were written in, and you don't care what encoding will ultimately be sent out in the response, and you don't have to deal with escaping the characters XML treats specially. You just read and write sequences of characters and somebody else gets the messy stuff right for you.

Posted by Adam Fitzpatrick at

What can XML libraries do that utf8_encode(htmlspecialchars()) can't?

Posted by Graham Parks at

Lots of gut feels.  Let's inject some facts:

feed1.  Counter example to "almost everyone will be using escaped mode".

feed2.  Counter example to the assertion that there is anybody "insisting on XHTML".  Also a counter example to any perception that correct encoding can't be done with escaping.

Separate from these facts, there is some hope that simultaneously making escaping explicit and providing an option to inline the markup directly will encourage people to actually think about these issues instead of blindly applying strcat.  However, these are merely hopes at this point.  This is not an assertion that can be conclusively proven either way.

If this does result in more well formed XML feeds, then there clearly is a benefit.  Anybody out there care to quantify the cost so that a proper cost/benefit analysis can be done?

Posted by Sam Ruby at

I've been thinking about this a lot over the last two days, and the thing I've realised is that Norm's argument only stands up if we were using escaping to embed XML within XML. But we're not, we're embedding HTML, which is just data that sorta kinda resembles it some of the time. He tries to sidestep that by stating as fact people shouldn't even be using tools that don't produce XHTML (is he calling me a tool?), as if that's a debate that's been won.

And if you start thinking down that path, you quickly come to the conclusion that Norm's real beef with RSS and Atom is that they aren't being used to force everyone to use XHTML. His real message is "How dare people use XML without putting all their weight behind my XHTML agenda". Which is just bullshit.

Posted by Graham Parks at

Graham, although an XHTML profile is an obvious choice for an arbitrary inline markup language to be used in Atom, whether or not that markup language is XHTML is a red herring.  It could (tho not suggesting) be an Atom specific, safely transported, well specified, minimal markup language.

The question is more simply: is there a broader range of benefit to using an XML-parseable markup language in Atom properties than in using an opaque string that can only be handed off to HTML browsers or widgets to properly parse and display?

The question applies not only to <content>, but also potentially to <title>, <summary>, <tagline>, and other properties that allow displayed text.

There is a very strong argument for <title>, <summary>, and <tagline> to be only plain text because they shouldn't need to be, or can't easily be, parsed as arbitrary HTML.  Whereas with inline XML, it's as easy as as_string() or accumulating character data to ignore the markup if necessary.

Posted by Ken MacLeod at
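
The "accumulating character data" approach mentioned above might look like this in Python with the standard library's ElementTree. The sample entry is an assumption made for illustration; a real Atom entry would also carry namespaces.

```python
import xml.etree.ElementTree as ET

# Hypothetical entry with inline (unescaped) markup in its summary.
entry = ET.fromstring(
    "<entry><summary>Feeds with <b>bold</b> and <i>italic</i> text</summary></entry>"
)
summary = entry.find("summary")

# Accumulate character data only, ignoring the element boundaries.
plain = "".join(summary.itertext())
```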

Why can't summaries have markup?  Example1 and Example 2.  I've heard requirements for bold and italics in titles.

Posted by Sam Ruby at

Good grief over escapes in XML

There is also a good discussion of Norman's article on Sam Ruby's blog in a post titled Escaped Markup Considered Harmful. In particular, the back-and-forth between especially Tim Bray and Dare Obasanjo on character encoding issues in escaped content.... [more]

Trackback from the iCite net development blog

at

Escaped Markup Must Stop

Today I worked on a Atom feed for the developer version of KAYWA by using the <content type="application/xhtml+xml" mode="xml" xml:lang="en"> (inspired by Mark Pilgrim's atom 0.2 maximal. Inside of it there is only pure aka ...

Pingback from Bitflux Blog :: Escaped Markup Must Stop

at

Escaped Markup Could Stop Now, If...

Today I worked on a Atom feed for the developer version of KAYWA by using the <content type="application/xhtml+xml" mode="xml" xml:lang="en"> (inspired by Mark Pilgrim's atom 0.2 maximal. Inside of it there is only pure aka wellformed XHTML....

Excerpt from Bitflux Blog at

comment on Escaped Markup Considered Harmful

The question is more simply: is there a broader range of benefit to using an XML-parseable markup language in Atom properties than in using an opaque string that can only be handed off to HTML browsers or widgets to properly parse and display?...

Excerpt from Ken MacLeod at

comment added

I found some more background information on this topic. Actually, the RSS 2.0 spec suggests that you put HTML in the <content> element by escaping it... That’s quite controversial: Norman Walsh’s Escaped Markup Considered Harmful Sam Ruby’s...

Excerpt from The Trac Project: Ticket #2580: RSS feed validation in 0.9.3 at
