EscapedHtmlDiscussion


Proposals

A. Escape everything.

Votes: AaronSw, BrentSimmons, GarrettRooney, DeveloperDude, BenAdida, LeonardoHerrera

[BrentSimmons, RefactorOk] I don't like in-line: I prefer escaping or CDATA.

Reasons:

1. The chunk of data in between <content> tags really is a chunk of data. I can't think of any earthly reason why the parser wants to deal with it as anything but a single chunk. It may be that you'd want to parse it later, somewhere else in your app, for some reason, but not when you're building an array of weblog entries. You just want the string.

2. It needs to be well-formed, and that's never going to happen. Not now and not in five years. It may happen 99% of the time, as publishing tools get good at it, but anyone parsing this stuff is always going to have to deal with non-well-formed content. It's going to be a huge headache. That's already a problem with RSS, but this will make it worse. (The percentage of non-well-formed feeds will go way up.)

In fact, what I'd probably do is pre-process the feed, wrap inline stuff as CDATA, before passing it to an XML parser. This is a stunningly ugly thing to have to do.

But I'd do it for both reasons #1 and #2: because I want the content as a string, not as a tree, and because I want to deal with non-well-formedness.
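The pre-processing step described above could look roughly like the following. This is a minimal sketch, not anyone's actual implementation: the names `CONTENT_RE` and `wrap_content_in_cdata` are hypothetical, and the regular expression is deliberately naive (it assumes `<content>` elements are neither nested nor self-closing).

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, deliberately naive pattern: one <content ...> ... </content> pair.
CONTENT_RE = re.compile(r"(<content\b[^>]*>)(.*?)(</content>)", re.DOTALL)

def wrap_content_in_cdata(feed_text):
    """Wrap each <content> body in CDATA so the parser returns it as one string."""
    def repl(match):
        open_tag, body, close_tag = match.groups()
        if body.lstrip().startswith("<![CDATA["):
            return match.group(0)  # already wrapped, leave untouched
        # "]]>" may not appear inside CDATA, so split it across two sections
        safe = body.replace("]]>", "]]]]><![CDATA[>")
        return open_tag + "<![CDATA[" + safe + "]]>" + close_tag
    return CONTENT_RE.sub(repl, feed_text)

# Tag soup that would make a plain XML parser fail (mismatched <b>).
feed = '<entry><content type="text/html"><p>Unclosed <b>tag soup</p></content></entry>'
entry = ET.fromstring(wrap_content_in_cdata(feed))
body = entry.find("content").text
print(body)
```

After wrapping, the parser no longer sees the markup inside `<content>` at all, which is exactly the "string, not tree" behaviour asked for in reason #1.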

[GarrettRooney RefactorOk] Given what BrentSimmons says about in-line being hell for consumers of the data, and the annoyance value of requiring the content to be escaped, it seems like CDATA sections are the lesser evil here. I don't particularly love the idea of wrapping all my content in CDATA, but I like it a lot more than the ideas of reading escaped HTML in one case, or trying to require correct markup and then dealing with the fact that there will always be incorrect markup out there in the other.

[BenAdida] We should learn from the End-to-End design principle of Internet protocols: keep the format simple, don't discriminate between potential uses. Specifically, an Echo reader program's job should be to parse Echo stuff (hopefully assuming very little other than a super-simple Echo XML schema) and pass the payload up to the calling application. The only way to stay simple is to have one implementation method, and the only way to be content-neutral is to quote everything. Anything else means we think we know exactly how Echo will be used from now until the end of time, and that's just ensuring we will miss opportunities to build new, interesting things in the future. We must be prepared for what we haven't yet invented.

[LeonardoHerrera] I want the ability to write a basic non-echo reader without deep thinking. I'm a lazy man, give me a CDATA element, I'll extract the contents and throw it to a display window. Give me encoding and type attributes, and I'm all set: <content type="holograph/animated-no-artifacts" encoding="bork-bork2"> will be ignored by my parser, putting a placeholder instead (something like "this content cannot be displayed by this little program").
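A minimal sketch of such a "lazy" reader, assuming Python's standard XML parser. The set of displayable types and the accepted `encoding` values are assumptions made for illustration; nothing on this page fixes them.

```python
import xml.etree.ElementTree as ET

PLACEHOLDER = "(this content cannot be displayed by this little program)"
KNOWN_TYPES = {"text/plain", "text/html"}   # assumption: what this reader can show
KNOWN_ENCODINGS = {None, "escaped"}         # assumption: no attribute, or "escaped"

def display_text(content_elem):
    """Return the string to show for one <content> element."""
    ctype = content_elem.get("type", "text/plain")
    encoding = content_elem.get("encoding")
    if ctype not in KNOWN_TYPES or encoding not in KNOWN_ENCODINGS:
        return PLACEHOLDER
    # CDATA and entity-escaped bodies both arrive here as plain character data.
    return (content_elem.text or "").strip()

known = ET.fromstring('<content type="text/html"><![CDATA[<p>Hi</p>]]></content>')
unknown = ET.fromstring(
    '<content type="holograph/animated-no-artifacts" encoding="bork-bork2">xyz</content>'
)
print(display_text(known))
print(display_text(unknown))
```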

B. Support inline, escaped, and base64:

Note: base64 is another content encoding, but it is intended primarily for binary, non-*ML media types.

Votes: MarkPilgrim, JoeGregorio, JeremyGray, SamRuby, TimothyAppnel, DareObasanjo, ChrisWilper, ArveBersvendsen, TomasJogin, DiegoDoval, TimBray, KenMacLeod, DaveWarnock, UcheOgbuji, LachlanCannon

[MishaDynin, RefactorOk] This is great for xhtml. What is the default mode for text/plain? "xml" doesn't make sense, and default mode shouldn't depend on type.

C. B, but with mode="xml" to be required

D. No, I want something else entirely


Considerations


Discussions Elsewhere

See also content, this thread, MimeContent.

Should HTML content be escaped (nee quoted) or inline?

Discussion

Proposal AaronSw: Everything is escaped.

Commentary Bray: (at greater length below.)

[TimBray] Sam and I are converging: I think he's captured the three interesting cases, but I'm not 100% that elements are the right level. The following feel a bit more idiomatic to me, but now I'm off to sleep on it:

Withdrawn Proposals

Proposal SamRuby: withdrawn (see this thread). There are three forms of expressing content:

Proposal DareObasanjo: There are two forms of escaping: I'd like to withdraw this proposal [DareObasanjo]

Proposal KenMacLeod:

[SjoerdVisscher, RefactorOK] Re. encoding="none" instead of encoding="escaped": The term "escaped" might give the programmer the idea that he has to escape something. But unless you are creating XML without an XML library, you don't have to do anything.


Discussion Summary

Escaped:

Inline:

Determination based on content type:


Further Discussion

[TimBray RefactorOk] Having read all this, it seems that it's not that complicated. One of the two following is true:

Refusing to deal with the second case is attractive but stupid, as we can't ignore the legacy problem. Forcing all markup to be encoded to cater to the legacy is bad design. Thus it seems to me like the only plausible solution is one of the following

Let's pick one of these and move on.

[KenMacLeod] Is this the same or different from <content type="text/html" encoding="none"> (which means that standard XML escaping is used) and <content type="text/xhtml" encoding="literal"> (which means that XHTML content is parsed).

One part that confuses me is I seem to see a case where type="text/plain" but it's "still" text/html being sent and the reader somehow has to figure that out. That may be what em="true|false" means, but if you know that it's escaped markup, why not use type="text/html"?


Example, CDATA (1):

  <content type="text/html"><![CDATA[
     <p>Hello, <em>weblog</em> world! 2 &lt; 4!</p>
  ]]></content>

Example, escaping (2):

  <content type="text/html">
     &lt;p&gt;Hello, &lt;em&gt;weblog&lt;/em&gt; world! 2 &amp;lt; 4!&lt;/p&gt;
  </content>

Example, inline with default namespace (3):

  <content type="application/xhtml+xml">
    <body xmlns="http://www.w3.org/1999/xhtml">
     <p>Hello, <em>weblog</em> world! 2 &lt; 4!</p>
    </body>
  </content>

Example, inline with namespace (4):

  <content type="application/xhtml+xml" xmlns:x="http://www.w3.org/1999/xhtml">
     <x:p>Hello, <x:em>weblog</x:em> world! 2 &lt; 4!</x:p>
  </content>

[ArveBersvendsen, RefactorOk] I've provided four different samples. My personal view is that (1) or (2) should be a MAY only where the content-type is text/plain or text/html. For application/xhtml+xml, they should be noted as SHOULD NOT. For application/html+xml, either of (3) or (4) should be marked as MUST, and (1) or (2) as MUST NOT.

[TimBray] Your (3) doesn't work, because there's no <content> element in the HTML namespace. You could put a <div> in or something.

[ArveBersvendsen] I changed this to use <body>, which I believe is cleaner.
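To make the difference between the samples concrete, here is a small sketch using Python's standard `xml.etree.ElementTree` (the choice of parser is an assumption; any conforming XML parser should behave the same way). Samples (1) and (2) reach the application as the same string, while sample (3) reaches it as a namespaced subtree.

```python
import xml.etree.ElementTree as ET

cdata_form = '''<content type="text/html"><![CDATA[
     <p>Hello, <em>weblog</em> world! 2 &lt; 4!</p>
  ]]></content>'''

escaped_form = '''<content type="text/html">
     &lt;p&gt;Hello, &lt;em&gt;weblog&lt;/em&gt; world! 2 &amp;lt; 4!&lt;/p&gt;
  </content>'''

inline_form = '''<content type="application/xhtml+xml">
    <body xmlns="http://www.w3.org/1999/xhtml">
     <p>Hello, <em>weblog</em> world! 2 &lt; 4!</p>
    </body>
  </content>'''

# (1) and (2): the parser hands back character data; the markup is just text.
cdata_text = ET.fromstring(cdata_form).text.strip()
escaped_text = ET.fromstring(escaped_form).text.strip()
print(cdata_text == escaped_text)

# (3): the parser hands back element nodes in the XHTML namespace.
inline = ET.fromstring(inline_form)
first_child_tag = inline[0].tag
print(first_child_tag)
```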

[TimBray RefactorOk] Summing up a conversation between myself and AaronSw (I think) here: I find interpreting escaped content kludgy and horrible; among other things, readability is severely impaired. On the other hand, if you allow for escaped tag soup, then people can use Echo to archive their last five years of ill-formed postings. So it seems that at least the option of escaping content should be allowed. On the other hand, I predict that in five years time, Echo will still be very popular, and at that time, the notion of generating non-well-formed markup will feel archaic and barbaric, and people will really wonder why we are making them add this ugly level of overhead. My tentative conclusion is that we should have an optional markup="escaped" attribute on the <content> element, for when you need to do this. But the default action should be to do the right thing, which is to generate well-formed content. Now we can restart the argument :) Other voices?

[AaronSw] How is readability impaired with CDATA?
[TimBray] Specific answer: it's basically not OK to mandate the use of CDATA. For details, see section 4.3 of RFC 3470 (if you poke around, you can find an easier-to-read HTML version), also known as IETF Best Common Practice #70. By the way, anyone who's planning to do anything serious with XML should study that document, it contains a lot of highly concentrated wisdom.

[DaveWarnock, RefactorOk] I suggest we allow XHTML only. Where your content is not well formed XHTML then you use a standard snippet of XHTML which includes a url for your content. If that url returns the correct content-type then normal standards will control what the client does with it. This should be simple for people with loads of older content while keeping the standard very clean for the long term.

[BillHumphries, RefactorOk, OutrightDeletionNotOk] While I'd like to be stern and strict, I'd recommend that the default payload is assumed to be XHTML unless there's an encoding attribute. If the content is not XHTML, escaping, while regrettable, is preferable. When working with XML returned from a certain large search engine company, the descriptions of pages are in a node as escaped HTML, and one can write:

<xsl:value-of disable-output-escaping="yes" select="foonode" />

[SeanMcGrath, RefactorOk] XML's three special characters that need to be escaped can be worked around by adding elements called amp, lt and gt. I use this a lot in XML vocabularies because of the "2 to the n-1 ampersand" escaping problem (I call it 'ampersand attrition' in an article I wrote on the subject: http://www.itworld.com/nl/xml_prac/07042002/). Obviously it does not help in literal chunks of markup, but I suggest it is worth considering for WF payloads.
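The escaping problem mentioned above shows up as soon as text passes through more than one layer of XML escaping. A minimal illustration, assuming Python's `xml.sax.saxutils.escape` as the escaping routine (the sample text and the number of layers are arbitrary):

```python
from xml.sax.saxutils import escape

text = "2 < 4"
layers = [text]
for _ in range(3):
    # Each additional layer of embedding re-escapes the previous layer's "&".
    layers.append(escape(layers[-1]))

for depth, value in enumerate(layers):
    print(depth, value)
```

Every layer adds another `amp;` to each escaped character, and a consumer has to know exactly how many layers to undo.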

[MSM, RefactorOk] I see the possibility of syndicating a lot more than just textual content that can be delivered in XML form. I rather prefer what's shown in EchoExample (as of this writing), with maybe encoding added (base64 or ), and perhaps allow for specification of content length (so a feed consumer can skip anything it doesn't want to deal with on a size basis), and also an optional external reference that can be used when the EchoFeed consumer doesn't know what to do with the content fragment itself (or prefers to pass fetching/processing off to some other app or OS service). I guess where I differ from most of the above is that I'd always put the content as CDATA, as the content pieces themselves are meant, I always thought, to be atomic units. After the EchoFeed consumer unpacks the feed, it operates on the content units as it sees fit -- it may display them inline in the feed consumer's application display, or may show them as attachments, may discard them if they fail some security measure, etc.

<content type="text/html" xml:lang="en-us" encoding="UTF-8" length="48">
  <![CDATA[ <p>Hello, <em>weblog</em> world! 2 &lt; 4!</p> ]]>
</content>

<content type="img/x-png" encoding="base64" length="31415" href="http://example.com/foo.png">
<![CDATA[
  ... imagine base64 data here ...
]]>
</content>

<content type="img/svg+xml" encoding="UTF-8" length="2112" href="http://example.com/bar.svg">
<![CDATA[
  ... imagine SVG document here ...
]]>
</content>

[LeonardoHerrera RefactorOk DeleteOk] Ugh, that "length" attribute scares me. I can envision a non-stop flow of badly implemented "length" attributes; thus, nobody will rely on that datum. Is it really necessary to include it? If not, I would prefer not to mention it at all, even if it is optional.

[JeremyGray RefactorOk] -1 for the length attribute being redundant given that in the example all of the options and their data have already been delivered, so why use anything but the largest and most pre-prepared version. An additional -1 for stepping even closer to the edge of the slippery slope called feature creep.

[AsbjornUlsberg, RefactorOk] I don't like CDATA'ing all content, nor do I like the length attribute. If we allow inline binary data, the length attribute is necessary, but I still dislike the idea of embedding images in XML. It's better to refer to them externally then, imho.


[TimBray RefactorOk] I take what Brent says seriously, but forcing authors to escape everything has a pretty severe price in both readability and writeability. Brent is pretty convincing that the consumers would be happier with everything escaped, so it's a matter of whether we care more about making things easy for hand readers/writers or for authors of processing code. Not a slam-dunk either way.

[AaronSw, RefactorOk] Huh? CDATA sections make quoted practically just as easy to read and write as literal. Can you seriously claim that:

<foo><![CDATA[bar]]></foo>

is so much harder to read and write than

<foo>bar</foo>

that we should make consumers go to horrible kludges?

[KenMacLeod, RefactorOk] It may be a development style. Far back into my SGML days playing with DocBook and mapping DocBook structure elements and mixed content into Perl objects, I would stash the mixed content as DOM-like (grove) objects. Today, I swap in a DOM-building SAX handler whenever I recognize I'm gonna have literal XML to preserve. People using DOM parsers, XPath, and XSLT already have the content as an element node. I believe it's these latter folks that benefit the most from literal or inline XML.

[JoeGregorio] As an aggregator builder I do see a use for both the inline and escaped content, mostly because Aggie uses a web browser as its output format. Stripping 'insecure' tags and attributes from HTML is easier, and more reliable, using XPath+DOM than it is using regexes. I am now using regexes but will soon switch to using Tidy on the content and then strip its output via XPath+DOM. This is where the 'choice' for CMS vendors gets involved, and why I think we need a solution that allows both in-line and escaped. If tools like TypePad can produce well-formed XHTML all the time then their content can be inlined and when I read a feed from them I don't have to do the 'Tidy' step first, and that is a big savings in processing time. Obviously this is a concern because I am using a browser as the aggregator display device, and if you're not displaying in a web browser, then your mileage may vary.
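A rough sketch of tree-based stripping on well-formed inline XHTML. It uses Python's `xml.etree.ElementTree` iteration rather than XPath+DOM, and the list of unsafe tags, the "attributes starting with `on`" rule, and the helper names are illustrative assumptions, not a complete security filter.

```python
import xml.etree.ElementTree as ET

XHTML = "{http://www.w3.org/1999/xhtml}"
# Illustrative list only; a real filter would more likely use an allow-list.
UNSAFE_TAGS = {XHTML + "script", XHTML + "iframe", XHTML + "object"}

def remove_keep_tail(parent, child):
    """Remove child but keep the text that followed it inside parent."""
    idx = list(parent).index(child)
    tail = child.tail or ""
    if idx == 0:
        parent.text = (parent.text or "") + tail
    else:
        prev = parent[idx - 1]
        prev.tail = (prev.tail or "") + tail
    parent.remove(child)

def strip_unsafe(root):
    for parent in list(root.iter()):          # snapshot, since the tree is edited
        for child in list(parent):
            if child.tag in UNSAFE_TAGS:
                remove_keep_tail(parent, child)
    for elem in root.iter():
        for name in [n for n in elem.attrib if n.lower().startswith("on")]:
            del elem.attrib[name]             # drop onclick, onload, ...
    return root

doc = ('<body xmlns="http://www.w3.org/1999/xhtml">'
       '<p onclick="evil()">Hi<script>alert(1)</script> there</p></body>')
cleaned = strip_unsafe(ET.fromstring(doc))
paragraph = cleaned[0]
print(paragraph.text, paragraph.attrib)
```

None of this works on tag soup, which is the point being made: with escaped, possibly ill-formed content a Tidy-like repair pass has to run first.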

[DareObasanjo] As an aggregator author all I can say is that Brent Simmons speaks for himself not for everyone who's building an aggregator. For instance, RSS Bandit allows users to create XSLT stylesheets that are used as themes over the content provided by the blog (screenshot1 and screenshot2) and I'd much prefer to consume well-formed XHTML and pass that to the XSLT engine as opposed to running the equivalent of HTML Tidy on the content every single time.

[HenriSivonen] Whether escaped (payload as string) is more convenient than inline (payload as subtree) largely depends on the interface between the syndication format processor and the content renderer. If I've understood correctly, the interface in NetNewsWire is that the RSS component hands the payload as a tag soup string to the Cocoa tag soup renderer. However, if one were to implement an aggregator over Mozilla (for example), the interface could conveniently accept a document tree. In such a case, the Echo component could pass a namespaced DOM subtree to the XHTML renderer.

Parsing tag soup is hard, so if the renderer interface wants a tree, parsing is non-trivial. On the other hand, serializing a tree to a string is easy. Therefore, in order to serve both kinds of renderer interfaces, it would make more sense to choose the payload as subtree model for the wire format instead of the payload as string model.

However, from the feed producing point of view, it is of course easier to spit out tag soup as string instead of producing proper XHTML document trees.
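The "serializing a tree to a string is easy" direction can be sketched in a few lines, assuming Python's `xml.etree.ElementTree`. The helper name `payload_as_string` is hypothetical; it turns an inline (subtree) payload back into a string for a renderer that only accepts strings.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"
# Ask the serializer to write XHTML with a default namespace, not a prefix.
ET.register_namespace("", XHTML_NS)

def payload_as_string(content_elem):
    """Serialize the children of <content> back into one markup string."""
    parts = [content_elem.text or ""]
    for child in content_elem:
        # tostring() also emits the text that follows the child (its tail).
        parts.append(ET.tostring(child, encoding="unicode"))
    return "".join(parts).strip()

content = ET.fromstring(
    '<content type="application/xhtml+xml">'
    '<body xmlns="http://www.w3.org/1999/xhtml">'
    '<p>Hello, <em>weblog</em> world!</p></body></content>'
)
markup = payload_as_string(content)
print(markup)
```

The opposite direction, turning an arbitrary tag soup string into a tree, has no comparably short equivalent.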

[DonPark DeleteOk RefactorOk] Allow me to make a proposal which might not be as flexible as everybody wants, but is simple enough to support legacy issues as well as leaving the door open for the future.

  1. we support existing mass of RSS contents through 'legacy' type.

    <content type="legacy">
       same as RSS 2.0 <description> value
    </content>
     
  2. we support XHTML contents through 'xhtml' type

    <content type="xhtml">
       unescaped XHTML document fragment
    </content>
     

    Implication: aggregators that support 'xhtml' type must pre-define XHTML character entities

  3. we support plain text contents through 'text' type

    <content type="text">
       plain text content
    </content>

  4. we support general XML contents through 'xml' type.

    <content type="xml">
       unescaped XML document fragment
    </content>

  5. Other type values are reserved except for those that start with "x-"

    All aggregators must support 'legacy', 'xhtml', and 'text' types. Rest are optional.

    Unknown or unsupported content types are to be ignored.
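A consumer following the proposal above could dispatch on the `type` attribute roughly as follows. This is a sketch under assumptions: the function name and the shape of the return value are invented for illustration, and the proposal itself does not say how each kind of payload is handed to the application.

```python
import xml.etree.ElementTree as ET

def handle_content(elem):
    """Return (kind, payload) for a supported <content> type, else None."""
    ctype = elem.get("type")
    if ctype in ("legacy", "text"):
        # String payloads: tag soup for 'legacy', plain text for 'text'.
        return (ctype, (elem.text or "").strip())
    if ctype in ("xhtml", "xml"):
        # Tree payloads: hand over the element so its children can be walked.
        return (ctype, elem)
    return None  # unknown, reserved, or unsupported "x-" types are ignored

legacy = ET.fromstring('<content type="legacy">&lt;p&gt;old post&lt;/p&gt;</content>')
xhtml = ET.fromstring('<content type="xhtml"><p>new post</p></content>')
other = ET.fromstring('<content type="x-hologram">...</content>')

print(handle_content(legacy))
print(handle_content(xhtml)[0], handle_content(xhtml)[1][0].tag)
print(handle_content(other))
```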

[ZhangYining RefactorOk] I am for:

[RichardTallent RefactorOk] What we have are two orthogonal issues:

New tool developers should not be stymied because of broken, unescaped HTML. Escaping requires only one or two commands on most modern platforms, but parsing FrontPage-esque HTML is a major undertaking. Put the burden on the publisher, not the consumer.

[AsbjornUlsberg] +1.000.000

[RolandWeigelt] Hmm, maybe a stupid question (and maybe I did miss something really obvious)... People seem to prefer CDATA encoding vs. entity encoding. What I don't see mentioned is that CDATA (which I generally like, BTW) has one big flaw: The text inside CDATA must not contain "]]>". Entity encoded text can be encoded over and over again. But how do you encode e.g. an XML example that contains a CDATA section using CDATA?
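One commonly used workaround for the "]]>" restriction is to end the CDATA section in the middle of the forbidden sequence and immediately open a new one. A minimal sketch in Python (the helper name `to_cdata` is hypothetical):

```python
import xml.etree.ElementTree as ET

def to_cdata(text):
    """Wrap text in CDATA, splitting every "]]>" across two sections."""
    return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>"

# An XML example that itself contains a CDATA section.
sample = "<foo><![CDATA[bar]]></foo>"
wrapped = "<content>" + to_cdata(sample) + "</content>"
round_tripped = ET.fromstring(wrapped).text

print(wrapped)
print(round_tripped == sample)
```

The parser concatenates adjacent CDATA sections, so the consumer gets the original text back, but the producer does have to remember to do the split.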

[HenriSivonen] All the examples show only a subset of an (X)HTML document embedded in content. That is, html, body and head have been omitted. However, the Atom 0.2 snapshot doesn't mention that (X)HTML is special in the sense that mandatory parts of an (X)HTML document may be omitted when embedded in Atom.

Resolution

Considerable thought has gone into the discussion of this issue. Coincidentally, there is a third draft of the Necho RFC. What needs to happen (e.g., What needs to be added? What needs to be taken away?) to move forward to resolution of this topic?


[JeremyGray] Someone felt it appropriate to delete my comment, one not marked with either RefactorOk or DeleteOk. This wouldn't really bother me if the changes made to the poll in any way reflected my comments, which were (and still are):

To clarify the above even further, my point regarding 'accurate terminology' had nothing to do with the words 'yay' or 'nay'. It had to do with the misuse of the words 'quoted' and 'inline', neither of which are accurate terminology for the concepts being discussed on this wiki page. Further, the two presented choices, even once considered using accurate terminology, don't reflect the options that have actually been discussed here. At this point I'd like to see the poll rebuilt into a new poll or set of polls by an individual who was actively involved in the discussion on this page.

[JonathanSmith] Sorry, I was the one who deleted your comment and also made the cut & paste mistake with the poll. Later I realized my mistake and hoped that someone would refactor it. Instead, I came back to your criticism, so I made the changes I thought were appropriate... In the spirit of wiki I would invite you to do likewise.


See also content


CategoryMetadata, CategoryModel, CategorySyntax