It’s just data

Making encoding explicit

From the current iteration of EchoExample:

<content type="application/xhtml+xml" xml:lang="en-us">
  <p xmlns="...">Hello, <em>weblog</em> world! 2 &lt; 4!</p>
</content>

<content type="text/html" xml:lang="en-us">
  <![CDATA[<p>Hello, <em>weblog</em> world! 2 &lt; 4!</p>]]>
</content> 

<content type="text/plain" xml:lang="en-us">
  <![CDATA[ Hello, _weblog_ world! 2 < 4! ]]>
</content>

Looking at this, I am troubled by the implicit knowledge of encoding that is required.  Less than signs in XHTML are encoded once.  The same thing needs to be encoded (or wrapped in CDATA) twice for HTML.
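To make the mismatch concrete, here is a small Python sketch (not part of the EchoExample) that runs simplified versions of the three elements through the standard library's xml.etree.ElementTree. The `text_of` helper and the shortened sample strings are illustrative assumptions.

```python
import xml.etree.ElementTree as ET

XHTML = '<content type="application/xhtml+xml"><p>2 &lt; 4!</p></content>'
HTML = '<content type="text/html"><![CDATA[<p>2 &lt; 4!</p>]]></content>'
PLAIN = '<content type="text/plain"><![CDATA[2 < 4!]]></content>'

def text_of(xml_source):
    """Return all character data inside the root element after XML parsing."""
    return "".join(ET.fromstring(xml_source).itertext())

xhtml_text = text_of(XHTML)  # markup already parsed: one decode was enough
html_text = text_of(HTML)    # still HTML source: a second decode is pending
plain_text = text_of(PLAIN)  # already the final text
```

After a single XML parse the XHTML and plain-text forms yield the finished text, while the HTML form still contains `&lt;`. Nothing in the markup itself tells a consumer that; it has to know it from the type.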

How many times should title be decoded?  Aggregators today generally have to guess.  And such guesses have caused problems in the past.

We can't simply rely on <![CDATA[ ]]> as this may be already required in order to singly encode the data as XML to begin with.

I'd like to see a single element, say <encoded>, be introduced that makes such encoding explicit.  We can decide where such elements are allowed. We might even want to explore alternate encodings, such as base64Binary for handling things like pictures in archives.

Example:

<content type="application/xhtml+xml" xml:lang="en-us">
  <encoded>Hello, &lt;em&gt;weblog&lt;/em&gt; world!</encoded>
</content>

The less-than signs:  I don't see this as a problem.  XML is a special case where you have namespaced the contents.  Therefore, it's subject to different rules.

If you want consistency, then don't use namespaces for this.  Use CDATA just like the other encodings.  Then, for each format, all you need to care about is that it is a proper CDATA representation of the content.

I'm not sure I understand the problem with title.  You are wondering what happens when the actual title within the content needs to be represented with CDATA, and then has to be wrapped as CDATA again?  I thought the correct approach to that would be to write the whole thing as CDATA, or, alternatively, nest the CDATA (you can do that can't you?)

Posted by Jim at

Sam,
When talking about XML the term "encoding" is overloaded. It took me a few seconds to realize what you were talking about. Anyway, properly dealing with the various encoding issues in RSS Bandit has probably been one of the biggest causes of consternation on my part so I feel obliged to provide input. :)

Your suggestion of an encoded element seems like a hack. Currently I don't have to encode any of the elements in the <content/> element. Why are you trying to complicate issues?

Discussions of the EchoExample follow

In the application/xhtml+xml example it depends on the XML API I use to determine whether I should encode the content before I shoot it off to a browser. If I use the XmlTextReader in the .NET Framework I have to encode it but if I'm starting from an XmlDocument object I don't. In RSS Bandit I use the XmlDocument so I don't have to encode that example.

In both the text/html and text/plain versions I just grab the InnerText of the XmlNode, or the Value of the XmlTextReader, and feed that to the browser.

Basically the encoding hack you described is only necessary for people processing Echo without an XML processor. The major problem with encoding stuff correctly has nothing to do with CDATA or stuff being doubly encoded but the fact that there was no way to tell whether the content of an element was going to be text/plain or text/html

Message from Dare Obasanjo at

There is an alternative suggestion on the EchoExample page, where encoding is made explicit through the use of an attribute on the <content /> element.

Posted by Arve at

Sam,
  By the way, you do realize that there is no difference between

<encoded>Hello, &lt;em&gt;weblog&lt;/em&gt; world!</encoded>

and

<encoded><![CDATA[Hello, <em>weblog</em> world!]]></encoded>

So I'm not sure what your comments about CDATA mean exactly.
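The equivalence can be checked with any conforming parser. The following is a minimal sketch using Python's xml.etree.ElementTree, assuming the CDATA form carries the same <em> markup as the entity-escaped form; the variable names are illustrative.

```python
import xml.etree.ElementTree as ET

escaped = "<encoded>Hello, &lt;em&gt;weblog&lt;/em&gt; world!</encoded>"
cdata = "<encoded><![CDATA[Hello, <em>weblog</em> world!]]></encoded>"

# Both spellings are reported by the parser as the same character data.
escaped_text = ET.fromstring(escaped).text
cdata_text = ET.fromstring(cdata).text
same = escaped_text == cdata_text
```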

Posted by Dare Obasanjo at

Dare, to a conforming XML parser there may not be much of a difference between CDATA and entity encoding, but to a tagsoup parser, there is a big difference.

(And unfortunately, I do believe we will have to live with tools that do not use an XML parser for a long time to come)

Posted by Arve Bersvendsen at

Arve,
I'm sorry but I think it is ridiculous to define a brand new XML format with the explicit goal of attempting to make it work with non-XML processors. Seriously, what platforms and programming languages do not have an XML parser?

Message from Dare Obasanjo at

quote... Looking at this, I am troubled by the implicit knowledge of encoding that is required. imho, another level of abstraction like an encoded element and use of base-64 only complicates the issue. yes entity and cdata encoding makes writing...

Excerpt from iBLOGthere4iM at

Dare, I do agree with you in the respect that I think using tagsoup parsers is using a hammer where a screwdriver would be more useful, and I do not encourage the use of the wrong tool for the right job.  I am just worried that if we end up with a specification that makes life exceedingly difficult for these tools, adoption of a new standard will take much, much longer.

Posted by Arve Bersvendsen at

Looking at this, I'm troubled by how much it looks like the RSS 1.0 content module from before Aaron added content:encoded. How many people actually used content:item and friends? I remember having to search quite a bit to find a single example.

Posted by Phil Ringnalda at

Arve,
  Adding hacks to your format to support some improperly coded tools is a bad idea. End of story. Seriously, where do you draw the line? I want to be able to process RSS with wget and grep. Should Sam then specify a crippled version of XML so that people like me are able to do this?

OK, how about people who don't want to have to deal with unicode but want to process all the XML as ASCII, after all this is the brokenness of XML-RPC and it lets improperly coded tools that claim to support XML get away with it. Should Echo also support this and basically dismiss all non-English users of the format?

I'd understand if the barrier to entry was high in processing XML, but given that I haven't seen any major platform where an XML parser doesn't exist, you are simply stating that the format should be bent to satisfy programmers who are not interested in doing due diligence but instead would rather reinvent the wheel and hack up a ghetto XML parser in a weekend.

Message from Dare Obasanjo at

I don't much like this because I'd rather use a format that was valid XML.  Like it or not the CDATA thing is how you handle embedding.  Encoding looks equivalent to me.  As a user rather than a producer of these tools/etc, I don't think I'd support a format that wasn't simple valid XML.  Perhaps to solve the validity problem, what you need are tools on the server which barf when they output invalid XML rather than on the client.  Stick these in highly valuable library places and you've got a fighting chance people will use them.  (Example: Xerces, Xalan, Axis) -- I'll now let you return to your normal discussion of tag pedantry. :-)

Posted by Andy at

I think the correct term here is escaping, not encoding. This should not be discussed in terms of syntax, but in terms of the model.

What you need to know is the mime type and the text content of the element. "text content" needs to be defined first:

- In the MSXML DOM it is the "text" property of the element.
- In DOM3 it is the "textContent" property.
- In older DOMs it is the concatenation of the nodeValue property of the child nodes of the element that have nodeType 3 or 4. (text node or CDATA node)
- In the XPath model it is the result of the "string()" function applied to the element.
- In the InfoSet it is the concatenation of all the children of the element that are Character Information Items.
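As one illustration of the "older DOMs" rule above, here is a minimal Python sketch using xml.dom.minidom. The `text_content` helper and the sample element are assumptions made for the example, not anything defined by Echo.

```python
from xml.dom import minidom, Node

def text_content(element):
    """Concatenate the nodeValue of direct children of nodeType 3 or 4."""
    return "".join(
        child.nodeValue
        for child in element.childNodes
        if child.nodeType in (Node.TEXT_NODE, Node.CDATA_SECTION_NODE)
    )

# One escaped ampersand in a text node followed by a CDATA section.
doc = minidom.parseString(
    '<content type="text/html">a &amp; <![CDATA[<b>c</b>]]></content>'
)
result = text_content(doc.documentElement)
```

The text node and the CDATA node contribute to the same string, so the consumer never needs to know which of the two the producer used.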

So once you have the text content, and you know what mime-type it has, it depends on what you want to do with the content.

If you are generating text/plain output, you can directly use text/plain text content. For text/html you'll want to strip the tags, and maybe do some pretty printing.

If you are generating text/html output, you can directly use text/html text content. (Maybe check for unclosed tags) text/plain content should be escaped according to http://www.w3.org/TR/html401/charset.html#h-5.3.2

If you are generating application/xhtml+xml, you should escape text/plain according to http://www.w3.org/TR/REC-xml#syntax
If you want to use text/html, you'll need some pretty fancy clean-up routines (like htmltidy) to make it work.
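A rough Python sketch of the text/plain and text/html cases, assuming the text content has already been extracted from the element. The function names `to_html` and `to_plain` and the `_TagStripper` class are hypothetical; the tag stripping is deliberately naive and does no pretty printing or clean-up.

```python
import html
from html.parser import HTMLParser

class _TagStripper(HTMLParser):
    """Collect only the character data of an HTML fragment."""
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def to_html(text, mime_type):
    """Prepare text content for inclusion in text/html output."""
    if mime_type == "text/html":
        return text                            # usable directly
    if mime_type == "text/plain":
        return html.escape(text, quote=False)  # escape <, > and &
    raise ValueError("unsupported type: " + mime_type)

def to_plain(text, mime_type):
    """Prepare text content for text/plain output."""
    if mime_type == "text/plain":
        return text                            # usable directly
    if mime_type == "text/html":
        stripper = _TagStripper()
        stripper.feed(text)
        stripper.close()
        return "".join(stripper.parts)         # tags dropped, entities decoded
    raise ValueError("unsupported type: " + mime_type)
```

For example, `to_html("2 < 4!", "text/plain")` escapes the less-than sign, while the same call with `"text/html"` passes the string through unchanged.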

So I agree with Dare. There's no problem (for the Echo format that is) once you know if a string is text/plain or text/html. It's up to the spec to be clear about this everywhere.

Posted by Sjoerd Visscher at

Although I'm pretty agnostic about whether or not one would want to encode *ML (with a lean to "we should be moving to not"), I'm definitely sure we need to provide encoding for many things not *ML, such as binary resources.

What I strongly disagree with is an XML element name that denotes a characteristic of content rather than the relation of that content in respect to its parent element.  Attributes, or sibling elements, should be used to denote characteristics of contents.

Posted by Ken MacLeod at

Now to specific examples.

The default encoding, 'none', means the content is the value and no additional decoding is necessary.  I.e., it's a string.

<content type="text/html" xml:lang="en-us">
  <![CDATA[<p>Hello, <em>weblog</em> world! 2 &lt; 4!</p>]]>
</content>

An attribute of 'encoding="none"' could have been used there too.

For XML, a "literal" encoding is used to indicate that the content contains parsed XML elements:

<content type="application/xhtml+xml" xml:lang="en-us" encoding="literal">
  <p xmlns="...">Hello, <em>weblog</em> world! 2 &lt; 4!</p>
</content>

For binary, a "base64" encoding is used to indicate the content is binary:

<content type="image/jpeg" encoding="base64">
  xo+0AFh5Fh1mdsTbTRt781hrVbF3
</content>
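A consumer of these three forms might dispatch on the attribute roughly as follows. This is a sketch in Python under stated assumptions: the `decode_content` name, its return types, and the handling of unknown values are illustrative choices, not part of the proposal.

```python
import base64
import xml.etree.ElementTree as ET

def decode_content(element):
    """Decode a parsed <content> element according to its encoding attribute."""
    encoding = element.get("encoding", "none")  # 'none' is the default
    if encoding == "none":
        return element.text or ""               # the character data is the value
    if encoding == "literal":
        return list(element)                    # the parsed child elements
    if encoding == "base64":
        # b64decode ignores the surrounding whitespace by default
        return base64.b64decode(element.text or "")
    raise ValueError("unknown encoding: " + encoding)
```

Returning a string, a list of elements, or bytes depending on the encoding keeps the sketch short; a real consumer would probably pair the result with the type attribute.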

Posted by Ken MacLeod at

For reference, I also believe that there should be only one content child of an entry, and the types 'multipart/alternative' and 'multipart/related' are specifically recognized to mean that additional 'content' elements are the children of the content element:

<content type="multipart/alternative">
  <content type="text/html" xml:lang="en-us">
  <![CDATA[<p>Hello, <em>weblog</em> world! 2 &lt; 4!</p>]]>
  </content>
  <content type="application/xhtml+xml" xml:lang="en-us" encoding="literal">
  <p xmlns="...">Hello, <em>weblog</em> world! 2 &lt; 4!</p>
  </content>
  <content type="image/jpeg" encoding="base64">
  xo+Hello0AFWeblogh5FWorldh1mImagedsTbrVbF3
  </content>
</content>

I first saw this style in Jonathan Borden's XML MIME Transformation Protocol (XMTP).
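For what it is worth, picking one part out of such a container is straightforward. The Python sketch below assumes the nesting shown above; `pick_alternative` and its preference list are hypothetical names for illustration.

```python
import xml.etree.ElementTree as ET

def pick_alternative(content, preferred_types):
    """Return the nested <content> part matching the earliest preferred type.

    A non-multipart element is returned as-is; None means no part matched.
    """
    if content.get("type") != "multipart/alternative":
        return content
    parts = content.findall("content")
    for wanted in preferred_types:
        for part in parts:
            if part.get("type") == wanted:
                return part
    return None
```

A client that renders XHTML natively could call it with `["application/xhtml+xml", "text/html"]`, and one that only handles HTML would pass `["text/html"]`.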

Posted by Ken MacLeod at

Rereading, I think I'm agreeing with Dare and Sjoerd.

Some more clarifications:

The distinction between plain text and html to be stripped and rendered is based on the content type, not the encoding.

encoding="none" is, by definition of XML, "escaping" XML special characters, there's no need for a separate "escaped" encoding.

In this context, content is required to be in the XML character set encoding of the containing instance. A 'charset' modifier cannot contradict the instance; it can only, for example, specify a subset of the UTF encoding of the XML instance this value comes from.

MIME uses the term 'encoding' to mean a modification of the value, and 'charset' to mean the character set used.  XML uses the term 'encoding' to mean the character set used.

Now, just to find a place to raise the visibility of this in the wiki...

Posted by Ken MacLeod at

Right. Escaping and encoding are orthogonal issues. My post was about the 'encoding="none"' case. For the encoding="base64" case, you first need to do a base64 decoding step, and then my post applies again. Encoding="literal" only applies to xml content and cannot be used for text/plain or text/html.

So indicating the encoding is useful, and I would prefer an attribute, not an element. But it has hardly anything to do with the escaping problem.

The multipart content is interesting, but I think too complex for Echo 1.0. Also in the case of multipart/related, you'd want one part to be able to refer to another part. (f.e. a text/html part that refers to an image/jpeg part)

It is a good reason though for not allowing multiple content elements in Echo 1.0, and delaying a solution for that to a later version.

Posted by Sjoerd Visscher at

Sjoerd,

Interestingly, if the "content" was made to be format-neutral (that is, taking any mime type), the multipart stuff would just work anyway.

Ken,

Your comments are right on, IMO.

Posted by Chris Wilper at

On rereading all arguments, Sjoerd's in particular, I'm changing my mind; no explicit "encoding" element or attribute should be needed, specifying type="..." should be enough.

Posted by Arve Bersvendsen at

Sjoerd, if I understand you correctly, there is no escaping issue.

In the bytes in a file on disk, if one says those bytes are "text/plain" they have to be run through html.escape(bytes) before display.  If the bytes are "text/html" and you run them through html.escape(bytes) all < & and > will show up on the web page literally, just as they do with plain text.  The answer there is, knowing that it's "text/html" you instead run it through html.scrub(bytes) to scrub it, then present it directly.

The XML content of the <content> element are just bytes, it's the content type that tells one what to do with them.

Re. multipart/related, that's a solved problem, using the URI scheme 'cid:'.

Re. the conceptual model of content, I don't think a change like this could be cleanly made in the future without backward incompatibility or a new-content extension that logically replaces <content>.

Posted by Ken MacLeod at

I know this is not an IANA-assigned mimetype, but application/base64 is readily recognized by MSIE as a mimetype. Microsoft documentation

One could alternatively agree on using application/x-base64.

Posted by Arve Bersvendsen at

er, technically, the XML content are characters, not bytes.

Posted by Ken MacLeod at

About <content> in general, the use of content-type and/or encoding is a significant "invention" over what is happening in current usage.  Whether or not it's a good invention and won't get us on the slippery slope of invention is another matter.

Still thinking too fast to write, here.  You are correct, Sjoerd, multipart content can be left to future development easily:

multipart/alternative and multipart/related can be ''reserved for future use'' with an assumed or required encoding of "literal" and any client not recognizing that content-type should discard the content and present "Content type multipart/foo cannot be displayed" just as any other unpresentable content type.

Posted by Ken MacLeod at

Agreed that we need to be clear about the difference between escaping and encoding.  My take on this is in the Wiki near the top of http://intertwingly.net/wiki/pie/EscapedHtmlDiscussion

Posted by Tim Bray at

Arve, using a content type of application/base64 makes the real content type "unknown".  application/base64 literally means, "this is binary data of unknown type".

Posted by Ken MacLeod at

Theoretically there is no escaping issue. The issue is that programmers tend to do random encoding/decoding/escaping/unescaping until the result looks ok. This results from the problem that programming languages just have one string type, instead of keeping track of what kind of string it is (subclassing the String) and providing the proper conversions. So the issue should be solved by proper education in the Echo specification with clear examples. Noting that XML text contents are unicode characters, not bytes, is a good example.

Using application/base64 is very bad. It is an encoding scheme, not a content type. You need to know what the mime-type is of the resulting bytestream after decoding.

Posted by Sjoerd Visscher at

My bad.  Yes, there is an issue and it is specification, not technical difficulty.  Thx!

Posted by Ken MacLeod at

I support the approach suggested by Ken MacLeod -- it is, IMHO, the best way to capture the information about entry content.
A couple of questions:



Posted by Misha Dynin at

Sjoerd, Ken: Re: application/base64 - I concur.  I had set the attribute think-before-posting="false".

Posted by Arve Bersvendsen at

Misha,
What encoding element? I see an encoding attribute in his example and from the various discussions it seems Ken realizes that it is unnecessary given that there already is a content-type attribute.

Perhaps you didn't read the entire thread?

Message from Dare Obasanjo at

Can you please check my characterization of your proposal at the top of EscapedHtmlDiscussion? Here's what's there now:

Proposal Ruby: There are two forms of encoding:

SOMELIST is initially defined as <quoted> and <base64> but can change at any time breaking all old code.

Is that right?

Posted by Aaron Swartz at

Aaron,
The discussion in this thread has shown that both of these proposals are unnecessary and in fact are hacks. I think your description confuses issues, especially by using "encoding" when "escaping" is meant. If you need a summary of why none of the proposals is needed, read http://www.intertwingly.net/blog/1500.html#c1056803387 and http://www.intertwingly.net/blog/1500.html#c1056814829

I'd update the Wiki myself but I don't feel like dealing with bruised egos of people whose content gets edited.

Message from Dare Obasanjo at

It's not true that the encoding attribute is unnecessary because there already is a content-type attribute (named simply "type" in current proposals).

The 3 values of encoding Ken proposed, "none", "literal" and "base64" are useful. However, each mime-type has one usually obvious preferred encoding. "none" for text/*, "literal" for */*+xml and "base64" for image/*. So for simplicity Echo should require the preferred encoding for each mimetype, so the encoding attribute is not needed.
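The mapping is small enough to write down. The Python sketch below is only an illustration: the function name is hypothetical, and checking the "+xml" suffix before the major type (so that something like image/svg+xml counts as XML) is an assumption, as is rejecting every other type.

```python
def preferred_encoding(mime_type):
    """Map a mime-type to the single encoding the format would require for it."""
    major, _, subtype = mime_type.lower().partition("/")
    if subtype.endswith("+xml"):
        return "literal"   # */*+xml: parsed XML elements
    if major == "text":
        return "none"      # text/*: plain character data
    if major == "image":
        return "base64"    # image/*: binary data
    raise ValueError("no preferred encoding defined for " + mime_type)
```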

Separate from this discussion, I don't think Echo should support base64. Images can simply be published online and referred to by their URLs. This has worked perfectly for RSS.

Posted by Sjoerd Visscher at

(I need to post faster)

Rereading my last post, there's hardly a difference between "unnecessary" and "not needed", so Dare is right.

The specification must indicate that a content element is handled differently depending on its mimetype.

Posted by Sjoerd Visscher at

I've placed this proposal in content.  I believe I've covered most of the issues in the "See also" pages.

Posted by Ken MacLeod at

Yes, I meant encoding attribute.  Sorry.  It can be called "format" -- "encoding" is overloaded.
I am against auto-detecting default format based on media type because it is error-prone.

Posted by Misha Dynin at

The nice thing is that you don't need to detect anything. You need to handle each media type separately anyway. If the encoding is fixed for each media type, it's much easier to parse the Echo feed.

Posted by Sjoerd Visscher at

What is wrong with our current approach with content:encoded and xhtml:body in RSS 1.0?  You are just proposing different XML syntax.

Did I misread?

Whether a CDATA section or encoded content is used shouldn't matter from an XML parser perspective but it might matter from the profile perspective.

Kevin

Posted by Kevin Burton at

Kevin, one of the clear use-cases for several new blog wares are media content types beyond text/plain, text/html, and application/xhtml+xml.

While the new syntax is comparably similar to description, content:encoded, and xhtml:body for those types, the 'content' proposals are not restricted to just those types.

Re. CDATA, my opinion is that it is inappropriate for an application of XML to make any more of a distinction or preference between CDATA sections and character data than the XML spec does, which is to say, virtually none.

Posted by Ken MacLeod at

Dare, can you confirm that the primary difference between your approach and mine is that you're using the content type to determine whether the content is inline (XML elements and character data, I use the term "literal") or escaped (nee "quoted")?  Thx.

Posted by Ken MacLeod at

Ken,
Yup, confirmed.

Message from Dare Obasanjo at

Escaped vs. Unescaped Markup

This looks about right....

Excerpt from Don Box's Spoutlet at

Rereading, I notice Dare writes, "I see an encoding attribute in his example and from the various discussions it seems Ken realizes that it is unnecessary given that there already is a content-type attribute."

That's incorrect.  I've got this gut feeling that basing the encoding on the content type won't work -- something along the lines that not all XML content, now and in the future, have a mime type '*+xml' or that mime-only types are the only right way to go.

Tim Bray makes a similar statement ("tying the escaping level to the type= is attractive but doesn't quite work"), but doesn't say why.

Posted by Ken MacLeod at

Also in the case of multipart/related, you'd want one part to be able to refer to another part.

Posted by Fred Hurb at
