Looking at this, I am troubled by the implicit knowledge of
encoding that is required. Less than signs in XHTML are
encoded once. The same thing needs to be encoded (or wrapped
in CDATA) twice for HTML.
How many times should title be decoded? Aggregators today
generally have to guess. And such guesses have
caused problems in the past.
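To make the ambiguity concrete, here is a minimal Python sketch, not part of any proposal. The sample title is invented, and `html.escape` merely stands in for whatever escaping a feed producer applies. The same title is carried once as plain text and once as HTML source, and an XML parser undoes exactly one level in both cases.

```python
import html
import xml.etree.ElementTree as ET

title = "AT&T <3"  # hypothetical title, chosen to contain & and <

# Plain text placed in XML: escaped once.
once = html.escape(title, quote=False)
# HTML source placed in XML: escaped as HTML, then escaped again for XML.
twice = html.escape(once, quote=False)

# The XML parser removes one level of escaping, no more.
plain_text = ET.fromstring(f"<title>{once}</title>").text
html_text = ET.fromstring(f"<title>{twice}</title>").text

print(repr(plain_text))  # ready to use as text
print(repr(html_text))   # still HTML source; needs one more decode to get text
```

Nothing in the element itself says which of the two cases applies, which is why a consumer ends up guessing.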
We can't simply rely on <![CDATA[ ]]> as this may be
already required in order to singly encode the data as XML to begin
with.
I'd like to see a single element, say <encoded>, be
introduced that makes such encoding explicit. We can decide
where such elements are allowed. We might even want to explore alternate
encodings, such as
base64Binary
for handling things like pictures in archives.
The less-than signs: I don't see this as a problem. XML is a special case where you have namespaced the contents. Therefore, it's subject to different rules.
If you want consistency, then don't use namespaces for this. Use CDATA just like the other encodings. Then, for each format, all you need to care about is that it is a proper CDATA representation of the content.
I'm not sure I understand the problem with title. You are wondering what happens when the actual title within the content needs to be represented with CDATA, and then has to be wrapped as CDATA again? I thought the correct approach to that would be to write the whole thing as CDATA, or, alternatively, nest the CDATA (you can do that can't you?)
Sam,
When talking about XML the term "encoding" is overloaded. It took me a few seconds to realize what you were talking about. Anyway, properly dealing with the various encoding issues in RSS Bandit has probably been one of the biggest causes of consternation on my part so I feel obliged to provide input. :)
Your suggestion of an encoded element seems like a hack. Currently I don't have to encode any of the elements in the <content/> element. Why are you trying to complicate issues?
Discussions of the EchoExample follow
In the application/xhtml+xml example it depends on the XML API I use to determine whether I should encode the content before I shoot it off to a browser. If I use the XmlTextReader in the .NET Framework I have to encode it but if I'm starting from an XmlDocument object I don't. In RSS Bandit I use the XmlDocument so I don't have to encode that example.
In both the text/html and text/plain versions I just grab the InnerText of the XmlNode or the Value of the XmlTextReader and feed that to the browser.
Basically the encoding hack you described is only necessary for people processing Echo without an XML processor. The major problem with encoding stuff correctly has nothing to do with CDATA or stuff being doubly encoded but the fact that there was no way to tell whether the content of an element was going to be text/plain or text/html
Dare, to a conforming XML parser there may not be much of a difference between CDATA and entity encoding, but to a tagsoup parser, there is a big difference.
(And unfortunately, I do believe we will have to live with tools that do not use an XML parser for a long time to come)
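A small Python sketch of the point about conforming parsers versus tag soup. The sample markup is made up, and the substring check is only a stand-in for a tool that scans the raw feed text without parsing it.

```python
import xml.etree.ElementTree as ET

entity_form = "<content>&lt;b&gt;hi&lt;/b&gt;</content>"
cdata_form = "<content><![CDATA[<b>hi</b>]]></content>"

# A conforming XML parser reports identical character data for both forms.
entity_text = ET.fromstring(entity_form).text
cdata_text = ET.fromstring(cdata_form).text

# A tool scanning the raw text sees two different things.
raw_scan_entity = "<b>" in entity_form
raw_scan_cdata = "<b>" in cdata_form
```

After parsing, both forms yield the same string, while the raw scan finds a literal `<b>` only in the CDATA form.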
Arve,
I'm sorry but I think it is ridiculous to define a brand new XML format with the explicit goal of attempting to make it work with non-XML processors. Seriously, what platforms and programming languages do not have an XML parser?
quote... Looking at this, I am troubled by the implicit knowledge of encoding that is required. imho, another level of abstraction like an encoded element and use of base-64 only complicates the issue. yes entity and cdata encoding makes writing...
Dare, I do agree with you in the respect that I think using tagsoup parsers is using a hammer where a screwdriver would be more useful, and I do not encourage the use of the wrong tool for the right job. I am just worried that if we end up with a specification that makes life exceedingly difficult for these tools, adoption of a new standard will take much, much longer.
Looking at this, I'm troubled by how much it looks like the RSS 1.0 content module from before Aaron added content:encoded. How many people actually used content:item and friends? I remember having to search quite a bit to find a single example.
Arve,
Adding hacks to your format to support some improperly coded tools is a bad idea. End of story. Seriously, where do you draw the line? I want to be able to process RSS with wget and grep. Should Sam then specify a crippled version of XML so that people like me should be able to do this?
OK, how about people who don't want to have to deal with unicode but want to process all the XML as ASCII, after all this is the brokenness of XML-RPC and it lets improperly coded tools that claim to support XML get away with it. Should Echo also support this and basically dismiss all non-English users of the format?
I'd understand if the barrier to entry was high in processing XML, but given that I haven't seen any major platform where an XML parser doesn't exist, you are simply stating that the format should be bent to satisfy programmers who are not interested in doing due diligence but instead would rather reinvent the wheel and hack up a ghetto XML parser in a weekend.
I don't much like this because I'd rather use a format that was valid XML. Like it or not, the CDATA thing is how you handle embedding. Encoding looks equivalent to me. As a user rather than a producer of these tools, etc., I don't think I'd support a format that wasn't simple valid XML. Perhaps to solve the validity problem, what you need are tools on the server which barf when they output invalid XML rather than on the client. Stick these in highly valuable library places and you've got a fighting chance people will use them. (Example: Xerces, Xalan, Axis) -- I'll now let you return to your normal discussion of tag pedantry. :-)
I think the correct term here is escaping, not encoding. This should not be discussed in terms of syntax, but in terms of the model.
What you need to know is the mime type and the text content of the element. "text content" needs to be defined first:
- In the MSXML DOM it is the "text" property of the element.
- In DOM3 it is the "textContent" property.
- In older DOMs it is the concatenation of the nodeValue property of the child nodes of the element that have nodeType 3 or 4. (text node or CDATA node)
- In the XPath model it is the result of the "string()" function applied to the element.
- In the InfoSet it is the concatenation of all the children of the element that are Character Information Items.
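As a rough illustration of the definitions above, here is a Python sketch that computes the text content two ways: with ElementTree's `itertext()`, and with the "older DOM" rule of concatenating child nodes of type 3 or 4 using `xml.dom.minidom`. The sample document is invented and has no child elements, so both approaches cover the same nodes.

```python
import xml.etree.ElementTree as ET
from xml.dom import Node, minidom

doc = "<content>a &amp; b <![CDATA[<c>]]> d</content>"

# ElementTree: join all character data found under the element.
et_text = "".join(ET.fromstring(doc).itertext())

# DOM: concatenate nodeValue of children that are text (3) or CDATA (4) nodes.
element = minidom.parseString(doc).documentElement
dom_text = "".join(
    child.nodeValue
    for child in element.childNodes
    if child.nodeType in (Node.TEXT_NODE, Node.CDATA_SECTION_NODE)
)
```

Both give the same string, with the entity resolved and the CDATA wrapper gone, which is the value the mime type then describes.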
So once you have the text content, and you know what mime-type it has, it depends on what you want to do with the content.
If you are generating text/plain output, you can directly use text/plain text content. For text/html you'll want to strip the tags, and maybe do some pretty printing.
If you are generating text/html output, you can directly use text/html text content (maybe check for unclosed tags). text/plain content should be escaped according to http://www.w3.org/TR/html401/charset.html#h-5.3.2
If you are generating application/xhtml+xml, you should escape text/plain according to http://www.w3.org/TR/REC-xml#syntax
If you want to use text/html, you'll need some pretty fancy clean-up routines (like htmltidy) to make it work.
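The rules above can be sketched as a small dispatch on source and target type. This is only an illustration under simplifying assumptions: `convert` and `strip_tags` are hypothetical names, the regex tag stripper is deliberately crude, and the text/html to XHTML case is left unimplemented because it needs a tidy-style clean-up pass.

```python
import html
import re

MARKUP_TARGETS = ("text/html", "application/xhtml+xml")

def strip_tags(markup: str) -> str:
    """Crude tag stripper for illustration only; real HTML needs a real parser."""
    without_tags = re.sub(r"<[^>]*>", "", markup)
    return html.unescape(without_tags)

def convert(text: str, source_type: str, target_type: str) -> str:
    """Turn text content of source_type into something usable as target_type."""
    if source_type == target_type:
        return text  # same type: use the text content directly
    if source_type == "text/plain" and target_type in MARKUP_TARGETS:
        return html.escape(text, quote=False)  # escape &, < and >
    if source_type == "text/html" and target_type == "text/plain":
        return strip_tags(text)
    # text/html -> application/xhtml+xml needs htmltidy-like clean-up,
    # which is out of scope for this sketch.
    raise NotImplementedError(f"{source_type} -> {target_type}")
```

For example, `convert("a < b", "text/plain", "text/html")` escapes the less-than sign, while the same call with a source type of text/html passes the string through untouched.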
So I agree with Dare. There's no problem (for the Echo format that is) once you know if a string is text/plain or text/html. It's up to the spec to be clear about this everywhere.
Although I'm pretty agnostic about whether or not one would want to encode *ML (with a lean to "we should be moving to not"), I'm definitely sure we need to provide encoding for many things not *ML, such as binary resources.
What I strongly disagree with is an XML element name that denotes a characteristic of content rather than the relation of that content in respect to its parent element. Attributes, or sibling elements, should be used to denote characteristics of contents.
For reference, I also believe that there should be only one content child of an entry, and the types 'multipart/alternative' and 'multipart/related' are specifically recognized to mean that additional 'content' elements are the children of the content element:
Rereading, I think I'm agreeing with Dare and Sjoerd.
Some more clarifications:
The distinction between plain text and html to be stripped and rendered is based on the content type, not the encoding.
encoding="none" is, by definition of XML, "escaping" XML special characters; there's no need for a separate "escaped" encoding.
Content types are, in this context, required to be in the XML character set encoding of the containing instance, and a 'charset' modifier cannot contradict the instance, but only, for example, specify a subset of UTF XML encoded instance this value comes from.
MIME uses the term 'encoding' to mean a modification of the value, and 'charset' to mean the character set used. XML uses the term 'encoding' to mean the character set used.
Now, just to find a place to raise the visibility of this in the wiki...
Right. Escaping and encoding are orthogonal issues. My post was about the 'encoding="none"' case. For the encoding="base64" case, you first need to do a base64 decoding step, and then my post applies again. Encoding="literal" only applies to xml content and cannot be used for text/plain or text/html.
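A minimal sketch of that ordering in Python: undo the transfer encoding first, then treat the result according to its type. The function name `decode_content` is hypothetical, and decoding the base64 payload as UTF-8 is an assumption that only makes sense for textual types. The "literal" case is not handled here because it concerns child elements rather than a text string.

```python
import base64

def decode_content(text_content: str, encoding: str) -> str:
    """Undo the transfer encoding; the type-specific rules apply to the result."""
    if encoding == "none":
        return text_content
    if encoding == "base64":
        # Assumption: the decoded bytes are UTF-8 text (e.g. text/html).
        return base64.b64decode(text_content).decode("utf-8")
    raise ValueError(f"unsupported encoding: {encoding!r}")

payload = base64.b64encode("<p>hi</p>".encode("utf-8")).decode("ascii")
decoded = decode_content(payload, "base64")
```

After this step `decoded` is an ordinary text/html string, and the escaping rules for text/html apply to it exactly as in the encoding="none" case.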
So indicating the encoding is useful, and I would prefer an attribute, not an element. But it has hardly anything to do with the escaping problem.
The multipart content is interesting, but I think too complex for Echo 1.0. Also in the case of multipart/related, you'd want one part to be able to refer to another part. (f.e. a text/html part that refers to an image/jpeg part)
It is a good reason though for not allowing multiple content elements in Echo 1.0, and delaying a solution for that to a later version.
On rereading all arguments, Sjoerd's in particular, I'm changing my mind; no explicit "encoding" element or attribute should be needed, specifying type="..." should be enough.
Sjoerd, if I understand you correctly, there is no escaping issue.
In the bytes in a file on disk, if one says those bytes are "text/plain" they have to be run through html.escape(bytes) before display. If the bytes are "text/html" and you run them through html.escape(bytes) all < & and > will show up on the web page literally, just as they do with plain text. The answer there is, knowing that it's "text/html" you instead run it through html.scrub(bytes) to scrub it, then present it directly.
The XML content of the <content> element are just bytes, it's the content type that tells one what to do with them.
Re. multipart/related, that's a solved problem, using the URI scheme 'cid:'.
Re. the conceptual model of content, I don't think a change like this could be cleanly made in the future without backward incompatibility or a new-content extension that logically replaces <content>.
About <content> in general, the use of content-type and/or encoding is a significant "invention" over what is happening in current usage. Whether or not it's a good invention and won't get us on the slippery slope of invention is another matter.
Still thinking too fast to write, here. You are correct, Sjoerd, multipart content can be left to future development easily:
multipart/alternative and multipart/related can be ''reserved for future use'' with an assumed or required encoding of "literal" and any client not recognizing that content-type should discard the content and present "Content type multipart/foo cannot be displayed" just as any other unpresentable content type.
Arve, using a content type of application/base64 makes the real content type "unknown". application/base64 literally means, "this is binary data of unknown type".
Theoretically there is no escaping issue. The issue is that programmers tend to do random encoding/decoding/escaping/unescaping until the result looks ok. This results from programming languages having just one string type, instead of keeping track of what kind of string it is (subclassing the String) and providing the proper conversions. So the issue should be solved by proper education in the Echo specification with clear examples. Noting that XML text contents are unicode characters, not bytes, is a good example.
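The idea of subclassing the string type can be sketched in Python as follows. The class names `PlainText` and `HtmlText` are invented for illustration; the point is only that each kind of string carries its own conversion, so nobody has to remember how many times something was escaped.

```python
import html

class HtmlText(str):
    """A string known to be HTML source."""

    def to_html(self) -> "HtmlText":
        return self  # already HTML: no further escaping

class PlainText(str):
    """A string known to be plain text."""

    def to_html(self) -> HtmlText:
        # Plain text must be escaped exactly once to become HTML.
        return HtmlText(html.escape(self, quote=False))

def render(fragment) -> str:
    """Produce HTML output from either kind of string."""
    return str(fragment.to_html())
```

With this in place, `render(PlainText("a < b"))` escapes the less-than sign, while `render(HtmlText("<b>x</b>"))` passes the markup through, and calling `to_html()` twice cannot double-escape.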
Using application/base64 is very bad. It is an encoding scheme, not a content type. You need to know what the mime-type is of the resulting bytestream after decoding.
Can we make the encoding element required? 'Inline' encoding is better than 'none' for HTML, but 'none' is better for plain text, so there's no universal 'default' encoding.
Can we specify the encoding and media type on the TITLE element as well?
Misha,
What encoding element? I see an encoding attribute in his example and from the various discussions it seems Ken realizes that it is unnecessary given that there already is a content-type attribute.
Can you please check my characterization of your proposal at the top of EscapedHtmlDiscussion? Here's what's there now:
Proposal Ruby: There are two forms of encoding:
XML (written when the content is well-formed XML; read when first tag is not in SOMELIST): XML markup is kept inline, ala <content><c>foo</c></content>
quoted (written when above doesn't apply; read when tag is in SOMELIST): markup is quoted as indicated, ala <content><quoted><![CDATA[<c>foo</c>]]></quoted></content>
SOMELIST is initially defined as <quoted> and <base64> but can change at any time breaking all old code.
It's not true that the encoding attribute is unnecessary because there already is a content-type attribute (named simply "type" in current proposals).
The 3 values of encoding Ken proposed, "none", "literal" and "base64" are useful. However, each mime-type has one usually obvious preferred encoding. "none" for text/*, "literal" for */*+xml and "base64" for image/*. So for simplicity Echo should require the preferred encoding for each mimetype, so the encoding attribute is not needed.
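A sketch of what "one preferred encoding per mime-type" could look like as a lookup, in Python. The table holds only the three patterns named above; the function name, the rule order (checking `+xml` first, so that something like image/svg+xml counts as XML), and returning `None` for unlisted types are all assumptions of this illustration.

```python
from fnmatch import fnmatchcase

# (pattern, encoding) pairs, checked in order; +xml types come first.
PREFERRED_ENCODING = [
    ("*/*+xml", "literal"),
    ("text/*", "none"),
    ("image/*", "base64"),
]

def preferred_encoding(mime_type: str):
    """Return the fixed encoding for a mime type, or None if none is defined."""
    normalized = mime_type.strip().lower()
    for pattern, encoding in PREFERRED_ENCODING:
        if fnmatchcase(normalized, pattern):
            return encoding
    return None
```

A consumer would call `preferred_encoding("text/html")` and get "none". Types outside the table, such as application/octet-stream, fall through, which is where a rule based on type alone would need more entries.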
Separately from this discussion, I don't think Echo should support base64. Images can simply be published online and referred to by their URLs. This has worked perfectly for RSS.
Yes, I meant encoding attribute. Sorry. It can be called "format" -- "encoding" is overloaded.
I am against auto-detecting default format based on media type because it is error-prone.
The nice thing is that you don't need to detect anything. You need to handle each media type separately anyway. If the encoding is fixed for each media type, it's much easier to parse the Echo feed.
Kevin, one of the clear use-cases for several new blog wares are media content types beyond text/plain, text/html, and application/xhtml+xml.
While the new syntax is comparably similar to description, content:encoded, and xhtml:body for those types, the 'content' proposals are not restricted to just those types.
Re. CDATA, my opinion is that it is inappropriate for an application of XML to make any more of a distinction or preference between CDATA sections and character data than the XML spec does, which is to say, virtually none.
Dare, can you confirm that the primary difference between your approach and mine is that you're using the content type to determine whether the content is inline (XML elements and character data, I use the term "literal") or escaped (nee "quoted")? Thx.
Rereading, I notice Dare writes, "I see an encoding attribute in his example and from the various discussions it seems Ken realizes that it is unnecessary given that there already is a content-type attribute."
That's incorrect. I've got this gut feeling that basing the encoding on the content type won't work -- something along the lines that not all XML content, now and in the future, have a mime type '*+xml' or that mime-only types are the only right way to go.
Tim Bray makes a similar statement ("tying the escaping level to the type= is attractive but doesn't quite work"), but doesn't say why.