Content Problems

[JonathanPorter]

Please Read: This discussion was moved from the content page to address what appear to be serious issues overlooked in the general consensus on the content element. Previous comments are listed below; additional comments should be added here as well. Below the Discussion are Examples of possible solutions for using the content element for embedding. Please add additional examples with appropriate headers as necessary.

See also content, ContentDiscussion, SyntaxExtensionMechanism

XML instance

XML fragment

Discussion

[JonathanPorter, RefactorOk] I think that a number of issues with content are being overlooked. Take the examples below, two of which could equally apply to HTML. People want to use fragments of types such as HTML and XHTML, but there exists no specification for doing so. Other content types wouldn't get this treatment and could simply be pulled and parsed. The examples below are overkill but in reality make more sense. You can pull the content, parse it, and create XHTML documents to your liking, say the actual file to be displayed in the embedded browser of an aggregator.

[KenMacLeod, RefactorOk] What if we make the default case of <content> be "a complete resource", and provide rel="fragment" to indicate the typical Entry body that is only some paragraphs or a <div>? This would invert the sense of the HTML and XHTML content section above as it reads now, making full document types defined and also support fragments as used today in syndicated feeds. Changing it to read:

[JonathanPorter, RefactorOk] Anything would be an improvement on the current model, which suffers from a lack of robustness and from usability issues. It is these types of things that people overlook and that result in serious implementation issues in the future. Your standard parser is more likely than not going to assume fragments when type is application/xhtml+xml or text/html. But there's no consensus on what an HTML or XHTML fragment contains, or even that it should be a fragment; this is just a common assumption that works well with the workflow of a typical weblog. In addition, this goes against the idea of embedding content in the first place. Whatever is in content should correspond to the MIME type. Despite popular belief, a minimal text/html document would have to look like:

<title>Hello World!</title><p>Hello, <em>weblog</em> world! 2 &lt; 4!</p>
The title element is rather superfluous in this context, but to truly be text/html (I think in every version) you must have a title element (everything else can be implied), and if it were XHTML you'd have to specify much more in order to be a valid application/xhtml+xml document. If the type were, say, image/svg+xml, people most likely wouldn't be so inclined to use fragments of data.

While I understand the real-world workflow of how and what people want to embed into their feeds, the best way for usability and interoperability would be to ensure that all content is XML, an external reference, or base64 encoded. An example of XML (the namespace determines acceptability; however, we would still need to define how foreign namespaces interact with Atom):

<content xml:lang="en-us"><p xmlns="http://www.w3.org/1999/xhtml">Hello, <em>weblog</em> world! 2 &lt; 4!</p></content>
In any case clarification is necessary.

[DaveMeehan] I agree that HTML fragments cannot be used. There is no way to define a doctype in a fragment, and therefore any client would have to assume HTML 4 or earlier in order to render. This severely limits what the content creator could imply in the markup to simple HTML elements. Embedded stylesheets, for example, would be of little use. If the content type is text/html or application/xhtml+xml, then it should validate as such. I second the idea that if you are going down the fragment route, then it should be XML only, or CDATA that contains no markup (or none that would/should be rendered by the client). I don't see the benefit of a 'standard' if it breaks other standards in its implementation!

[AsbjornUlsberg, RefactorOk] The point of not escaping the XML content (even if it is XHTML) is that you can do XPath queries directly upon the data inside <content>, without any preprocessing and such. With everything escaped (as you have to do if you want to smack a DOCTYPE inside <content>), you cannot directly query the content, and that's a huge drawback. I would rather propose allowing <html> .. </html> inside <content>, and specifying the DTD some other way, or maybe only escaping the DOCTYPE node, and no other nodes inside <content>. I'm not sure. Escaping everything is no option, imo.
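
A minimal sketch of the difference described above, using Python's standard xml.etree.ElementTree (which supports only a subset of XPath). The sample strings are adapted from the examples on this page; the variable names are illustrative only. Inline XHTML can be queried straight from the parsed entry, while escaped markup is just character data until it is parsed a second time.

```python
import xml.etree.ElementTree as ET

# Namespace prefix in ElementTree's "{uri}tag" notation.
XHTML = "{http://www.w3.org/1999/xhtml}"

# Inline (unescaped) XHTML: the markup is part of the Atom document's tree.
inline = (
    '<content xml:lang="en-us">'
    '<p xmlns="http://www.w3.org/1999/xhtml">'
    'Hello, <em>weblog</em> world! 2 &lt; 4!</p>'
    '</content>'
)

# Escaped markup: the parser sees only one text node inside <content>.
escaped = (
    '<content xml:lang="en-us">'
    '&lt;p xmlns="http://www.w3.org/1999/xhtml"&gt;'
    'Hello, &lt;em&gt;weblog&lt;/em&gt; world!&lt;/p&gt;'
    '</content>'
)

# Direct query works on the inline form.
inline_em = ET.fromstring(inline).find(f".//{XHTML}em")

# The same query finds nothing in the escaped form...
escaped_root = ET.fromstring(escaped)
escaped_em = escaped_root.find(f".//{XHTML}em")

# ...until the text is extracted and parsed again as its own document.
reparsed_em = ET.fromstring(escaped_root.text).find(f".//{XHTML}em")
```

The extra parse is the "preprocessing" step: it is cheap to write, but it means a generic XPath or XSLT tool pointed at the feed cannot see into escaped content.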

[JonathanPorter, RefactorOk] Asbjorn, I agree with your statement on not escaping XML in XML; however, you’re slightly missing the bigger picture. The content element as specified is designed to allow any (not just XML) data to be embedded (or referenced) inside of it. Applications need a neutral way to get that data so that it can be processed afterward. With regard to XML, special cases can be made, but they also have special requirements. If you use a vocabulary that is not specifically designed to be embedded (like XHTML), then you must profile its use in Atom. The XHTML Example Profiles (below) illustrate this; the two require very different steps to make, say, a complete XHTML document (the first requiring nothing more than pulling the html node and its children). The way to work with any data without special consideration is to use complete documents (Full Format examples, below) and specify the content type. Nothing else has to be done: simply check whether your application or handler can support the media type and, if so, extract the data, then hand it off or parse it.

[AsbjornUlsberg, RefactorOk] It's not that I don't understand the bigger picture, I just think we are talking about different things. I think the content profile idea is good. And I don't care very much whether XHTML fragments are allowed in <content> or not. All I care about is that XML can be treated as XML, inside <content> or not. I don't want to escape valid and well-formed XML. I'm sure everyone understands this. Of the below examples, I like the first one best. If we say that the embedded XHTML document has to be a full document (though without or with an escaped DOCTYPE, or maybe with the XHTML version/DTD put elsewhere), it won't be very hard for the content-producing systems to create this on the fly. The <title> element is of course the same as the title of the content (if it has a title) or the same as for <entry>. Other than that, it's just a matter of smacking the real content into <html><body> ... </body></html>. I can't see if and why this is a problem.

[KenMacLeod, RefactorOk] A conforming XHTML document must have a DOCTYPE declaration and should have an XML declaration, both of which are prohibited in XML element content. I would definitely be against a suggestion that would escape the xml or doctype declarations followed immediately by the unescaped XML (ie. some form of "mixed escaping" mode). I might be persuaded to support a 'doctype' attribute to the <content> element, and pass along the xml declaration from the Atom instance into the contained instance, but I don't like those kinds of media-type specific special cases.
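
A small sketch of the well-formedness constraint mentioned above, checked with Python's standard xml.etree.ElementTree. The helper name is_well_formed and the sample strings are illustrative assumptions, not part of any Atom draft. A DOCTYPE or XML declaration placed unescaped inside an element is rejected by the parser, while the same text inside a CDATA section (or escaped) is accepted because it is then only character data.

```python
import xml.etree.ElementTree as ET

DOCTYPE = (
    '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" '
    '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">'
)
HTML = (
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    '<head><title>Hello World!</title></head>'
    '<body><p>Hello</p></body></html>'
)


def is_well_formed(text):
    """Return True if text parses as a well-formed XML document."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


# DOCTYPE as element content: not well-formed.
raw_doctype = f"<content>{DOCTYPE}{HTML}</content>"
# XML declaration anywhere but the very start: not well-formed.
raw_xml_decl = f'<content><?xml version="1.0"?>{HTML}</content>'
# The same document wrapped in CDATA is just text, so it parses.
in_cdata = f"<content><![CDATA[{DOCTYPE}{HTML}]]></content>"
# Without the prolog, the html element is ordinary child markup.
no_prolog = f"<content>{HTML}</content>"
```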

[KenMacLeod, RefactorOk] If you look at the current modes of encoding, "XML fragments" (no mode or "xml" mode), "plain text, escaped" ("escaped" mode), or "base64", passing a complete XML document instance most easily falls into the "escaped" or "base64" modes, as the instance-as-content "obviously" can't be passed as an "XML fragment". Note that this isn't much different than if the content was by src= reference, and the XML document instance was fetched via HTTP or found elsewhere in a MIME container (using the cid: URI scheme). [This paragraph is essentially another phrasing of I don't want media-type specific special cases.]
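
The three encodings named above can be sketched as a media-type-neutral extraction step. This is only an illustration: it assumes the encoding is carried in a mode attribute with the values "xml", "escaped", and "base64", that "xml" is the default, and that a missing type means text/plain. The function name extract_content is hypothetical.

```python
import base64
import xml.etree.ElementTree as ET


def extract_content(content):
    """Return (media_type, payload) for a parsed <content> element.

    Assumes a 'mode' attribute of "xml" (default), "escaped", or "base64".
    """
    media_type = content.get("type", "text/plain")  # default is an assumption
    mode = content.get("mode", "xml")
    if mode == "xml":
        # XML fragment: re-serialize the leading text and every child
        # (ET.tostring includes each child's trailing text).
        payload = (content.text or "") + "".join(
            ET.tostring(child, encoding="unicode") for child in content
        )
    elif mode == "escaped":
        # The parser has already unescaped the entities; the text node
        # is the complete payload, e.g. a full document instance.
        payload = content.text or ""
    elif mode == "base64":
        payload = base64.b64decode(content.text or "")
    else:
        raise ValueError(f"unknown mode: {mode!r}")
    return media_type, payload
```

Nothing in the function looks at the media type to decide how to decode; a caller would check the returned type against the handlers it supports and hand the payload off unchanged.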

[KenMacLeod, RefactorOk] One might look at the "XML fragment" mode and think that that is a special case for an XML media type, but it's not intended that way. The intent is that Atom content may, intentionally and by design, be an XML fragment. The principal use case today is the full body content of an entry or comment (xhtml:body in RSS 2.0) (this is where the Atom Entry profile of XHTML says the content must be an xhtml:div, xhtml:span, or either of their content models). Other use cases include those used by ComponentBlog and similar interests, where the XML fragments are better "contained" as a unit or component than made into properties of an Atom entry.

[KenMacLeod, RefactorOk] I find the first XHTML example below to be very confusing from what I believe to be the intended use of XML fragments in Atom entries. Why would an XHTML fragment that is the body of an Atom entry include the <head>, <title>, and potentially other "header" type information that is intended to be expressed in the entry itself? If the example is not for an Atom Entry (or feed summary or title, which also can contain XML content), then does it really make sense for it not to be a complete XHTML document instance (and be like the escaped examples, below)?

[JonathanPorter, RefactorOk] Ken, I agree the first example is probably not the best use of mixing the XML vocabularies, but I'll leave it for reference. There are a number of ways to profile XHTML properly. References include Modularization of XHTML™ (http://www.w3.org/TR/xhtml-modularization). Working examples include An XHTML + MathML + SVG Profile (http://www.w3.org/TR/XHTMLplusMathMLplusSVG), which has an XHTML in SVG host language example.

[AsbjornUlsberg, RefactorOk] I also agree with you, Ken. Mixed escaping is of course pretty dumb, but the DOCTYPE is very valuable in many cases. What if we introduce a general way to say "this content belongs to this DTD or XML Schema", e.g. with the attribute "targetSchema"? I know this attribute is reserved, and I'm not sure whether it can be used more than once in an XML document, and I'm pretty sure it can't be used to point to a DTD, but I am sure that you understand what I mean. If the targetSchema points to the XHTML 1.0 Strict DTD, the parser will know what type of HTML it is, and the same goes for other formats like MathML and SVG.

[KenMacLeod, RefactorOk] I'm not 100% sure I'm clear on whether you're talking about an "XML instance" or an "XML fragment" (I've added definitions above). I'm pretty sure you're talking about an XML fragment -- in which case I can say I'm of mixed opinion on whether a separate schema/doctype reference in the <content> element is necessary vs. looking at the namespace of the contained element (or elements, doesn't have to be one element). On the other hand, we can look again at URI media types for the 'type' attribute. If by chance you're talking about an XML instance, that falls into the category above about not special-casing for media-types (application/*xml).

Examples: Possible Solutions

Profiled XML (Defines how an XML Vocabulary should interact with Atom, Relies on Namespaces)

XHTML Example Profiles
<content xml:lang="en-us">
  <html xmlns="http://www.w3.org/1999/xhtml">
    <head>
      <title>Hello World!</title>
    </head>
    <body>
      <p>Hello, <em>weblog</em> world! 2 &lt; 4!</p>
    </body>
  </html>
</content>
<content xml:lang="en-us">
  <p xmlns="http://www.w3.org/1999/xhtml">Hello, <em>weblog</em> world! 2 &lt; 4!</p>
</content> 
SVG 1.1 Example (Defined SVG Fragment)
<content xml:lang="en-us">
  <svg width="10cm" height="3cm" viewBox="0 0 1000 300" xmlns="http://www.w3.org/2000/svg" version="1.1">
    <text x="150" y="150" font-family="Verdana" font-size="55" fill="blue" >
      Hello, weblog world! 2 &lt; 4!
    </text>
    <!-- Show outline of canvas using 'rect' element -->
    <rect x="1" y="1" width="998" height="298" fill="none" stroke="blue" stroke-width="2" />
  </svg>
</content>

Full Format (As according to Media Type, Escaped)

<content type="application/xhtml+xml" xml:lang="en-us" xml:space="preserve">&lt;!DOCTYPE html 
   PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
   "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
&lt;html xmlns="http://www.w3.org/1999/xhtml">
  &lt;head>
    &lt;title>Hello World!&lt;/title>
  &lt;/head>
  &lt;body>
    &lt;p>Hello, &lt;em>weblog&lt;/em> world! 2 &amp;lt; 4!&lt;/p>
  &lt;/body>
&lt;/html></content>
<content type="application/xhtml+xml" xml:lang="en-us" xml:space="preserve"><![CDATA[<!DOCTYPE html 
   PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
   "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <title>Hello World!</title>
  </head>
  <body>
    <p>Hello, <em>weblog</em> world! 2 &lt; 4!</p>
  </body>
</html>]]></content>
<content type="image/svg+xml" xml:lang="en-us" xml:space="preserve"><![CDATA[<?xml version="1.0" standalone="no"?>
<!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 1.1//EN" 
  "http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd">
<svg width="10cm" height="3cm" viewBox="0 0 1000 300"
     xmlns="http://www.w3.org/2000/svg" version="1.1">
  <desc>Hello Weblog Entry Example</desc>
  <text x="150" y="150" 
        font-family="Verdana" font-size="55" fill="blue" >
    Hello, weblog world! 2 &lt; 4!
  </text>
  <!-- Show outline of canvas using 'rect' element -->
  <rect x="1" y="1" width="998" height="298"
        fill="none" stroke="blue" stroke-width="2" />
</svg>]]></content>

Relative URIs in content

[DougWyatt, RefactorOk] moved from [SuggestionBox] Dave Winer wrote today (2003-07-21) that the first issue the new RSS 2.0 advisory board would address was how to deal with relative links in items (entries). When generating my own feed I came across a similar question. XML Base looks like a possible solution, e.g.

<content type="application/xhtml+xml" xml:lang="en-us" xml:base="http://www.example.com/">
  Pictures are <a href="/Pictures">here</a>.
</content>

(This would only be needed when moving content between sites, as in syndication, not in the context of the editing API)
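
A rough sketch of how a consumer might apply xml:base to the example above, using Python's urllib.parse.urljoin. It handles only a base declared on the <content> element itself; full XML Base processing also covers inherited and nested xml:base attributes, which this sketch ignores. The variable names are illustrative.

```python
from urllib.parse import urljoin
import xml.etree.ElementTree as ET

# ElementTree exposes xml:base under the reserved XML namespace.
XML_BASE = "{http://www.w3.org/XML/1998/namespace}base"

doc = (
    '<content type="application/xhtml+xml" xml:lang="en-us" '
    'xml:base="http://www.example.com/">'
    'Pictures are <a href="/Pictures">here</a>.'
    '</content>'
)

content = ET.fromstring(doc)
base = content.get(XML_BASE)

# Resolve every link in the content against the declared base URI.
resolved = [urljoin(base, a.get("href")) for a in content.iter("a")]
```

With the base in place, an aggregator that republishes the entry on another site can rewrite the links to absolute URIs before display.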