It’s just data

Escaped HTML discussion

An update from yesterday's position, based on feedback.

There are three forms of expressing content. Illustrated by example:

Rationale:

This is also captured in the EscapedHtmlDiscussion.


Do we need base64 for version 1.0? Seems like an invention to me. RSS is doing fine without it.

If you want to be able to validate this properly, you also need an <xml> signal element:

<xs:complexType name="content">
<xs:choice>
<xs:element name="escaped" type="xs:string" />
<xs:element name="xml" type="anyXML" />
</xs:choice>
</xs:complexType>

<xs:complexType name="anyXML">
<xs:sequence minOccurs="0" maxOccurs="unbounded">
<xs:any namespace="##any" processContents="lax" minOccurs="0" maxOccurs="unbounded"/>
</xs:sequence>
</xs:complexType>

Posted by Sjoerd Visscher at

Sam, re attributes vs. elements, I think this is a clear case for using attributes.

The serialization-related interpretation of content of an XML element should not depend on the content itself. (See XML's encoding attribute which is a major pain, IMHO.)

Your proposal is a case in point: it is non-orthogonal in the sense that both the second and third forms are also valid instances of the first form. If we want to add other "encoding" methods later, we might find that we can't.

Posted by Ziv Caspi at

About an encoding attribute, we can also swap the senses of (what I called) "none" and "literal" so that "XML" is the default ("none") and "escaped" is the encoding of legacy content.

Ziv: +1 on avoiding spec momentum lockin.

Posted by Ken MacLeod at

Ziv, Ken, (or Dare or Joe):

Can somebody write up the appropriate xml schema or relaxNG or dtd for varying the content of what is inside based on an attribute?  And do the same based on an element?

Can somebody write up a regular expression looking for an xml element of a given name vs looking for an attribute?

The essential difference between "my" and "Ken's" proposals boils down to this.  By looking at the tangible implications of the decision, I believe we can come to consensus quickly.

FYI: the reason why I ask the questions above is that I believe I know the answers, and they support the argument for an element.  But please, do the exercise for yourselves and see if you come to the same conclusion.

Sjoerd: it isn't clear to me why one would need an <xml> element inside xml to say that the child of a given node is xml, but if that helps us come to consensus, I'm game.

Posted by Sam Ruby at
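
The regular-expression exercise requested above can be tried concretely. The following is only a rough sketch in Python: the element name `escaped` and the attribute name `mode` are illustrative choices, the extra `class` attribute is made up to show the complication, and neither pattern is a substitute for a real XML parser.

```python
import re

# Hypothetical markup for the two styles under discussion.
element_style = "<content><escaped>&lt;em&gt;foo&lt;/em&gt;</escaped></content>"
attribute_style = '<content class="x" mode="escaped">&lt;em&gt;foo&lt;/em&gt;</content>'

# Element form: the signal is the tag name itself.
ELEMENT_RE = re.compile(r"<escaped>(.*?)</escaped>", re.DOTALL)

# Attribute form: the pattern has to tolerate other attributes, either
# quote style, and optional whitespace around '='.
ATTRIBUTE_RE = re.compile(
    r"""<content\b[^>]*?\bmode\s*=\s*(["'])escaped\1[^>]*>(.*?)</content>""",
    re.DOTALL,
)

element_body = ELEMENT_RE.search(element_style).group(1)
attribute_body = ATTRIBUTE_RE.search(attribute_style).group(2)
print(element_body == attribute_body)
```

Both patterns extract the same payload here; the attribute pattern is simply longer because it has more surface syntax to allow for.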

Why do we need so many ways of expressing content?  It's either textual (in which case what is wrong with CDATA alone?), or non-textual (in which case base64 is acceptable).

Since we can boil it down to two types, it really doesn't need to be that flexible, you can just have <inline> and <encoded> element types.

The rationale that "people don't read specs" is a flimsy one, imho.  The aggregators should throw out malformed content instead of trying to process it.  Tag soup regexps is something we should be avoiding, not finding workarounds for.

Posted by Jim at

If there is no <xml> element, the only thing you can say in your schema is:

<xs:complexType name="content" type="anyXML" />

<xs:complexType name="anyXML">
<xs:sequence minOccurs="0" maxOccurs="unbounded">
<xs:any namespace="##any" processContents="lax" minOccurs="0" maxOccurs="unbounded"/>
</xs:sequence>
</xs:complexType>

AFAIK you cannot say anything about the <escaped> element then.

I still haven't seen a response for the need of base64 encoding. Even a pictureblog doesn't need it. (The image is not going to get from the camera to your entry post form in base64 format.) It's much easier to use other better tools to put your pictures online, and just post the link.

Posted by Sjoerd Visscher at


Sam,
  Is there really a demand to include base64 encoded stuff in content? I'd keep stuff like that out of V1 of Echo/Pie/whatever and keep it as an idea for v-next

Message from Dare Obasanjo at

Ziv, Sjoerd: I've added an <xml> element to my proposal on the wiki.

Sjoerd, Dare: it is my belief that this will be important when we come to the API and archiving portions of the roadmap.  If we don't then this clearly won't survive the v1 cut.  For now, I'm simply content if we accept the premise that the way in which the bytes are going to be expressed isn't necessarily going to remain a binary decision for now and forever.

The more important question to me is attribute vs element.  I claim that it is easier to parse for an element using regular expressions than to scan for an attribute (or a portion thereof, in Dare's proposal).  I also claim that it is easier to create a DTD or schema in which the valid children depend on the name of the parent element instead of some heuristics based on one or more attributes.

Anybody care to support or dispute these claims?

Posted by Sam Ruby at

It is impossible to create a DTD or W3C XML Schema in which the valid children depend on some heuristics based on one or more attributes.

I like attributes much more too, but it is just not an option IMHO.

Posted by Sjoerd Visscher at

Based on the example above, signal elements are only present when the content is escaped. Is this true? If so, then an application cannot determine if an unknown element is an unknown signal element or an unknown content element. This is a problem because unknown signal elements and unknown content elements will be handled differently by many applications.

Posted by Gary Burd at


Sjoerd & Sam,
  It is an unfortunate aspect of working with XML that people decide to limit their XML vocabularies due to the shortsightedness of the W3C XML Schema working group. Quite frankly, I believe attributes work better for describing an element's metadata, as opposed to having it shoved into its content, and I also believe one can write a RELAX NG schema that can describe these constraints. Similarly, a W3C XML Schema annotated with Schematron assertions could also describe these constraints.

I'm going to download Jing and see if I can write a RELAX NG schema for Tim Bray's proposal. If so I'll post it in a few.

PS: I prefer Tim Bray's proposal to mine. I'll probably withdraw mine and +1 his instead.

Message from Dare Obasanjo at

Dare, it is indeed unfortunate. However W3C XML Schema is far more widely supported, so Echo should support it too. The only formats that can reasonably use RelaxNG are meta standards like XSLT and RDF, that have no chance of getting a useful W3C XML Schema. Echo does not fall in that category.

Posted by Sjoerd Visscher at

SAMPLE DOCUMENT:

<root>
<content>foo</content>
<content mode="escaped">&lt;em&gt;foo&lt;/em&gt;</content>
<content mode="base64">PGVtPmZvbzwvZW0+</content>
</root>

Posted by Dare Obasanjo at
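
To make the three forms in the sample document concrete, here is a minimal sketch of how a consumer might decode each one with Python's standard library. The function name `decode_content` is hypothetical, and treating the base64 payload as UTF-8 text is an assumption; the thread does not say what the bytes represent.

```python
import base64
import xml.etree.ElementTree as ET

SAMPLE = """<root>
<content>foo</content>
<content mode="escaped">&lt;em&gt;foo&lt;/em&gt;</content>
<content mode="base64">PGVtPmZvbzwvZW0+</content>
</root>"""

def decode_content(elem):
    """Return the payload of a <content> element as a string."""
    mode = elem.get("mode")  # no attribute means inline text/XML
    text = elem.text or ""
    if mode is None:
        # Inline form: leading text plus any serialized child elements.
        return text + "".join(
            ET.tostring(child, encoding="unicode") for child in elem
        )
    if mode == "escaped":
        # The XML parser has already turned &lt; back into <.
        return text
    if mode == "base64":
        # Assumption: the decoded bytes are UTF-8 text.
        return base64.b64decode(text).decode("utf-8")
    raise ValueError(f"unknown mode: {mode!r}")

decoded = [decode_content(c) for c in ET.fromstring(SAMPLE).findall("content")]
print(decoded)
```

The second and third entries both decode to the same `<em>foo</em>` string; only the first is plain `foo`.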

RELAX NG Schema:

<grammar xmlns="http://relaxng.org/ns/structure/1.0">

  <start>
    <ref name="docRoot"/>
  </start>

  <define name="docRoot">
    <element name="root">
      <zeroOrMore>
        <choice>
          <element name="content">
            <optional>
              <attribute name="mode">
                <choice>
                  <value>escaped</value>
                  <value>base64</value>
                </choice>
              </attribute>
            </optional>
            <text/>
          </element>
          <element name="content">
            <zeroOrMore>
              <element>
                <anyName/>
                <zeroOrMore>
                  <choice>
                    <attribute>
                      <anyName/>
                    </attribute>
                    <text/>
                    <ref name="anyElement"/>
                  </choice>
                </zeroOrMore>
              </element>
            </zeroOrMore>
          </element>
        </choice>
      </zeroOrMore>
    </element>
  </define>

  <define name="anyElement">
    <element>
      <anyName/>
      <zeroOrMore>
        <choice>
          <attribute>
            <anyName/>
          </attribute>
          <text/>
          <ref name="anyElement"/>
        </choice>
      </zeroOrMore>
    </element>
  </define>

</grammar>

Posted by Dare Obasanjo at

XSD Schema:

<xs:schema elementFormDefault="qualified" xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="root">
    <xs:complexType>
      <xs:sequence>
        <xs:element maxOccurs="unbounded" name="content">
          <xs:complexType mixed="true">
            <xs:sequence minOccurs="0">
              <xs:any processContents="skip" maxOccurs="unbounded" />
            </xs:sequence>
            <xs:attribute name="mode" type="modeType" use="optional" />
          </xs:complexType>
        </xs:element>
      </xs:sequence>
    </xs:complexType>
  </xs:element>

  <xs:simpleType name="modeType">
    <xs:restriction base="xs:string">
      <xs:enumeration value="escaped" />
      <xs:enumeration value="base64" />
    </xs:restriction>
  </xs:simpleType>

</xs:schema>

Posted by Dare Obasanjo at


Sjoerd,
  There you go. Schemas for Tim Bray's proposal in RELAX NG and XSD. The RELAX NG schema is stricter than the XSD schema due to limitations of XSD. For instance

<content mode="escaped">

I am notescaped

</content>

is not caught by the XSD schema but should be by the RELAX NG schema.

PS: I'm curious, on what platform are there implementations of XSD validators and none for RELAX NG?

Message from Dare Obasanjo at
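
The constraint that the RELAX NG grammar can express and the XSD cannot (a `content` element that declares a `mode` may hold only text) can also be checked by hand. This is only an illustrative sketch: `violates_mode_rule` is a hypothetical helper, and the `<em>` child in the failing example is made-up markup chosen to trigger the rule.

```python
import xml.etree.ElementTree as ET

def violates_mode_rule(content):
    """True if the element declares a mode but still contains child elements."""
    return content.get("mode") is not None and len(content) > 0

ok = ET.fromstring('<content mode="escaped">&lt;em&gt;foo&lt;/em&gt;</content>')
bad = ET.fromstring('<content mode="escaped">I am <em>not</em> escaped</content>')

results = (violates_mode_rule(ok), violates_mode_rule(bad))
print(results)
```

A check like this would have to live in application code (or a Schematron-style assertion) when only the XSD is used for validation.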

Now Sam has agreed on the <xml> element, the only reason to choose attributes over elements is taste. Attributes look better. I don't think that warrants dropping the most used xml schema language.

Posted by Sjoerd Visscher at

XPath 2.0/XSLT 2.0/XQuery are built on XML Schema. Every new XML technology from Microsoft is built on XML Schema. I work at Q42, where we're building Xopus, an XML Editor. All our customers use XML Schema; we'd like to build support for RelaxNG, but we've had no requests to do so yet. RelaxNG might be the new cool thing, but the corporate world doesn't use it yet.

Posted by Sjoerd Visscher at


Sjoerd,
  I prefer elements to attributes but you are right that with attributes W3C XML Schema cannot describe the content model strictly so we are better off using a content model that can be described strictly with both languages.

sigh

Message from Dare Obasanjo at

I'm sighing with you Dare, but if this is the only problem we're going to have with W3C XML Schema we're lucky.

Posted by Sjoerd Visscher at

I am for attributes, and I think trying to put (X)HTML validation into the Echo Schema is a bad idea to begin with. It will make the Echo format that much more fragile. Are you going to add in validation for 'h' and 'section' elements, which are part of XHTML 2.0? How about SVG and MathML elements, depending on the profile of XHTML chosen? Any schema should be for the Echo parts of the format only.

If someone puts a valid element (for their profile of (X)HTML) into an Echo feed that causes their feed to suddenly be invalid, what are they going to do? I would guess they'd go back to escaping their HTML, the opposite direction we want to be going. I would avoid having the Echo schema try to validate anything but the 'Echo' parts of the format.

Of course, this does raise the question of how to indicate which version of (X)HTML you are stuffing into that 'content' element, so you can get the right Schema or DTD to validate it against.

Posted by joe at


Joe,
  I don't think anyone is asking for XHTML validation in the Echo spec. What gave you that impression? As for worrying about which versions of XHTML are used, I'd suggest that the Echo (we really need a new name) spec should just mandate XHTML 1.0 Transitional for V1 and revisit the issue in V2.

Message from Dare Obasanjo at

Who said anything about (X)HTML validation. The point is that both the schema and example feeds clearly indicate what is going on and what the options are. The schema I wrote was a bit buggy, it should have been:

<xs:complexType name="anyXML" mixed="true">
<xs:sequence minOccurs="0" maxOccurs="unbounded">
<xs:any namespace="##any" processContents="lax" />
</xs:sequence>
</xs:complexType>

That is: a sequence of any element, mixed with text. processContents="lax" means that the elements don't need to be valid, unless there is a declaration available.

So if somebody wants to create a validator that validates echo feeds that may only contain valid XHTML, then he/she can create a new schema, and import the XHTML schema and the Echo schema, and that's it.

Posted by Sjoerd Visscher at

I'm not much of a big-S schema person, but if it's about explicit interoperability with XML Schema, besides allowing anyXML and doing base64/xs:string only at the application level, I'm probably a +1 on it.

I'm not sure regexing is a driving factor, but it would be much less so than XML Schema interoperability.  I tried looking for regex-based RSS parser source for context, but couldn't find one.  I think regex-parsers are still going to have a problem with

  <content><xml><foo><xml>...</xml></foo></xml></content>

Note, of course, we can't use the element name <xml>.  Just a thinko, I'm sure :)

Posted by Ken MacLeod at
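
The nesting problem mentioned above is easy to reproduce. A small sketch in Python, with both patterns written naively on purpose; the second input string (two sibling `<xml>` elements) is an added illustration, not part of the original example.

```python
import re

nested = "<content><xml><foo><xml>...</xml></foo></xml></content>"
siblings = "<xml>a</xml><xml>b</xml>"

# A lazy match stops at the first closing tag, cutting the nested case short.
lazy_nested = re.search(r"<xml>(.*?)</xml>", nested).group(1)

# A greedy match handles this nested case but swallows sibling elements.
greedy_nested = re.search(r"<xml>(.*)</xml>", nested).group(1)
greedy_siblings = re.search(r"<xml>(.*)</xml>", siblings).group(1)

print(lazy_nested)      # <foo><xml>...
print(greedy_nested)    # <foo><xml>...</xml></foo>
print(greedy_siblings)  # a</xml><xml>b
```

Neither quantifier gets both inputs right, because matching balanced tags needs counting that these simple patterns do not do.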

I just withdrew my proposal and cast a vote for Tim Bray's proposal.  I encourage others to do likewise.

Posted by Sam Ruby at

I just put a +1 on Sam's proposal.  Seriously guys, we're down to hairsplitting here. 

Also +1 on leaving base64 out of rev1.

Posted by Tim Bray at

heh.  content  updated with both examples.

Posted by Ken MacLeod at

Sam and Tim: give each other a phone call and make up your minds. It may be hairsplitting, but I don't like "close enough" arguments. Do we choose for one level extra validation or a nice syntax?

Posted by Sjoerd Visscher at

Ken, Sjoerd: please go with the "nice syntax".  Despite the near simultaneity  of Tim and my switches - count the votes.  The result is clear.

Onward.

Posted by Sam Ruby at

base64 support (or other way to include binary data) is important.
There must be a way to upload images and other media using the API, and ideally a way to archive the whole blog including images.  Base64 encoding is the most straightforward solution.

Posted by Misha Dynin at
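
For a sense of what embedding binary data costs, here is a small sketch using Python's standard `base64` module. The 307,200-byte buffer is a synthetic stand-in for an image file, not real data from any weblog.

```python
import base64

# Synthetic stand-in for an image: 300 KiB of arbitrary bytes.
raw = bytes(range(256)) * 1200

encoded = base64.b64encode(raw)        # ASCII-safe, can sit in element content
roundtrip = base64.b64decode(encoded)  # lossless on the way back

overhead = len(encoded) / len(raw)
print(len(raw), len(encoded), round(overhead, 3))
```

Base64 maps every 3 input bytes to 4 output characters, so the encoded payload is about a third larger than the original.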


Misha,
  Sounds like a good enough argument for its inclusion to me and Tim's proposal already contains it.

Message from Dare Obasanjo at

curious.

Re base64 - what's wrong with a stock href, instead of clogging a feed item with a boatload of image data?  The last thing I want coming down the wire with a feed is a few hundred K of encoded image content.

re content encoding - the XML parser I use handles this just fine without me paying any attention to it at all - escaped, not escaped, CDATA - it all gets handed to me just fine by the VisualWorks Smalltalk XML parser.  Gads, I wouldn't want to have to use regex or deal with figuring out the encoding in application level code, and no one else should want to either.

Posted by James Robertson at
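
The parser behaviour described above can be seen with any conforming XML parser; this sketch uses Python's `xml.etree.ElementTree` as a stand-in for the VisualWorks parser mentioned in the comment.

```python
import xml.etree.ElementTree as ET

escaped = ET.fromstring("<content>&lt;em&gt;foo&lt;/em&gt;</content>")
cdata = ET.fromstring("<content><![CDATA[<em>foo</em>]]></content>")
inline = ET.fromstring("<content><em>foo</em></content>")

# Entity-escaped text and CDATA arrive as the same character data.
same_text = escaped.text == cdata.text == "<em>foo</em>"

# Inline markup arrives as child elements instead of text.
inline_children = [child.tag for child in inline]

print(same_text, inline_children)
```

The parser removes the escaping either way, but it cannot say whether the resulting string was meant as markup or as literal text, which is the part the thread is trying to signal.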

James,
  Go back to Misha's original context: "There must be a way to upload images and other media using the API, and ideally a way to archive the whole blog including images. "
  Echo will not only be a base for aggregation but also for an archive format and for an editing API, as such the ability to include images and other media is required for the format when it is used in those contexts.

Posted by joe at

Forward Motion

There has been a great deal of forward motion in the Echo project today. Looks like the discussion about escaping HTML has come to a conclusion. Other areas that have settled seem to be Author and PermLinks. Things are looking very good for Echo,...

Excerpt from BitWorking at

Re base64, the API, and the format - I really, really don't want to see base64 in feeds - at the same time, I see a lot of value in having support for that in the API - I post to my blog that way - I use a URL-encoded form, but the form elements are encrypted (and the encrypted elements placed in base64 for transmission).  I just think it might well be overdoing consistency to worry about this in the feed format.  IMHO, the feed requirements are, in fact, different than the posting API requirements.  I'm completely not sold on worrying about an archive format - so long as a weblog responds properly to API requests, what difference does it make how entries are stored?

Posted by James Robertson at

James,
  "Archive format" had nothing to do with specifying how entries are stored on the server. It has to do with the ability to export and import entries from one weblogging system to another.

Posted by Joe at

One of the applications of "profiles" will likely be specifically to discourage the use of binary large objects in feeds.

Posted by Ken MacLeod at

It appears that the latest consensus is to have one of (content), (content mode="escaped"), and (content mode="base64"). I can certainly live with that.

A couple of notes, though:

1. The name "mode" is overloaded in our business. Eventually it will clash with something else we'd like to call "mode". I suggest we rename it "escaping" instead.

2. The default option (if you don't specify @mode|@escaping) should also have its own tag name, say "xml". So (content) and (content escaping="xml") would be equivalent. This makes it easier for tool builders to refer to the "default" mode of operation in their code consistently (oops, did I say "mode"?)

Posted by Ziv Caspi at
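
The second note above amounts to giving the default a name that code can refer to. A minimal sketch, assuming the suggested attribute name `escaping` and default value `xml`; neither was adopted anywhere in this thread, and `escaping_of` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

DEFAULT_ESCAPING = "xml"  # assumed name for the default, per the suggestion

def escaping_of(content):
    """Return the declared escaping, or the named default when it is absent."""
    return content.get("escaping", DEFAULT_ESCAPING)

implicit = ET.fromstring("<content><em>foo</em></content>")
explicit = ET.fromstring('<content escaping="xml"><em>foo</em></content>')
escaped = ET.fromstring('<content escaping="escaped">&lt;em&gt;foo&lt;/em&gt;</content>')

values = [escaping_of(e) for e in (implicit, explicit, escaped)]
print(values)
```

With a named default, the implicit and explicit forms are indistinguishable to downstream code.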

Escaped HTML discussion. An update on yesterday's position, based on feedback. ... [Sam Ruby]...

Excerpt from André Venter: Dev at

Forward Motion. There has been a great deal of forward motion in the Echo project today. Looks like the discussion about escaping HTML has come to a conclusion. Other areas that have settled seem to be Author and PermLinks. Things are looking very...

Excerpt from André Venter: Dev at

Escaped HTML discussion. An update on yesterday's position, based on feedback. ... [Sam Ruby]...

Excerpt from Keeping track at

I second Ziv's thoughts on using escaping rather than mode.

Posted by Wesley Mason at

Mark Pilgrim changed his RSS feed

Mark has two RSS feeds (both 2.0 flavor) one for his blog and another for comments. removed the funk from his RSS. removed the link element. removed the dc:date element. His dates were off by an hour. removed the rich content. removed the comment...

Excerpt from iBLOGthere4iM at
