Sam Ruby

Cataloging Venial Sins

2009-02-24T11:46:11-05:00

rubys: So do you agree with the position that, in the face of a specification with defined error handling, there should be no MUST-level document conformance criteria?

Rob Sayre noticed a related issue. Browsers are motivated to display something for every response they get, and HTML 5 endeavors to make all browsers consistent in their behavior.

For the moment, let’s ignore Gödel and let’s ignore bugs, and let’s look at some specific cases.

If every conforming implementation of HTML5 processed documents which start with a UTF-32 LE BOM followed by a WIN-1252 encoded document identically (i.e., there are ZERO interop differences), does it make sense to call such documents non-conforming in an RFC 2119 sense of the word? A less dramatic example would be a date with a value of mañana.
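
To make the first case concrete, here is a minimal Python sketch of a document whose byte order mark disagrees with its actual encoding. It is only an illustration under stated assumptions: the helper names `sniff_bom` and `bom_matches_body` are hypothetical, and the code makes no claim about what the HTML5 draft requires a browser to do with such input.

```python
import codecs

# Longer BOMs are listed first: the UTF-32 LE BOM (FF FE 00 00) begins
# with the same two bytes as the UTF-16 LE BOM (FF FE), so order matters.
BOMS = (
    ("utf-32-le", codecs.BOM_UTF32_LE),
    ("utf-32-be", codecs.BOM_UTF32_BE),
    ("utf-8", codecs.BOM_UTF8),
    ("utf-16-le", codecs.BOM_UTF16_LE),
    ("utf-16-be", codecs.BOM_UTF16_BE),
)


def sniff_bom(data):
    """Return (encoding label, BOM length), or (None, 0) if no BOM is found."""
    for label, bom in BOMS:
        if data.startswith(bom):
            return label, len(bom)
    return None, 0


def bom_matches_body(data):
    """True if the bytes after the BOM decode cleanly in the encoding it announces."""
    label, skip = sniff_bom(data)
    if label is None:
        return True  # nothing announced, so nothing to contradict
    try:
        data[skip:].decode(label)
        return True
    except UnicodeDecodeError:
        return False


# A UTF-32 LE BOM followed by a body that is really Windows-1252.
mislabeled = codecs.BOM_UTF32_LE + "<p>mañana</p>".encode("cp1252")
# A document whose BOM and body agree.
consistent = codecs.BOM_UTF8 + "<p>mañana</p>".encode("utf-8")
```

A checker built along these lines can report the mismatch even if every consumer recovers from it in the same way, which is exactly the tension the question is probing.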

On the other hand, declaring such documents as conformant would be completely counter-productive.

And then there is the truly sublime.

Perhaps an argument could still be made that SHOULD NOT be used, but that case is weakened significantly by allowing style attributes.

2009-02-24T12:37:12-05:00

There is a subtle (but real) distinction between consistent error-handling, and documents behaving “as intended”.

Whatever error-handling you specify for non-conforming dates (to pick one of your examples), a client program is unlikely to be able to extract a date of “mañana” and use it to add an item to the user’s Calendar.
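
As a rough illustration of that point, the sketch below shows a client that recovers consistently from a bad date value and still ends up with nothing usable. The function name `parse_machine_date` is hypothetical, and ISO 8601 calendar dates are assumed as the machine-readable form purely for the sake of the example.

```python
from datetime import date


def parse_machine_date(value):
    """Return a date for an ISO 8601 calendar date string, or None otherwise."""
    try:
        return date.fromisoformat(value)
    except ValueError:
        # The error handling is fully defined and consistent, but there is
        # still no date to hand to a calendar application.
        return None


usable = parse_machine_date("2009-02-24")
unusable = parse_machine_date("mañana")
```

Every client running this code agrees on the outcome for “mañana”; the agreement is simply that no calendar entry can be created.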

Similarly, if (for example) SVG-in-HTML5 does not make it into the final Spec, then included in an HTML5 document would presumably not be rendered (though it would be handled consistently across browsers). It would, therefore, make sense to flag it as non-conforming (as you would ).

2009-02-24T12:43:46-05:00

Ah, I was remiss in not mentioning Conformance depends on author’s intent as an issue.

2009-02-24T12:48:16-05:00

“defined error handling” = parsers but “MUST-level Document conformance” = authors.

My written language in English is quite loose, though everyone mostly understands it and processes it. Cool. That doesn’t mean that English grammar and spelling rules should be abandoned.

2009-02-24T12:50:35-05:00

Ah, yes. “As intended” is probably a poor way to phrase it. “Interoperably” better describes the above examples.

There is a subtle (but real) distinction between consistent error-handling, and documents behaving “interoperably”.

2009-02-24T13:26:55-05:00

Karl: But if everybody mostly understands and processes font why should it be illegal, particularly if style attributes are legal?

Jacques: But if everybody processes WIN-1252 data marked as UTF-32 LE identically, is there an interoperability issue? Perhaps intent is the right criterion after all?

2009-02-24T16:18:43-05:00

Just remembered a discussion we had in the QA Working Group about the use of IETF RFC 2119. The RFC 2119 words should usually be used in a very precise context indeed. Section 6, “Guidance in the use of these Imperatives”, is clear about this:

Imperatives of the type defined in this memo must be used with care and sparingly. In particular, they MUST only be used where it is actually required for interoperation or to limit behavior which has potential for causing harm (e.g., limiting retransmisssions) For example, they must not be used to try to impose a particular method on implementors where the method is not required for interoperability.

Unfortunately RFC 2119 falls into its own trap by using MUST in the prose of this paragraph. If we consider that interoperation is only about software, we can indeed remove assertions such as a BLOCKQUOTE element must contain a citation.

There is not only one way of defining conformance for a specification. In fact, you can choose whatever model you want as long as it is explained and consistent with what you would like to achieve. Ian has added another conformance criterion for the prose of the specification.

Requirements phrased in the imperative as part of algorithms (such as “strip any leading space characters” or “return false and abort these steps") are to be interpreted with the meaning of the key word ("must”, “should”, “may”, etc) used in introducing the algorithm.

Unfortunately, there is a missing conformance form in the specification. Take, for example, the blockquote element. First we get an explanation:

The blockquote element represents a section that is quoted from another source.

Then a requirement:

Content inside a blockquote MUST be quoted from another source, whose address, if it has one, SHOULD be cited in the cite attribute.

This could perfectly well be reworded, and still be conformant, as:

[Doc][M] Content inside a blockquote is a quote from another source.
[Doc][O] Authors are encouraged to write the cite attribute if the source reference is known and expressible under the form of a URI.

Then later there is:

If the cite attribute is present, it MUST be a valid URL.

Which is good because it is about interoperation for software using the value of the cite attribute, though I would rephrase it as:

The cite attribute value MUST be a valid URI.

etc. We would just have to declare in the conformance section

Prose flagged with “[Doc]” is a requirement for Document Conformance.
Mandatory is marked “[M]” and Optional is marked “[O]”.
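
One way to see that such a labelling scheme is workable is that a tool can pick the requirements out of the prose mechanically. The following is a small Python sketch under that assumption; the function name `parse_requirement` and the exact label syntax it accepts are illustrative only and are not taken from any specification.

```python
import re

# Matches lines such as "[Doc][M] Some requirement text."
# The first bracket names the class of product, the second the level.
LABELLED = re.compile(r"\[(\w+)\]\[([MO])\]\s+(.*)")


def parse_requirement(line):
    """Return (product_class, is_mandatory, text), or None for unlabelled prose."""
    match = LABELLED.fullmatch(line.strip())
    if match is None:
        return None
    product_class, level, text = match.groups()
    return product_class, level == "M", text


mandatory = parse_requirement(
    "[Doc][M] Content inside a blockquote is a quote from another source."
)
plain = parse_requirement(
    "The blockquote element represents a section that is quoted from another source."
)
```

The point of the sketch is only that the requirement level travels with the label rather than with a MUST or SHOULD in the sentence itself.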

What I meant is that it is perfectly feasible to create a class of conformance which does not use the RFC 2119 words but makes sense for a specific class of products, be it authors or knowledge systems or whatever needs to follow the requirements. From the QA Framework, Specification Guidelines:

Use a consistent style for conformance requirements and explain how to distinguish them.

2009-02-24T16:27:02-05:00

About the specific font element, have all issues (internationalization and accessibility) which existed in the past been solved?

2009-02-24T18:17:10-05:00

Nice summary of the heart of the problems with HTML5. If you specify “error” handling that precisely, it’s no longer on the outside; it’s specified behaviour. You are specifying them as conformant, just using misleading terminology.

The hard part gets to the point where you get errors outside of those well-defined bounds; what happens then?

2009-02-24T18:49:21-05:00

The hard part gets to the point where you get errors outside of those well-defined bounds; what happens then?

You don’t.

There’s a deterministic process for turning an arbitrary sequence of bytes into a DOM.

... just using misleading terminology.

I don’t think it’s misleading at all. The vast majority of random byte sequences will lead to garbage as the output DOM. It’s pretty clear-cut to call such byte sequences “non-conforming.”

The edge cases involve DOMs which are not complete garbage, but which are not really interoperable either (e.g., which contain a date value of "mañana").

@Sam:

Perhaps your Windows-1252 example is one where there are no interoperability issues. I don’t know enough to judge.

2009-02-24T18:56:28-05:00

Jacques, here’s a better example then:

foo

Such a sequence in no way produced garbage in a DOM, yet is something that (arguably) should be flagged and discouraged.

I do believe in well specified error recovery, at least in the scope of HTML. But for the class of errors that should be flagged, we should be able to answer the obvious question: why? And in general, RFC 2119 isn’t something that helps in this situation. Somehow we need to find a way to expand our vocabulary.

2009-02-24T19:04:41-05:00

The vast majority of random byte sequences will lead to garbage as the output DOM.

Well, there’s more than one phase. To get a DOM, you’ll need to convert to unicode at some point.

Do the vast majority of random unicode character sequences lead to garbage DOMs? Well, I bet a lot of them will be untitled documents consisting of a single text node, though long enough random sequences will probably encounter a tag at some point.

2009-02-24T21:34:31-05:00

Jacques, here’s a better example then: foo

Right. I don’t know how to exclude that on interoperability grounds. Though it seems likely that the DOM that is produced either

a) is not the one that the author intended or
b) is the one that the author intended, but the author mistakenly transposed several bytes in producing the above string.

I think it is reasonable to assert that authors are not expected to know the HTML5 parsing algorithm, hence they should not be relying on that algorithm’s error correction to produce the desired result.

I’m not sure what (other than mistyping, or a malicious knowledge of the HTML5 parsing algorithm) would cause an author to produce that particular example. But I can easily see an author typing

foo bar baz

with the intention that the first word be bold, the second bold-italic, and the third italic. The actual result would then come as something of a surprise.
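
The markup in the example did not survive in this copy of the thread, so the sketch below assumes a plausible reconstruction: `<b>foo <i>bar</b> baz</i>`. It shows how a checker could flag the misnesting without knowing anything about the HTML5 recovery algorithm. The class name `NestingLint` is hypothetical, and Python's `html.parser` is used only as a tokenizer; it does not implement HTML5 parsing.

```python
from html.parser import HTMLParser


class NestingLint(HTMLParser):
    """Record end tags that do not close the innermost open element."""

    def __init__(self):
        super().__init__()
        self.stack = []      # currently open elements, innermost last
        self.problems = []   # (closing tag, innermost open tag) pairs

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()
            return
        innermost = self.stack[-1] if self.stack else None
        self.problems.append((tag, innermost))
        if tag in self.stack:
            # Drop the element anyway so later tags are judged fairly.
            self.stack.remove(tag)


def lint(markup):
    checker = NestingLint()
    checker.feed(markup)
    checker.close()
    return checker.problems


# Assumed reconstruction of the example: bold, bold-italic, italic.
misnested = lint("<b>foo <i>bar</b> baz</i>")
well_nested = lint("<b>foo <i>bar</i></b><i> baz</i>")
```

A message derived from `misnested` tells the author what went wrong in terms of what was typed, rather than in terms of the parser's repair.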

2009-02-24T21:34:46-05:00

Given infinite time and resources, random sequences would eventually produce not just a tag, but web pages and eventually the entire contents of the Internet. I am a rather simple fellow and not too bright author of web pages. I will do whatever a validator will tell me to do to a page and hope that once a page is valid at some fixed point in finite time, it will at least degrade reasonably well when something changes. As when a server I do not control shifts from iso-8859-1 encoding to utf-8 encoding for HTML in the HTTP header. My “must” is whatever the validator says to do.

2009-02-24T22:05:53-05:00

To recover something you need to know what you want to recover and how, in other words a reference. The spec is being written on an assumption of what is right. This assumption comes from years of other specs and implementations. It is not free will and random.

2009-02-24T22:18:37-05:00

The spec is being written on an assumption of what is right.

My blog has a profile attribute. Seems to work ok.

2009-02-25T05:30:29-05:00

I think it isn’t particularly useful to approach this from the definition given in RFC 2119 when that RFC wasn’t written for application in a context like this.

I think it’s more useful to consider what (if any) machine-checkable authoring requirements one would want HTML 5 to make such that conforming validators flag them as errors and to consider what non-machine-checkable requirements one would want HTML 5 to make about authoring.

If the answer is that one would like a non-empty set of requirements on authoring and if it happens that there’s a cowpath of applying RFC 2119 terms to expressing such requirements, I think it would be more useful to pave the cowpath as RFC 2119-5 than to conclude that such requirements have to be expressed without using RFC 2119 language just because RFC 2119 is rigid.

Doing the latter would be similar to registering application/* types for textual formats for use on the Web just because an old RFC makes SMTP-oriented stipulations about text/* and those stipulations make no practical sense in the HTTP context (or even on file systems).

2009-02-25T05:40:24-05:00

@rob: I have an electrical plug under my eyes, on the table. It seems to work. Seems to work for whom and for what? Define your class of products. See QA Framework: Specification Guidelines.

@henri: exactly. Define your requirements formalism in a way that it is useful for your technology.

2009-02-25T06:55:50-05:00

I think it’s more useful to consider what (if any) machine-checkable authoring requirements one would want HTML 5 to make such that conforming validators flag them as errors and to consider what non-machine-checkable requirements one would want HTML 5 to make about authoring.

Agreed, though I would like to reserve judgment for the moment on what constitutes an error and what constitutes a warning for reasons similar to your position on badges.

If the answer is that one would like a non-empty set of requirements on authoring and if it happens that there’s a cowpath of applying RFC 2119 terms to expressing such requirements, I think it would be more useful to pave the cowpath as RFC 2119-5 than to conclude that such requirements have to be expressed without using RFC 2119 language just because RFC 2119 is rigid.

Dates in HTML5 appear to conform to RFC-3339 (and therefore W3CDTF and ISO-8601) but are not defined in those terms. Similarly, it makes sense for Ian’s draft to use the common English words must, should, and may, but it is confusing, counterproductive, and incorrect for that draft to define the use of those terms with a reference to RFC 2119.

To the extent I understand it, Rob’s document intends to follow the definition of RFC 2119, so such a reference would be appropriate there.

This difference in approach means that Rob and Ian will likely end up in different places. Several possibilities exist, including: Rob’s document has no value, Rob’s document is a logical stepping stone while we await the remainder of what is envisioned for HTML5, and Rob’s document is incomplete without a BCP or an Authoring Guide.

2009-02-26T03:39:11-05:00

Dates in HTML5 appear to conform to RFC-3339

Don’t they allow times without dates? I think I noticed that while I was deleting the section.

2009-02-26T23:07:20-05:00

Sam Ruby, I would like to keep up on the evolution of HTML5, but I find that I’m always a bit short of the time that seems to be needed. I usually lack the time to read through the irc-logs/whatwg. I’m trying to keep up on this subject by reading your blog, Anne van Kesteren’s blog, and, once in a while, the blogs of some of the people who post comments here.

For someone like me, who is having trouble keeping up, your intent, regarding HTML5, is sometimes hard to discern. A remark such as “And then there is the truly sublime” is fairly cryptic, even after I clicked through and read perhaps half the conversation.

I don’t mean any of this as criticism, I’m merely saying, for an outsider such as myself, it is hard to keep up. You are good about wrapping your certainties inside of several layers of qualifiers. I appreciate how careful you are being, but I also find those qualifiers leave me guessing as to your real meaning. For instance, I’m not sure of the point you are making here:

“Perhaps an argument could still be made that SHOULD NOT be used, but that case is weakened significantly by allowing style attributes.”

Is it your overall feeling, then, that HTML5 should allow as much as possible, and merely give mild warnings over things that are regarded as bad behavior?

2009-02-27T07:14:08-05:00

Criticism welcome. My primary focus at the moment is not to provide an overall status of the state of HTML5, but rather to get some level of order in the working group.

The two statements that I pointed to by hsivonen are a clear misapplication of RFC 2119.

HTML5 conformance checkers should give clear direction, not mild warnings. That’s not the issue. The issue is that reasonable people can disagree on what constitutes “bad behavior”. HTML5 takes a strong stand on font, but allows the exact same thing to be expressed as style attributes. Example. Such a position is clearly inconsistent. The very same hsivonen made this same point.

Similar situations exist for other attributes.