It’s just data

Trust, but verify some more

Last week, I created a nightly job to verify that my inputs are clean, well-formed XML.  That took care of my inputs, but it didn't verify the process by which the web pages were created.

I've since added some code to verify that each of the pages in the cache of pages served in the past 24 hours is well-formed and valid XHTML.  This uncovered an interesting boundary case that I hadn't considered.

Specifically, this blog entry.  Notice that the title has two consecutive dashes in it.  Seem innocuous?  Well, the title is repeated in the trackback metadata, and the trackback metadata is contained in an XML comment, and consecutive dashes are illegal in the body of an XML comment.
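
To make the boundary case concrete, here is a minimal sketch in Python, not the code that actually runs on this site.  It checks well-formedness only (not validity) using the standard library's expat-based parser, and the function name and the sample titles are made up for illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(document: str) -> bool:
    """Return True if the parser accepts the text as well-formed XML.

    This checks well-formedness only; validating against the XHTML DTDs
    would need a validating parser, which this sketch does not use.
    """
    try:
        ET.fromstring(document)
    except ET.ParseError:
        return False
    return True


# Hypothetical page fragments: the only difference is the comment body.
ok_page = "<div><!-- title: Trust, but verify --></div>"
bad_page = "<div><!-- title: Trust -- but verify --></div>"

print(is_well_formed(ok_page))
print(is_well_formed(bad_page))
```

The second fragment is rejected solely because of the two consecutive dashes inside the comment.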

Unfortunately, since the W3C validator doesn't allow trackback metadata to be directly nested in the XHTML, I will continue to place this information inside a comment.  So, whenever a title contains consecutive dashes, I now replace the dashes with numeric character references.
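
One way such a replacement might look, sketched in Python under the assumption that every dash in a run of two or more becomes the numeric character reference for the hyphen (&#45;).  The function name and the sample title are hypothetical, and the real defensive code may differ.

```python
import re


def escape_dash_runs(title: str) -> str:
    """Replace each dash in a run of two or more with '&#45;'.

    A lone dash is legal inside an XML comment, so it is left alone.
    """
    return re.sub(r"-{2,}", lambda m: "&#45;" * len(m.group()), title)


# Hypothetical title, made up for this example.
safe_title = escape_dash_runs("Trust -- but verify")
print(safe_title)
```

Replacing the whole run at once, rather than each pair, means a run of three dashes does not leave a stray dash next to the escaped ones.  A title that ends in a single dash would still need attention, since the comment body must not run straight into the closing "-->" with a dash.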

I started using — instead of two hyphens. Besides being valid in more situations, it is probably a more proper use of punctuation in general (and hey, maybe less ambiguous semantically).

Posted by Jay Fienberg at

Although your content is being served with the correct MIME type, you are still sending out:

<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1"/>

Why not change that to

<meta http-equiv="Content-Type" content="application/xhtml+xml; charset=ISO-8859-1"/>

Posted by Basil Crow at

Jay, good suggestion - for the future.  Meanwhile, I now have some defensive code in place.

Basil, good catch.  But I'm concerned that such a meta tag would cause IE to throw up a hairball or something.  In any case, I see no need for that particular meta tag in this instance, so I am removing it.  Given the way I cache pages, it will be a few days before this ripples through all of my entries.

Posted by Sam Ruby at

From a W3C note about Media types:


Note that a meta http-equiv statement will not be recognized by XML processors, and authors SHOULD NOT include such a statement in an XHTML document served as 'application/xml' (and 'application/xhtml+xml' as well for that matter).


Posted by Anne at

Is my weblog well formed?

Can I ever be sure?... [more]

Trackback from Sam Ruby


Jay:  started using &mdash; instead of two hyphens.

Sam: Jay, good suggestion - for the future.  Meanwhile, I now have some defensive code in place.

Can &mdash; be allowed in an XML document without being defined in the DTD? (Are you planning to include DTD definitions for all such entities in your feed?) I believe the NCR form is the right way to go!

Posted by kg at

kg: No, (no)

The way I represent "—" in my feed is with "&#8212;".
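
For anyone who wants to do the same, a minimal Python sketch: the built-in "xmlcharrefreplace" error handler turns every non-ASCII character into a decimal character reference.  The function name is made up, and this is an illustration rather than the feed code itself.

```python
def to_ncr(text: str) -> str:
    """Rewrite non-ASCII characters as decimal numeric character references.

    ASCII characters pass through untouched, so markup characters such as
    '&' and '<' still need their own escaping elsewhere.
    """
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


# U+2014 is the em dash; no entity declaration is needed to read the result.
print(to_ncr("\u2014"))  # prints "&#8212;"
```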

Posted by Sam Ruby at

Sam Ruby

I validate my XHTML not with the W3C validator, but by using a validating XHTML parser (libxml2) with a locally cached set of XHTML DTDs....

Excerpt from phil ringnalda dot com: New MT plugin author: Comments at
