It’s just data

Planet Musings

Jacques Distler: Introducing my very own Physics-oriented “river of news”: Planet Musings. If you’d rather view it in your feedreader, there’s an Atom feed, too.

I like environments where contributions are cumulative.  The Universal Feed Parser does a fair amount of data cleansing.  BeautifulSoup does quite a bit more.  All I had left to do was solve the “last mile” problem.  And Planet pulls it all together nicely.

My contributions have also been cumulative in that they have been eagerly adopted by the authors of the relevant code-bases.

On a related note, I’m pleased to see that Mark Pilgrim is considering Ubuntu.  I’m confident that his contributions to that community will also be cumulative, whether they be bug reports, documentation, or code.  Mark attributes his switch to a desire for Freedom.  Freedom is clearly important, but I don’t believe that it is the whole story.

From a W3C perspective Planet Musings is nearly valid.  The border attribute error is, in theory, solvable.  The alt attributes need to be added at the source.

But compliance with the relevant W3C specifications doesn’t tell the whole story.  From a user perspective, there are a number of errors: strange ampersands, hash signs, numbers, and semicolons.  The root cause can be traced to a product that is nominally “Free”, but where the contributions that numerous people have tried to make have not turned out to be cumulative.


Both the border (which is present when it shouldn’t be) and the alt attributes (which are absent when they should be present) are in the syndicated feeds, not the Planet-Musings templates.

One could decide that <img border="..."> should have the border attribute stripped, and a blank alt attribute added: <img alt="">.

That would make Planet-Musings 100% valid, according to the XHTML+MathML DTD, but would not solve anyone’s real-world problems.

I’m still wrapping my head around the idea that Beautiful-Soup + the Universal-Feed-Parser can take a dog’s breakfast of input and turn it into well-formed XHTML+MathML, with relatively little loss of fidelity.

That opens up quite some possibilities... (Maybe there’s hope for XML Barbie, yet.)

Posted by Jacques Distler at

<img border="0"> should become <img style="border: none">, but I would agree that this is relatively low priority.

Creating empty alt attributes is just... wrong.  That needs to be fixed at the source.
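As a rough illustration of the two points above, here is a minimal sketch using only Python's standard library, not Planet's actual code. It assumes the entry content is already a well-formed XHTML fragment; the function name `tidy_images` is made up for this example. It rewrites border="0" as an inline style, and it only reports images that lack alt rather than inventing a value.

```python
import xml.etree.ElementTree as ET


def tidy_images(xhtml_fragment):
    """Rewrite border="0" on <img> as an inline style; list images lacking alt.

    Hypothetical helper for illustration; assumes well-formed XHTML input.
    """
    root = ET.fromstring(xhtml_fragment)
    missing_alt = []
    for element in root.iter():
        # Accept both un-namespaced and namespaced ("{uri}img") tag names.
        if not (element.tag == "img" or element.tag.endswith("}img")):
            continue
        if element.get("border") == "0":
            # Keep the author's intent (no border) in a form the DTD accepts.
            del element.attrib["border"]
            existing = element.get("style", "").strip().rstrip(";")
            prefix = existing + "; " if existing else ""
            element.set("style", prefix + "border: none")
        if "alt" not in element.attrib:
            # Do not fabricate alt text; just record where it is missing.
            missing_alt.append(element.get("src", "(no src)"))
    return ET.tostring(root, encoding="unicode"), missing_alt
```

The list of sources with missing alt attributes could then be reported back to the feed's author, which is where that fix belongs.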

And, in case the “relatively little loss of fidelity” comment is a reference to the characters I mentioned, these problems are due to the UFP attempting to faithfully convey the intentions that are unambiguously (and incorrectly) recorded in those feeds.  That, too, should be fixed at the source.

As to some of the possibilities, watch this space.  ;-)  I’ve only tapped into a small portion of the possibilities of BeautifulSoup.

Posted by Sam Ruby at

[T]he “relatively little loss of fidelity” comment is a reference to the characters I mentioned, these problems are due to the UFP attempting to faithfully convey the intentions that are unambiguously (and incorrectly) recorded in those feeds.

All the double-encoded entities are from the Atom 0.3 feeds (one particular feed, in fact, but the others probably share the same defect) at Wordpress.com. I could try switching to the corresponding RSS 2.0 feeds. (Perhaps I was mistaken in assuming that the Atom 0.3 feeds would have a better shot at fidelity to the author’s intentions.)

But, expecting the Atom 0.3 feeds to get “fixed” is doubtless a losing proposition.
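For readers wondering what the "strange ampersands, hash signs, numbers, and semicolons" look like at the data level, here is a small sketch. It is a heuristic under stated assumptions, not something the Universal Feed Parser does: the names `DOUBLE_ENCODED`, `looks_double_encoded`, and `undo_one_layer` are invented for this example, and the pattern will misfire on text that legitimately discusses entities (a post that really means to display "&amp;amp;", say).

```python
import html
import re

# An ampersand that was escaped a second time, e.g. "&amp;#8217;" or "&amp;quot;".
DOUBLE_ENCODED = re.compile(r"&amp;(#\d+|#x[0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);")


def looks_double_encoded(text):
    """Return True if the text contains an entity that was escaped twice."""
    return bool(DOUBLE_ENCODED.search(text))


def undo_one_layer(text):
    """Strip only the extra layer of escaping, leaving single entities alone."""
    return DOUBLE_ENCODED.sub(r"&\1;", text)


if __name__ == "__main__":
    broken = "It&amp;#8217;s just data"
    repaired = undo_one_layer(broken)   # "It&#8217;s just data"
    print(html.unescape(repaired))      # the entity becomes a real apostrophe
```

Guessing like this on the consuming side is exactly the kind of workaround that hides the defect in the feed, which is why switching feeds or fixing the source is the cleaner option.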

Posted by Jacques Distler at

I wrote:

I could try switching to the corresponding RSS 2.0 feeds.

... which does, indeed, seem to “fix” the problem.

Posted by Jacques Distler at

<img border="0"> should become <img style="border: none">, but I would agree that this is relatively low priority.

Wouldn’t it just be easier to use XHTML 1.0? I would think source feeds are always more likely to use things like the border attribute and font elements rather than styles since the former are better supported in client apps. Converting every single deprecated HTML element and attribute into an equivalent style seems an awful lot of work. What benefit do you get from XHTML 1.1 or is this just a matter of personal preference?

Posted by James Holderness at

Wouldn’t it just be easier to use XHTML 1.0?

If I change my planet from xhtml-math11-f.dtd to simply mathml2.dtd, I go from 45 reported errors to 1887 reported errors.

Posted by Sam Ruby at

Wouldn’t it just be easier to use XHTML 1.0?

If I change my planet from xhtml-math11-f.dtd to simply mathml2.dtd, I go from 45 reported errors to 1887 reported errors.

So I guess that would be a ‘no’. ;)

Posted by James Holderness at

<img border="0"> should become <img style="border: none">, but I would agree that this is relatively low priority.

What would be the tangible benefit?

The easy way to deal with DTD problems in application/xhtml+xml is to get rid of the doctype. The easy way to deal with DTD problems in text/html is to use the HTML5 doctype.

Posted by Henri Sivonen at

What would be the tangible benefit?

Very little; which is why it is a low priority.

The benefit to making changes like this, which stop the Validator from complaining (even though they do not affect the rendering), is that they more easily allow you to hunt down the errors that do affect the rendering.

In the particular case of Planet-Musings, the total number of Validation errors is low enough that this really doesn’t matter (whatever rendering problems there are, are not due to invalid XHTML).

But try Planet-Intertwingly’s top 100.

Posted by Jacques Distler at

But try Planet-Intertwingly’s top 100.

Someone ought to write a Greasemonkey script for the W3C validator that allowed you to hide/show all instances of a particular type of error.  Each error is inside an [li class="msg_err"] or [li class="msg_warn"].  Each error message is inside a [span class="msg"].  Each error ID is exposed as a query parameter (always errmsg_id, even for warnings) on the feedback link, which is the last [a] within the [li] (errmsg_id=183, for example).  That should be enough to get somebody started.
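This is not the Greasemonkey script itself, but the grouping logic it would need can be sketched offline in Python against a saved copy of the validator's output. The sketch relies only on the markup described above (li.msg_err / li.msg_warn, span.msg, and errmsg_id on the last link in each item). The class name `ValidatorMessages` is invented for this example, and it assumes that message spans contain no nested spans and that list items are not nested.

```python
import re
from collections import defaultdict
from html.parser import HTMLParser


class ValidatorMessages(HTMLParser):
    """Group validator messages by the errmsg_id on each item's last link."""

    def __init__(self):
        super().__init__()
        self.by_id = defaultdict(list)
        self._in_item = False    # inside li.msg_err or li.msg_warn
        self._in_msg = False     # inside span.msg
        self._text = []
        self._last_href = ""

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "li" and ("msg_err" in classes or "msg_warn" in classes):
            self._in_item, self._text, self._last_href = True, [], ""
        elif self._in_item and tag == "span" and "msg" in classes:
            self._in_msg = True
        elif self._in_item and tag == "a" and attrs.get("href"):
            # Keep overwriting, so the last link in the item wins.
            self._last_href = attrs["href"]

    def handle_data(self, data):
        if self._in_msg:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._in_msg:
            self._in_msg = False
        elif tag == "li" and self._in_item:
            match = re.search(r"errmsg_id=(\d+)", self._last_href)
            error_id = match.group(1) if match else "unknown"
            self.by_id[error_id].append("".join(self._text).strip())
            self._in_item = False
```

Feeding the saved page to `ValidatorMessages().feed(...)` leaves a dictionary keyed by error ID in `by_id`, so all instances of one error type can be counted or filtered out together. A browser script would do the same grouping and then toggle the display of the matching list items.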

Posted by Mark at

The benefit to making changes like this, which stop the Validator from complaining (even though they do not affect the rendering), is that they more easily allow you to hunt down the errors that do affect the rendering.

I have an alternative solution. My post-DTD Web two-point-ohey participation age validator allows the user to supply his own schema (without contaminating the document with schema-specific incantations). Hence, if the user of the validation service makes an educated decision to ignore certain errors with a preset schema, he can make his own copy of the schema and edit it to make select errors go away.

Yes, I know I should supply presets for XHTML+MathML, XHTML+SVG and XHTML+MathML+SVG. They are on my to-do list. A while ago I had almost taken care of those, but I was blocked on legal issues. The legal issues are now resolved, so I should just get around to it.

Posted by Henri Sivonen at
