David Ascher:
Ampersands, as I’ve mentioned before, are really nice
letters — they have a great typographical history,
they’re wonderfully flexible creative outlets for font
designers, and they’re quite useful to the writer.
However, they sure do get in the way of a lot of code, especially
when it comes to HTML and XML toolchains.
What makes ampersands worse than Unicode is two things:
there is no way to inspect a string and determine a
priori whether or not it is entity encoded, unlike
Punycode and UTF-8, where you have a fighting chance to get it
right.
most consuming software is too forgiving, and will compensate
for lack of appropriate encoding.
An example: the src attribute of
the script element people use to reference the JavaScript
for Flickr.
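To make the src attribute case concrete, here is a minimal Python sketch. The URL is hypothetical (it is not the one from the screenshot), and `html.unescape` only stands in for a forgiving consumer; real browsers follow their own parsing rules.

```python
from html import escape, unescape

# Hypothetical URL with two query parameters (an assumption, not the Flickr URL).
raw_url = "http://example.com/badge.js?count=3&size=s"

# Well-formed markup escapes the ampersand inside the attribute value.
attr_value = escape(raw_url, quote=True)
script_tag = '<script src="{}"></script>'.format(attr_value)
print(script_tag)

# Decoding the escaped attribute recovers the original URL.
decoded_escaped = unescape(attr_value)

# A lenient decoder leaves this particular unescaped URL untouched as well,
# so the missing escape goes unnoticed.
decoded_unescaped = unescape(raw_url)
```

Both decoded values equal `raw_url` here, which is exactly why the unescaped form tends to survive in the wild.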
there is no way to inspect a string and determine a priori whether or not it is entity encoded, unlike Punycode and UTF-8, where you have a fighting chance to get it right.
Software can make an educated guess that if a string has an ampersand preceding a semicolon then it is likely that there are entities in the markup. I use this heuristic in RSS Bandit and it seems to work pretty well.
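The heuristic described above might look roughly like this in Python. This is a sketch of the idea only, not RSS Bandit's actual code; the function name and the sample strings are made up for illustration.

```python
import re

# Loose heuristic: an ampersand followed, somewhere later, by a semicolon.
LOOSE_ENTITY = re.compile(r"&[^;]*;")

def looks_entity_encoded(text):
    """Guess whether text probably contains entity references."""
    return LOOSE_ENTITY.search(text) is not None

# Hypothetical sample strings.
encoded = looks_entity_encoded("Tom &amp; Jerry")      # True
plain = looks_entity_encoded("Tom & Jerry")            # False
fooled = looks_entity_encoded("R&D budget; approved")  # True, a false positive
```

The last case shows the limit of the guess: plain text that happens to contain an ampersand before a semicolon is misclassified as encoded.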
If I recall correctly, RSS Bandit used to reject feeds that are not well-formed XML. Is that still the case? Or are you making educated guesses there now?
If I recall correctly, RSS Bandit used to reject feeds that are not well formed XML. Is that still the case?
Nope, it still only accepts well-formed XML, modulo RFC 3023. I thought you meant a string in isolation, such as determining whether the content of a title element is plain text or HTML, not the specific example in David Ascher’s screenshot.
I was referring to David Ascher’s screenshot. But in any case: different tools selectively applying different standards to different portions of the data with respect to when ill-formedness is acceptable and when it is not, and producing systems that by all outward appearances “seem to work pretty well”, at least most of the time. Well, that pretty much is the textbook definition of insidious, isn’t it?
Even though I did keep a copy of Leonard’s regex around for future banging-on, now you (classically) have two problems.
My favorite example to get bitten by is a typographically challenged company that sells jacks, plugs, and guitar amps: plug& co. Entities don’t necessarily end with a semicolon; they end with a semicolon or with the first character following the ampersand that is not a NAME char in your SGML declaration.
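A rough Python sketch of that SGML-style rule, for comparison with a semicolon-only check. It assumes NAME characters are just ASCII letters and digits; a real SGML declaration may allow more (for example "." and "-"). The function name and sample strings are hypothetical.

```python
import re

# Assumption: a name starts with a letter and continues with letters or digits.
# The reference ends at an optional semicolon or at the first non-name character.
SGML_REF = re.compile(r"&([A-Za-z][A-Za-z0-9]*)(;?)")

def sgml_style_references(text):
    """Return (name, had_semicolon) pairs for each SGML-style reference."""
    return [(m.group(1), m.group(2) == ";") for m in SGML_REF.finditer(text)]

# Hypothetical sample strings.
closed = sgml_style_references("jacks &amp; plugs")  # [("amp", True)]
unclosed = sgml_style_references("plug&co amps")     # [("co", False)]
spaced = sgml_style_references("plug& co")           # []
```

Under this reading "plug&co amps" contains a reference to an entity named `co` even though no semicolon appears anywhere, which a semicolon-based pattern would never notice.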
Sam reads his trackbacks — for interesting reactions on my last post, see his comment and the comments on his comment. I haven’t even mentioned that my blog title (as defined in the blogging software I’m using) used to be david ascher’s...
Wayne Burkett : Sam Ruby: Ampersands are Insidious - Including a better regex for matching entity encoded strings than the one I used in del.icio.us.pl....