David Ascher: Ampersands, as I’ve mentioned before, are really nice letters — they have a great typographical history, they’re wonderfully flexible creative outlets for font designers, and they’re quite useful to the writer. However, they sure do get in the way of a lot of code, especially when it comes to HTML and XML toolchains.
What makes ampersands worse than Unicode is two things:
An example: the src attribute of the script element people use to reference the javascript for Flickr.
There is no way to inspect a string and determine a priori whether or not it is entity encoded, unlike punycode and UTF-8, where you have a fighting chance to get it right.
Software can make an educated guess that if a string has an ampersand preceding a semicolon then it is likely that there are entities in the markup. I use this heuristic in RSS Bandit and it seems to work pretty well.
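For anyone who wants to try that guess, here is a minimal Python sketch. It is not the RSS Bandit code; the function name and the exact pattern (ampersand, a short run without whitespace or markup characters, then a semicolon) are my own assumptions about what "an ampersand preceding a semicolon" could look like in practice.

```python
import re

# Assumed pattern: '&', then 1-32 characters that are not whitespace,
# '&', ';', '<' or '>', then ';'. This is a guess, not a parser.
_ENTITY_LIKE = re.compile(r"&[^\s&;<>]{1,32};")

def looks_entity_encoded(text: str) -> bool:
    """Return True if the string contains something shaped like an entity."""
    return _ENTITY_LIKE.search(text) is not None

if __name__ == "__main__":
    for sample in ("Tom &amp; Jerry", "AT&T", "fish & chips; tea", "R&D; more"):
        print(repr(sample), looks_entity_encoded(sample))
```

The last sample shows the limit of the guess: "R&D; more" is plain text, yet it is shaped like an entity reference and gets flagged.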
If I recall correctly, RSS Bandit used to reject feeds that are not well formed XML. Is that still the case?
Nope, still only accept well-formed XML, modulo RFC 3023. I thought you meant a string in isolation, such as determining whether the content of a title element is plain text or HTML, not the specific example in David Ascher’s screenshot.
Here’s a nice educated guess:
s/&(?!#?[xX]?(?:[0-9a-fA-F]+|\w{1,8});)/&amp;/g
I saw this at Leonard Lin’s blog a while ago and thought it was worth sharing.
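A rough Python equivalent of that substitution, for anyone not working in Perl or sed. The pattern is copied as is; the function name is made up for illustration, and I am assuming the replacement text is meant to be `&amp;`.

```python
import re

# Same lookahead as the substitution above: match '&' only when it is NOT
# followed by something shaped like a numeric or named reference plus ';'.
_BARE_AMP = re.compile(r"&(?!#?[xX]?(?:[0-9a-fA-F]+|\w{1,8});)")

def escape_bare_ampersands(text: str) -> str:
    """Escape ampersands that do not appear to start an entity reference."""
    return _BARE_AMP.sub("&amp;", text)

if __name__ == "__main__":
    print(escape_bare_ampersands("fish & chips, AT&T, &amp; &#160; &eacute;"))
```

Because existing references are skipped, running the function twice gives the same result as running it once. It remains a guess: plain text such as "R&D;" is taken for a reference and left untouched.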
Jonas: that is nice!
What I have done in the past is:
data = escape(unescape(data))
... relying on unescape to leave alone bare ampersands.
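A minimal runnable sketch of that round trip, assuming the `escape` and `unescape` in question behave like the ones in Python's `xml.sax.saxutils` (the comment above does not say which functions were used, so that choice and the wrapper name are mine):

```python
from xml.sax.saxutils import escape, unescape

def normalize_ampersands(data: str) -> str:
    """Decode, then re-encode, so every ampersand ends up escaped exactly once."""
    # unescape() decodes only &lt;, &gt; and &amp;, and leaves bare '&' alone;
    # escape() then encodes every '&', '<' and '>'.
    return escape(unescape(data))

if __name__ == "__main__":
    print(normalize_ampersands("a &amp; b & c"))
```

With these particular functions, any other named reference is not decoded first, so `&eacute;` comes back as `&amp;eacute;`. The trick only suits text whose sole entities are the three that `unescape` knows about.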
Even though I did keep a copy of Leonard’s regex around for future banging-on, now you (classically) have two problems.
My favorite example to get bitten by is a typographically challenged company that sells jacks, plugs, and guitar amps: plug& co. Entities don’t end with a semicolon, they end with a semicolon or the first character following the ampersand that is not a NAME char in your SGML declaration.
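A small illustration of the missing-semicolon point, using Python's `html.unescape`. That function follows HTML5's rules for legacy references rather than any SGML declaration, so treat it as an analogy; the wrapper name is hypothetical.

```python
import html

def decode_references(text: str) -> str:
    """Decode character references the way Python's HTML5-based helper does."""
    return html.unescape(text)

if __name__ == "__main__":
    # '&amp' is recognised even though no semicolon follows it.
    print(decode_references("jacks &amp plugs"))
    # An ampersand followed by a space is not a reference and is left alone.
    print(decode_references("plug& co."))
```

A check that insists on a semicolon after the ampersand would miss the first string entirely.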