It’s just data

Ampersands are Insidious

David Ascher: Ampersands, as I’ve mentioned before, are really nice letters — they have a great typographical history, they’re wonderfully flexible creative outlets for font designers, and they’re quite useful to the writer.  However, they sure do get in the way of a lot of code, especially when it comes to HTML and XML toolchains.

What makes ampersands worse than Unicode is two things:

An example: the src attribute of the script element that people use to reference the JavaScript for Flickr.


there is no way to inspect a string and determine a priori whether or not it is entity encoded, unlike Punycode and UTF-8, where you have a fighting chance to get it right.

Software can make an educated guess that if a string has an ampersand preceding a semicolon then it is likely that there are entities in the markup. I use this heuristic in RSS Bandit and it seems to work pretty well.
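A minimal Python sketch of that kind of guess, for illustration only. This is not RSS Bandit's code, and the function name `looks_entity_encoded` is made up here. It simply checks whether any semicolon appears after the first ampersand:

```python
def looks_entity_encoded(text: str) -> bool:
    """Guess whether text contains entities: is there a ';' after the first '&'?"""
    amp = text.find("&")
    if amp == -1:
        return False  # no ampersand at all, so nothing to decode
    return text.find(";", amp + 1) != -1


print(looks_entity_encoded("AT&amp;T"))                # True
print(looks_entity_encoded("AT&T"))                    # False
print(looks_entity_encoded("Tom & Jerry; the movie"))  # True, a false positive
```

The last call shows why this is only a guess: plain text that happens to contain an ampersand followed later by a semicolon gets misclassified.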

Posted by Dare Obasanjo at

If I recall correctly, RSS Bandit used to reject feeds that are not well formed XML.  Is that still the case?  Or are you making educated guesses there now?

Posted by Sam Ruby at

Must. Control. Fist. Of. Death.

Posted by Mark at

If I recall correctly, RSS Bandit used to reject feeds that are not well formed XML.  Is that still the case?

Nope, still only accept well-formed XML modulo RFC 3023. I thought you meant a string in isolation, such as determining whether the content of a title element is plain text or HTML, not the specific example in David Ascher’s screenshot.

Posted by Dare Obasanjo at

I was referring to David Ascher’s screenshot.  But in any case, different tools selectively applying different standards to different portions of the data with respect to when ill-formedness is acceptable and when it is not; and producing systems that by all outward appearances “seems to work pretty well” — at least, most of the time; well that pretty much is the textbook definition of insidious, isn’t it?

Posted by Sam Ruby at

Here’s a nice educated guess:
s/&(?!#?[xX]?(?:[0-9a-fA-F]+|\w{1,8});)/&amp;/g

I saw this at Leonard Lin’s blog a while ago and thought it was worth sharing.
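For readers who don't speak Perl, here is a rough Python port of the same substitution. The names `BARE_AMP` and `escape_bare_ampersands` are invented for this sketch. The pattern is copied unchanged; the replacement is assumed to be `&amp;`, since replacing an ampersand with itself would do nothing:

```python
import re

# An ampersand NOT followed by something shaped like an entity or
# character reference ("name;", "#123;", "#xA9;").
BARE_AMP = re.compile(r"&(?!#?[xX]?(?:[0-9a-fA-F]+|\w{1,8});)")


def escape_bare_ampersands(text: str) -> str:
    """Escape only the ampersands that do not already start an entity."""
    return BARE_AMP.sub("&amp;", text)


print(escape_bare_ampersands("AT&T"))       # AT&amp;T
print(escape_bare_ampersands("AT&amp;T"))   # AT&amp;T (left alone)
print(escape_bare_ampersands("&#169; me"))  # &#169; me (left alone)
print(escape_bare_ampersands("Q&A; more"))  # Q&A; more (looks like an entity, so it is missed)
```

The last line is the price of guessing: any word of up to eight characters sitting between an ampersand and a semicolon is taken for an entity.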

Posted by Jonas Galvez at

Jonas: that is nice!

What I have done in the past is:

data = escape(unescape(data))

... relying on unescape to leave alone bare ampersands.
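A runnable sketch of that one-liner, assuming the `escape` and `unescape` in question behave like the ones in Python's `xml.sax.saxutils` (the comment does not say which library was used, and the wrapper name `normalize` is made up here):

```python
from xml.sax.saxutils import escape, unescape


def normalize(data: str) -> str:
    """Decode whatever is already escaped, then escape everything once."""
    return escape(unescape(data))


print(normalize("AT&T"))          # AT&amp;T
print(normalize("AT&amp;T"))      # AT&amp;T
print(normalize("a &lt; b & c"))  # a &lt; b &amp; c
```

With these particular functions only `&amp;`, `&lt;` and `&gt;` are decoded by default, so something like `&copy;` comes back as `&amp;copy;` unless an entities mapping is passed to `unescape`.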

Posted by Sam Ruby at

Even though I did keep a copy of Leonard’s regex around for future banging-on, now you (classically) have two problems.

My favorite example to get bitten by is a typographically challenged company that sells jacks, plugs, and guitar amps: plug&amp co. Entities don’t end with a semicolon, they end with a semicolon or the first character following the ampersand that is not a NAME char in your SGML declaration.
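A small Python illustration of how that name bites. Python's `html.unescape` follows HTML5 rules rather than an SGML declaration, so this is only an analogous case, but it accepts `&amp` without the semicolon, while `xml.sax.saxutils.unescape` insists on the semicolon:

```python
import html
from xml.sax.saxutils import unescape as xml_unescape

name = "plug&amp co."

# HTML5 rules treat "&amp" as a reference even without ";".
html_view = html.unescape(name)

# The XML helper only replaces the literal "&amp;", so nothing changes.
xml_view = xml_unescape(name)

print(html_view)  # plug& co.
print(xml_view)   # plug&amp co.
```

Two tools, same input, two different company names. Any heuristic that requires the semicolon sides with the second one.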

Posted by Phil Ringnalda at

After ampersands, entity references

Sam reads his trackbacks — for interesting reactions on my last post, see his comment and the comments on his comment. I haven’t even mentioned that my blog title (as defined in the blogging software I’m using) used to be david ascher’s...

Excerpt from david ascher at

Sam Ruby: Ampersands are Insidious

Wayne Burkett : Sam Ruby: Ampersands are Insidious - Including a better regex for matching entity encoded strings than the one I used in del.icio.us.pl....

Excerpt from HotLinks - Level 1 at

Flikr is incorrect - it should be Flickr ;) Feel free to delete this comment.

Posted by Simon at

Simon: fixed.  Thanks!

Posted by Sam Ruby at

Sam Ruby- Ampersands are Insidious

Go there…...

Excerpt from bytehead's link blog at

