Sam Ruby

Ampersands are Insidious

2005-03-03T04:13:55-08:00

David Ascher: Ampersands, as I’ve mentioned before, are really nice letters — they have a great typographical history, they’re wonderfully flexible creative outlets for font designers, and they’re quite useful to the writer. However, they sure do get in the way of a lot of code, especially when it comes to HTML and XML toolchains.

What makes ampersands worse than Unicode is two things:

there is no way to inspect a string and determine a priori whether or not it is entity encoded, unlike Punycode and UTF-8, where you have a fighting chance to get it right.
most consuming software is too forgiving, and will compensate for lack of appropriate encoding.

An example: the src attribute of the script element people use to reference the JavaScript for Flickr.

Ampersands are Insidious

2005-03-03T05:52:22-08:00

there is no way to inspect a string and determine a priori whether or not it is entity encoded, unlike Punycode and UTF-8, where you have a fighting chance to get it right.

Software can make an educated guess that if a string has an ampersand preceding a semicolon then it is likely that there are entities in the markup. I use this heuristic in RSS Bandit and it seems to work pretty well.
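As a rough illustration of that kind of guess (a hypothetical Python sketch, not RSS Bandit's actual code; the name `looks_entity_encoded` is made up here), the check could be as simple as:

```python
def looks_entity_encoded(text: str) -> bool:
    """Guess that text is entity encoded if some '&' is later followed by ';'."""
    first_amp = text.find("&")
    # No ampersand at all, or no semicolon after the first one: assume plain text.
    return first_amp != -1 and ";" in text[first_amp + 1:]
```

It answers yes for `AT&amp;T` and no for `AT&T`, but it also answers yes for plain text such as `Tom & Jerry; the sequel`, which is the risk with any guess of this kind.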

Ampersands are Insidious

2005-03-03T06:19:15-08:00

If I recall correctly, RSS Bandit used to reject feeds that are not well formed XML. Is that still the case? Or are you making educated guesses there now?

Ampersands are Insidious

2005-03-03T06:41:12-08:00

Must. Control. Fist. Of. Death.

Ampersands are Insidious

2005-03-03T07:38:34-08:00

If I recall correctly, RSS Bandit used to reject feeds that are not well formed XML. Is that still the case?

Nope, it still only accepts well-formed XML, modulo RFC 3023. I thought you meant a string in isolation, such as determining whether the content of a title element is plain text or HTML, not the specific example in David Ascher’s screenshot.

Ampersands are Insidious

2005-03-03T07:52:04-08:00

I was referring to David Ascher’s screenshot. But in any case, different tools selectively applying different standards to different portions of the data with respect to when ill-formedness is acceptable and when it is not; and producing systems that by all outward appearances “seems to work pretty well” — at least, most of the time; well that pretty much is the textbook definition of insidious, isn’t it?

Ampersands are Insidious

2005-03-03T08:18:38-08:00

Here’s a nice educated guess:
s/&(?!#?[xX]?(?:[0-9a-fA-F]+|\w{1,8});)/&amp;/g

I saw this at Leonard Lin’s blog a while ago and thought it was worth sharing.
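For readers who do not speak Perl, here is a minimal Python rendering of that substitution, assuming the intended replacement text is `&amp;`. The names `BARE_AMP` and `escape_bare_ampersands` are illustrative only:

```python
import re

# An ampersand that is NOT followed by something shaped like an entity or
# character reference: optional '#', optional 'x', then hex digits or up to
# eight word characters, then a semicolon.
BARE_AMP = re.compile(r"&(?!#?[xX]?(?:[0-9a-fA-F]+|\w{1,8});)")


def escape_bare_ampersands(text: str) -> str:
    """Escape ampersands that do not appear to start an entity."""
    return BARE_AMP.sub("&amp;", text)
```

It turns `AT&T` into `AT&amp;T` and leaves `AT&amp;T`, `&copy;` and `&#8217;` alone. It is still a guess: `Q&A; more` is also left alone, because `&A;` has the shape of an entity.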

Ampersands are Insidious

2005-03-03T08:49:35-08:00

Jonas: that is nice!

What I have done in the past is:

data = escape(unescape(data))

... relying on unescape to leave alone bare ampersands.
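A self-contained sketch of that round trip, assuming `escape` and `unescape` are the ones in Python's `xml.sax.saxutils` (the comment does not say which library was used, and `normalize_ampersands` is a made-up name):

```python
from xml.sax.saxutils import escape, unescape


def normalize_ampersands(data: str) -> str:
    """Decode whatever is already escaped, then escape everything once."""
    # unescape() only rewrites &lt;, &gt; and &amp;, so a bare '&' passes
    # through untouched and is then escaped exactly once by escape().
    return escape(unescape(data))
```

Both `AT&T` and `AT&amp;T` come out as `AT&amp;T`. With these particular functions only `&amp;`, `&lt;` and `&gt;` are recognized, so an HTML entity such as `&copy;` comes out as `&amp;copy;`.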

Ampersands are Insidious

2005-03-03T09:05:24-08:00

Even though I did keep a copy of Leonard’s regex around for future banging-on, now you (classically) have two problems.

My favorite example to get bitten by is a typographically challenged company that sells jacks, plugs, and guitar amps: plug&amp co. Entities don’t only end with a semicolon; they end with a semicolon or the first character following the ampersand that is not a NAME char in your SGML declaration.
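To see the mismatch concretely, here is a small Python sketch. It assumes the company name is written `plug&amp co`, uses the standard library's `html.unescape` (which follows HTML5 rules rather than an SGML declaration) as a stand-in for a forgiving consumer, and repeats the regex from the earlier comment under the illustrative name `BARE_AMP`:

```python
import html
import re

# The "educated guess" regex from earlier in the thread: it only treats an
# ampersand as an entity when a semicolon terminates it.
BARE_AMP = re.compile(r"&(?!#?[xX]?(?:[0-9a-fA-F]+|\w{1,8});)")

name = "plug&amp co"

# A forgiving consumer accepts "&amp" without the semicolon and decodes it.
decoded = html.unescape(name)

# The regex sees no semicolon, decides the ampersand is bare, and escapes it.
reescaped = BARE_AMP.sub("&amp;", name)
```

Here `decoded` is `plug& co` while `reescaped` is `plug&amp;amp co`: the two tools disagree about whether the original string contained an entity at all.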

After ampersands, entity references

2005-03-03T12:15:42-08:00

Sam reads his trackbacks — for interesting reactions on my last post, see his comment and the comments on his comment. I haven’t even mentioned that my blog title (as defined in the blogging software I’m using) used to be david ascher’s...

Sam Ruby: Ampersands are Insidious

2005-03-03T22:15:21-08:00

Wayne Burkett : Sam Ruby: Ampersands are Insidious - Including a better regex for matching entity encoded strings than the one I used in del.icio.us.pl....

Ampersands are Insidious

2005-03-05T14:34:26-08:00

Flikr is incorrect - it should be Flickr ;) Feel free to delete this comment.

Ampersands are Insidious

2005-03-05T15:17:29-08:00

Simon: fixed. Thanks!
