Joel Spolsky: It would be convenient if you could put the
Content-Type of the HTML file right in the HTML file itself, using
some kind of special tag. Of course this drove purists crazy... how
can you read the HTML file until you know what encoding it's
in?!
The same thing is true of XML. It is easy to eliminate what remains the
single largest source of invalid feeds. Take a moment and
ensure that the first line of each of your feeds looks like the
following:
<?xml version="1.0" encoding="iso-8859-1"?>
Or, if you happen to be on a Windows platform and have a
tendency to cut and paste content that may have so-called smart
quotes in it, use the following:
<?xml version="1.0" encoding="windows-1252"?>
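To see why the choice of declaration matters for smart quotes, here is a small Python sketch. Python and the sample bytes are my own illustrative choices, not something from the quoted material. Bytes 0x93 and 0x94 are curly quotes in windows-1252, while iso-8859-1 maps the same bytes to invisible C1 control characters.

```python
# Sample bytes (assumed): the word "hi" wrapped in curly quotes,
# as a Windows editor would typically save them.
raw = b"\x93hi\x94"

# Decoded as windows-1252, the bytes become real curly quotes.
as_cp1252 = raw.decode("windows-1252")

# Decoded as iso-8859-1, the same bytes become C1 control characters.
as_latin1 = raw.decode("iso-8859-1")

print(repr(as_cp1252))
print(repr(as_latin1))
```

Neither decode raises an error, which is why a wrong declaration tends to go unnoticed until a strict consumer sees the control characters.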
Finally, while I don't know of a single feed parser that cares
about the charset specified on the Content-Type header, it doesn't
hurt to make it match.
Technically, if
you use text/xml and omit the charset parameter, the default (us-ascii, per RFC 3023) is supposed to override what is
specified in the document itself. You can avoid this by simply
using application/xml instead.
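As a rough illustration of that precedence, the following Python sketch picks the encoding a strictly conforming consumer would use. The function name `effective_encoding` is hypothetical, and the rules are a simplified reading of RFC 3023 that covers only text/xml and application/xml.

```python
from email.message import Message


def effective_encoding(content_type, declared):
    """Return the encoding a strict consumer would use (simplified sketch).

    content_type: value of the HTTP Content-Type header.
    declared: encoding named in the XML declaration of the document.
    """
    msg = Message()
    msg["Content-Type"] = content_type
    charset = msg.get_param("charset")
    if charset:
        # An explicit charset parameter wins over the document.
        return charset.lower()
    if msg.get_content_type() == "text/xml":
        # text/xml without a charset defaults to us-ascii.
        return "us-ascii"
    # application/xml defers to the document's own declaration.
    return declared.lower()


print(effective_encoding("text/xml", "iso-8859-1"))
print(effective_encoding("application/xml", "iso-8859-1"))
```

In practice, as noted, most feed parsers appear to ignore the header entirely, so this models what the specification asks for rather than what consumers actually do.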
I don't know of a single feed consumer that cares about the MIME type, either.
this is related to one of my biggest recent peeves with php: none of the released versions let you leave the encoding detection up to the xml parser, and the default is iso-8859-1.
if you want to handle xml in php correctly, you have to sniff out the encoding directly from the file before you can create your xml parser object.
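The sniffing step described here can be sketched in a few lines. This is a Python illustration of the idea, not PHP's API. The helper name `sniff_xml_encoding` is made up, and it only handles ASCII-compatible encodings (no UTF-16 detection).

```python
import re

# Matches encoding="..." inside an XML declaration at the start of the data.
_DECL = re.compile(
    rb'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def sniff_xml_encoding(data, default="utf-8"):
    """Return the encoding named in the XML declaration, or a default.

    XML without a declaration or byte order mark is UTF-8, hence the default.
    """
    if data.startswith(b"\xef\xbb\xbf"):
        data = data[3:]  # skip a UTF-8 byte order mark
    match = _DECL.match(data[:200])
    return match.group(1).decode("ascii").lower() if match else default


feed = b'<?xml version="1.0" encoding="iso-8859-1"?><rss/>'
print(sniff_xml_encoding(feed))
```

The returned name can then be handed to whatever parser constructor needs the encoding up front.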
i just recently changed my feed to the iso-8859-1 encoding. i'd quoted some german and it blew up my feed.
but i was wondering, why doesn't utf-8 work? i was under the naive impression that utf-8 was a magic all inclusive encoding that all (or nearly all) other encodings should fall under.
i don't know if this is a super dumb question or not, if it is perhaps someone could just point me to a good book or web site on the topic?
A short summary: if your data contains bytes with values greater than 127 (0x7F), then you need to be aware of encoding. Pretty much everything that exists in encodings other than UTF-8 also exists someplace in UTF-8, but characters above 127 will almost always be represented by a different series of bytes.
As long as the German you quoted is in the same encoding as your feed, then you are fine. If not, you need to convert it to the same encoding.
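A small Python sketch of the point, using a sample German word of my own choosing rather than the text that was actually quoted: the same characters become different bytes in each encoding, and converting means decoding with the source encoding and re-encoding with the feed's.

```python
# Sample German text (assumed): "Gruesse" written with u-umlaut and sharp s.
quote = "Gr\u00fc\u00dfe"

latin1_bytes = quote.encode("iso-8859-1")  # one byte per character
utf8_bytes = quote.encode("utf-8")         # two bytes each for the non-ASCII characters

print(latin1_bytes)
print(utf8_bytes)

# Converting: decode with the source encoding, re-encode with the feed's.
converted = latin1_bytes.decode("iso-8859-1").encode("utf-8")

# Skipping the conversion fails: these bytes are not valid UTF-8.
try:
    latin1_bytes.decode("utf-8")
    mismatch_detected = False
except UnicodeDecodeError:
    mismatch_detected = True

print(converted == utf8_bytes, mismatch_detected)
```

This is the likely reason a UTF-8 feed breaks when iso-8859-1 text is pasted into it unconverted: the bytes are fine in their own encoding but malformed in the declared one.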
sam ruby was kind enough to point me to this article on characters vs. bytes. i obviously have no idea what i'm doing! now i have to do my homework!...