It’s just data

Plain Text

Joel Spolsky: It would be convenient if you could put the Content-Type of the HTML file right in the HTML file itself, using some kind of special tag. Of course this drove purists crazy... how can you read the HTML file until you know what encoding it's in?!

The same thing is true of XML.  It is easy to eliminate what remains the single largest source of invalid feeds.  Take a moment and ensure that the first line of each of your feeds looks like the following:

<?xml version="1.0" encoding="iso-8859-1"?>

Or, if you happen to be on a Windows platform and have a tendency to cut and paste content that may have so-called smart quotes in it, use the following:

<?xml version="1.0" encoding="windows-1252"?>
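To see why the declaration matters for pasted smart quotes, here is a minimal Python sketch. The sample bytes are made up for illustration; 0x93 and 0x94 are the bytes windows-1252 uses for curly double quotes.

```python
# Bytes a Windows application might emit for: (curly quote)Plain Text(curly quote)
raw = b"\x93Plain Text\x94"

# Decoded as windows-1252, the two bytes become real quotation marks
# (U+201C and U+201D).
as_cp1252 = raw.decode("windows-1252")

# Decoded as iso-8859-1, the same bytes become C1 control characters
# (U+0093 and U+0094), not quotation marks.
as_latin1 = raw.decode("iso-8859-1")

# Decoded as UTF-8, the bytes are not valid at all.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

print(repr(as_cp1252))
print(repr(as_latin1))
print(utf8_ok)
```

The bytes on disk never change; only the label in the XML declaration decides what a parser makes of them.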

Finally, while I don't know of a single feed parser that cares about the charset specified on the Content-Type header, it doesn't hurt to make it match.  Technically, if you serve a feed as text/xml without a charset parameter, the default (us-ascii, per RFC 3023) is supposed to override whatever the document itself declares.  You can avoid this by simply using application/xml instead.
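The precedence described above can be sketched in a few lines of Python. This is a simplified reading of the RFC 3023 rules, not a complete implementation: it ignores byte order marks and UTF-16, and the function names `declared_encoding` and `effective_encoding` are hypothetical, chosen for this illustration.

```python
import re
from email.message import Message

# Matches the encoding pseudo-attribute of an XML declaration at the
# very start of the document.
_DECL = re.compile(
    rb'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def declared_encoding(body: bytes):
    """Return the encoding named in the XML declaration, or None."""
    m = _DECL.match(body)
    return m.group(1).decode("ascii").lower() if m else None


def effective_encoding(content_type: str, body: bytes) -> str:
    """Simplified sketch of which encoding wins, per RFC 3023."""
    msg = Message()
    msg["Content-Type"] = content_type
    charset = msg.get_content_charset()  # lowercased value, or None
    if charset:
        # An explicit charset parameter is authoritative.
        return charset
    if msg.get_content_type() == "text/xml":
        # text/xml with no charset: the us-ascii default overrides
        # whatever the document itself declares.
        return "us-ascii"
    # application/xml with no charset: the document's declaration is
    # used, falling back to the XML default of UTF-8.
    return declared_encoding(body) or "utf-8"


feed = b'<?xml version="1.0" encoding="iso-8859-1"?><feed/>'
print(effective_encoding("text/xml", feed))
print(effective_encoding("application/xml", feed))
```

Under these assumptions the same feed is treated as us-ascii when served as text/xml and as iso-8859-1 when served as application/xml, which is the mismatch that switching MIME types avoids.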


I don't know of a single feed consumer that cares about the MIME type, either.

Posted by Mark at

Shouldn't that be windows-1252 for most West European languages? windows-1255 is used for Hebrew, at least according to http://www.w3.org/International/O-charset-list.html

Posted by Lauren at

Lauren: Good catch!  Fixed.

Posted by Sam Ruby at

this is related to one of my biggest recent peeves with php: none of the released versions let you leave the encoding detection up to the xml parser, and the default is iso-8859-1.

if you want to handle xml in php correctly, you have to sniff out the encoding directly from the file before you can create your xml parser object.

Posted by jim winstead at
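The sniffing step jim describes might look roughly like the following. It is sketched in Python rather than PHP, purely as an illustration: `sniff_xml_encoding` is a hypothetical helper, it only recognizes UTF-8 and UTF-16 byte order marks, and it assumes any declaration is readable as ASCII in the first kilobyte.

```python
import codecs
import re

# Byte order marks checked before looking for a declaration.
_BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

# Encoding pseudo-attribute of an XML declaration at the start of the data.
_DECL = re.compile(
    rb'<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def sniff_xml_encoding(data: bytes, default: str = "utf-8") -> str:
    """Guess the encoding of an XML document from its first bytes."""
    head = data[:1024]
    for bom, name in _BOMS:
        if head.startswith(bom):
            return name
    m = _DECL.match(head)
    if m:
        return m.group(1).decode("ascii").lower()
    # No BOM and no declaration: XML's own default is UTF-8.
    return default


feed = b'<?xml version="1.0" encoding="ISO-8859-1"?><rss/>'
encoding = sniff_xml_encoding(feed)
text = feed.decode(encoding)  # decode only after the encoding is known
print(encoding)
```

The sniffed name would then be handed to whatever creates the parser, instead of relying on a fixed default such as iso-8859-1.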

i just recently changed my feed to the iso-8859-1 encoding. i'd quoted some german and it blew up my feed.

but i was wondering, why doesn't utf-8 work? i was under the naive impression that utf-8 was a magic all inclusive encoding that all (or nearly all) other encodings should fall under.

i don't know if this is a super dumb question or not, if it is perhaps someone could just point me to a good book or web site on the topic?

thanks in advance...

Posted by jonvon at

jonvon: Characters vs. Bytes has a lot of good information in it.

A short summary: if your data contains bytes with values greater than 127 (0x7F), then you need to be aware of encoding.  Pretty much every character that exists in an encoding other than UTF-8 also exists in UTF-8, but will undoubtedly be represented by a different series of bytes.

As long as the German you quoted is in the same encoding as your feed, then you are fine.  If not, you need to convert it to the same encoding.

Posted by Sam Ruby at
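A small Python sketch of Sam's summary, using a made-up German word as sample data. The `transcode` helper is hypothetical and assumes you already know the source encoding of the quoted text.

```python
# "Gruesse" spelled with u-umlaut and sharp s, written with escapes so the
# example does not depend on the encoding of this file.
text = "Gr\u00fc\u00dfe"

latin1_bytes = text.encode("iso-8859-1")  # one byte per character
utf8_bytes = text.encode("utf-8")         # two bytes for each non-ASCII character

print(latin1_bytes)
print(utf8_bytes)


def transcode(data: bytes, source: str, target: str) -> bytes:
    """Convert bytes from one encoding to another via Unicode text."""
    return data.decode(source).encode(target)


# Quoted text in iso-8859-1 must be converted before it is pasted
# into a feed that declares UTF-8.
converted = transcode(latin1_bytes, "iso-8859-1", "utf-8")

# Skipping the conversion is what makes a feed blow up: the iso-8859-1
# bytes are not a valid UTF-8 sequence.
try:
    latin1_bytes.decode("utf-8")
    mixed_ok = True
except UnicodeDecodeError:
    mixed_ok = False
```

Both byte strings represent the same five characters; they differ only in the bytes above 0x7F, which is why mixing them in one feed breaks parsing.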

thanks a lot sam!

Posted by jonvon at

character set thing

sam ruby was kind enough to point me to this article on characters vs. bytes. i obviously have no idea what i'm doing! now i have to do my homework!...

Excerpt from jonvon.freedomblog at
