intertwingly

It’s just data

XML Cleansing


If you accept data from various sources, and want to produce XML that can be consumed, one thing you need to be careful about is character set issues.

On the input side, people often lie or make mistakes.  Many don’t specify an encoding, and while XML’s default is utf-8, it is common to find data that is actually iso-8859-1 or even windows-1252.
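A minimal sketch of coping with that ambiguity, assuming Ruby 1.9 or later (where strings carry an encoding): trust the bytes as utf-8 if they validate, otherwise guess windows-1252. The name to_utf8 is made up for illustration, and the fallback is a heuristic, not a guarantee.

```ruby
# Hypothetical helper, not part of any library.
# Treat the bytes as UTF-8 when they validate; otherwise assume
# windows-1252, which covers the printable range of iso-8859-1.
def to_utf8(bytes)
  text = bytes.dup.force_encoding(Encoding::UTF_8)
  return text if text.valid_encoding?

  # The few bytes windows-1252 leaves undefined become U+FFFD.
  bytes.dup.force_encoding(Encoding::Windows_1252)
       .encode(Encoding::UTF_8, invalid: :replace, undef: :replace)
end
```

The guess can be wrong: a windows-1252 byte sequence that happens to be valid utf-8 will pass through untouched, so this is a best effort rather than a detector.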

On the output side, if you want to produce something that can be consumed, then it behooves you to be aware that the quality of XML parsers out there varies widely.  Many of the initial feed aggregators were no better than regular expressions, simply ignoring character set issues and slapping descriptions into HTML.  While there has been much improvement on this front, many still fall back to such behaviors when they encounter other, unrelated, problems.

Carrying forward the experience I gained with my existing Python implementation of my weblog, I’ve come up with xchar.rb: some data, two small methods, and six tests.
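To give a feel for the shape of such a thing, here is a rough sketch. It is not the actual xchar.rb: the module and method names are hypothetical, the windows-1252 table is abbreviated to five entries, and replacing illegal characters with an asterisk is just one possible choice. The idea is to emit pure ASCII, so that a consumer that ignores the declared encoding still gets the right characters.

```ruby
# Illustrative sketch only; all names here are hypothetical.
module XmlCleanse
  # A handful of windows-1252 bytes that commonly show up mislabeled
  # (a complete table would cover the whole 0x80..0x9F range).
  CP1252 = {
    0x80 => 0x20AC, 0x91 => 0x2018, 0x92 => 0x2019,
    0x93 => 0x201C, 0x94 => 0x201D
  }.freeze

  PREDEFINED = { 0x26 => "&amp;", 0x3C => "&lt;", 0x3E => "&gt;" }.freeze

  # Code points allowed by the XML 1.0 Char production.
  VALID = [0x9..0xA, 0xD..0xD, 0x20..0xD7FF,
           0xE000..0xFFFD, 0x10000..0x10FFFF].freeze

  # Map one code point to ASCII-only XML text.
  def self.escape_codepoint(n)
    n = CP1252.fetch(n, n)
    n = 0x2A unless VALID.any? { |range| range.cover?(n) } # '*' for illegal
    PREDEFINED[n] || (n < 128 ? n.chr : format("&#%d;", n))
  end

  # Decode as UTF-8 when possible; otherwise read the bytes one by one,
  # which handles iso-8859-1 and windows-1252 input.
  def self.escape(text)
    points = begin
      text.unpack("U*")
    rescue ArgumentError # malformed UTF-8
      text.unpack("C*")
    end
    points.map { |n| escape_codepoint(n) }.join
  end
end
```

With this sketch, XmlCleanse.escape("a < b & c") gives "a &lt; b &amp; c", and a stray windows-1252 curly quote comes out as a numeric character reference rather than as a raw byte.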

One potential use of xchar.rb would be in Ruby’s XML Builder, overriding the method it uses to escape text:

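require 'builder'

# Route Builder's text escaping through String#to_xs.
# Assumes xchar.rb has already been loaded.
class Builder::XmlMarkup < Builder::XmlBase
  def _escape(text)
    text.to_xs
  end
end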