Sam Ruby

XML Cleansing

2005-09-28T14:21:23-07:00

If you accept data from various sources, and want to produce XML that can be consumed, one thing you need to be careful about is character set issues.

On the input side, people often lie or make mistakes. Many don’t specify an encoding, and while XML’s default is utf-8, it is common to find iso-8859-1 or even win-1252 data.
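A minimal sketch of one way to cope with mislabeled input, assuming a Ruby with per-string encodings (1.9 or later): accept the bytes as utf-8 if they are well formed, and otherwise guess win-1252. The helper name `to_utf8` is hypothetical, and the fallback is a guess rather than a detection.

```ruby
# Hypothetical helper: returns the input as UTF-8, guessing at the source encoding.
def to_utf8(bytes)
  candidate = bytes.dup.force_encoding(Encoding::UTF_8)
  return candidate if candidate.valid_encoding?

  # Not well-formed UTF-8: assume win-1252, which also covers most
  # iso-8859-1 text. Bytes that win-1252 leaves undefined are replaced.
  bytes.dup.force_encoding(Encoding::Windows_1252)
       .encode(Encoding::UTF_8, invalid: :replace, undef: :replace)
end

puts to_utf8("caf\xC3\xA9")   # already UTF-8, returned unchanged
puts to_utf8("caf\xE9")       # a lone 0xE9 byte, read as win-1252
```

Trying utf-8 first is the safer order: text in a legacy single-byte encoding rarely happens to form valid utf-8 sequences, whereas almost any byte string is "valid" win-1252.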

On the output side, if you want to produce something that can be consumed, then it behooves you to be aware that the quality of XML parsers out there varies widely. Many of the initial feed aggregators were no better than regular expressions, simply ignoring character set issues and slapping descriptions into HTML. While there has been much improvement on this front, many still fall back to such behaviors when they encounter other, unrelated, problems.

Carrying forward the experience I gained with my existing Python implementation of my weblog, I’ve come up with xchar.rb: some data, two small methods, and six tests.

One potential use of this would be in Ruby’s XML Builder:

class Builder::XmlMarkup < Builder::XmlBase
  def _escape(text)
    text.to_xs
  end
end
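For readers without xchar.rb at hand, here is a rough sketch of what a to_xs-style method might do. It is not the xchar.rb source: the method names are hypothetical, and the input is assumed to be well-formed utf-8. Markup characters become entities, other non-ASCII characters become numeric character references, and code points that XML 1.0 does not allow become an asterisk.

```ruby
# Hypothetical sketch; names are illustrative and this is not the xchar.rb source.
def xml_escape_codepoint(n)
  case n
  when 38 then '&amp;'
  when 60 then '&lt;'
  when 62 then '&gt;'
  when 0x9, 0xA, 0xD, 0x20..0x7F then n.chr            # plain ASCII passes through
  when 0x80..0xD7FF, 0xE000..0xFFFD, 0x10000..0x10FFFF
    "&##{n};"                                           # numeric character reference
  else '*'                                              # not a legal XML 1.0 character
  end
end

# Assumes text is well-formed UTF-8; unpack('U*') yields its code points.
def to_ascii_xml(text)
  text.unpack('U*').map { |n| xml_escape_codepoint(n) }.join
end

puts to_ascii_xml("\u00A9 2006 <Sam & friends>")
# prints: &#169; 2006 &lt;Sam &amp; friends&gt;
```

Because the output is pure ASCII, it reads the same whether a consumer treats it as utf-8, iso-8859-1, or win-1252.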

XML Cleansing

2005-09-29T08:55:11-07:00

“May don’t specify an encoding”

That’s ‘Many’, not ‘May’, right?

XML Cleansing

2005-09-29T08:56:52-07:00

Dilip: fixed. Thanks!

XML Cleansing

2005-09-29T09:38:04-07:00

one thing you need to be careful about is character set issues.

That should be character encoding issues, right?

Sam Ruby: XML Cleansing

2005-09-29T17:15:09-07:00

[link]...

Python Web goodies

2005-09-30T10:45:29-07:00

There has been lots of progress in the python world recently. I keep opening posts but not getting time to write them. So this is more a list than comment (well that is how it started, now grown a bit......

RDF as XML

2005-09-30T20:15:09-07:00

Over the last week, Planet RDF has seen more than a few posts and comments on the RDF/XML serialisation syntax, most of them looking into its (almost not enumerable) possible variations. Danny Ayers has a great overview with reference to the...

Dragons be gone

2005-11-02T11:43:32-08:00

Luckily, I’m outside of arm’s reach. You see, my weblog is 100% valid XHTML 1.1, encoded as utf-8. Truth be told, however, it also would be considered as 100% valid XHTML 1.1, encoded as iso-8859-1 (roman), iso-8859-5 (cyrillic), win-1252 (Micro...

Sam Ruby: XML Cleansing

2006-04-27T20:21:18-07:00

Someone at Smarking has bookmarked your post....

XML Cleansing

2006-05-10T08:58:51-07:00

Sooo, this is why all my new Rails apps produce unreadable XML with the new Builder...

Sam, can you guess in three steps why this was a very bad solution (as I figure you coined it)?

XML Cleansing

2006-05-10T09:13:59-07:00

Sam, can you guess in three steps why this was a very bad solution

Encoding other than utf-8?

I can point to Rails apps that produced unreadable XML with the old builder.

If you provide more details, I can construct a test case and a fix.

XML Cleansing

2006-05-10T10:41:46-07:00

Here’s a patch which assumes that people who specify an encoding other than utf-8 know what they are doing.

XML Cleansing

2006-05-10T12:55:04-07:00

See the mail. Case in point - in my system all that goes out is raw, bona fide UTF-8. I would like to have it unescaped in my XML output as well. Right now every Russian letter I output via Builder gets escaped.

XML Cleansing

2006-05-10T14:06:45-07:00

Just to be clear, it is not exactly raw. There are a number of characters that must be escaped. < and &, to name but two.

I just want to make sure that your issue is a cosmetic one, not a functional one. I guessed before what your issue might be, and apparently I guessed wrong. I’d like to not guess any more.

If it turns out that your issue is just the cosmetic one, I am quite prepared to make a patch that assumes that people that require 'jcode' and set $KCODE='u' are prepared to handle utf-8. Everyone else will get a bulkier, but safer, result.

XML Cleansing

2006-05-10T15:27:01-07:00

Well, for me the issue is mostly cosmetic, yes - the feed becomes basically impossible to read in plain-text. Enable escaping on all characters and try to examine your feed - that’s what I see. Besides, there are problems with your “escape-everything” approach:

1. I can’t handle feeds I might download using any simple text search tools (we had a discussion on that already)
2. I can’t grab these feeds and save them elsewhere without firing up a parser
3. Browsers have problems with this kind of escaping when it is used in the form context (for pre-filled values) when Builder is used to output HTML.
4. Every feed I now generate has grown in raw size, which means longer download times and (if it escalates) a bigger bandwidth bill. Many of the people who have to read my feeds are on dialup.
5. RSS readers get confused when fed this stuff inside an HTML containing block within an entry, they don’t know what to display.

In short - unless you have an XML parser on the receiving side (which might not be the case) the “escape-all” approach is a no-no.

I would love to keep the encoding of basic entities like ampersands, but to be exempt from this... uhm... precaution. An xml.escape_utf = false or such would be perfect. Don’t expect UTF8 users in Ruby to be running with jcode all the time (I am not, for some time).
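The switch being asked for can be sketched roughly as follows. This is illustrative only: `escape_text` and its `escape_non_ascii` keyword are hypothetical names standing in for the suggested xml.escape_utf setting, not an existing Builder option, and the input is assumed to be well-formed utf-8.

```ruby
# Hypothetical helper; escape_non_ascii mirrors the requested switch and is
# not an existing Builder option.
def escape_text(text, escape_non_ascii: true)
  # Markup characters are always escaped, whatever the setting.
  escaped = text.gsub(/[&<>]/, { '&' => '&amp;', '<' => '&lt;', '>' => '&gt;' })
  return escaped unless escape_non_ascii

  # Replace every remaining non-ASCII character with a numeric reference.
  escaped.gsub(/[^\x00-\x7F]/) { |char| "&##{char.ord};" }
end

puts escape_text("Привет & <пока>", escape_non_ascii: false)
puts escape_text("Привет & <пока>")
```

The first call keeps the Russian text readable and escapes only the ampersand and angle brackets; the second produces the bulkier, ASCII-only form.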

XML Cleansing

2006-05-10T16:02:06-07:00

Besides, you have a relatively foolproof way to verify whether a passed string is UTF-8; if the encoding of the builder is set to UTF-8, you can just leave it as is.

XML Cleansing

2006-05-10T20:24:28-07:00

Browsers have problems with this kind of escaping when it is used in the form context (for pre-filled values) when Builder is used to output HTML.

Oh, really? What browser do you use? Can it handle this?

The easiest way to verify that a string is UTF-8 is to send it unpack("U*").
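As a small sketch of that check (the helper name `utf8?` is hypothetical): unpack("U*") raises ArgumentError when it meets a malformed sequence, so the rescue is the "not utf-8" answer.

```ruby
# Hypothetical helper name; unpack("U*") raises ArgumentError on malformed UTF-8.
def utf8?(text)
  text.unpack('U*')
  true
rescue ArgumentError
  false
end

puts utf8?("\xC2\xA9 2006")   # well-formed two-byte sequence for the copyright sign
puts utf8?("\xA9 2006")       # lone iso-8859-1 byte, not well-formed UTF-8
```

On Ruby 1.9 and later, String#valid_encoding? on a string tagged as UTF-8 answers the same question without unpacking.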

If you were to check the documentation, you would see that Jim has already provided means of inserting strings verbatim via the shift and symbol operators. Those that use these operators need to take extra care. I’m interested in helping everybody else out.

I said I would write a patch to allow those that use jcode and pass in correct utf-8 to not have high bit characters escaped. I have now done so.

I’m merely a person submitting patches. It is Jim that you need to convince.

XML Cleansing

2006-05-11T06:32:30-07:00

Pronto. Will try to create a patch that leaves bona-fide UTF-8 intact if the builder is instructed for utf-8.

XML Cleansing

2006-05-11T06:51:03-07:00

Julik, something you might want to try first:

require 'builder'
b = Builder::XmlMarkup.new
b.rights "\xC2\xA9 2006"

puts "before: " + b.target!

class Builder::XmlMarkup
  def target!
    @target.gsub(/&#(\d+);/) {[$1.to_i].pack('U*')}
  end
end

puts "after:  " + b.target!

ASCII, ISO-8859-1, UCS, and Erlang

2007-09-14T06:02:15-07:00

Tony Garnock-Jones: It is important to realize that Erlang was invented (in 1987) before utf-8 was (in 1992). Now, let’s explore the relationship between ASCII, ISO-8859-1, and UCS (a.k.a. Unicode), by way of example. ...

XML Cleansing

2007-10-03T18:23:21-07:00

I’ve written a C implementation of your code here:

[link]

It extends the String class by providing the fast_xs method to it (equivalent to to_xs) and is roughly 70 times faster. Hooked into Builder::XmlMarkup, this provides roughly a ten-fold speed increase on some RSS feeds I’m testing with Rails (1.2.3).

I’ll tell Jim and _why (Hpricot) about it, too

XML Cleansing

2009-04-29T07:54:03-07:00

Hi Sam,

I’m using Builder and attempting to send serialized ruby into an xml node (Marshal.dump) — this doesn’t seem to be working though when I call Marshal.load after parsing the xml. Is this ruby’s Marshal object or a limitation of the encoding scheme?

Thanks,
Matt
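One likely cause, offered as an assumption rather than a diagnosis: Marshal.dump returns binary data that begins with the version bytes 4 and 8, and control characters like those are not legal in XML 1.0, so they cannot survive a round trip through escaped text. A common workaround is to Base64-encode the dump before placing it in the node. A minimal sketch, with an illustrative record and without the Builder and parsing steps:

```ruby
# Illustrative only: Base64 keeps Marshal's binary output within plain ASCII.
record = { 'title' => 'XML Cleansing', 'tags' => %w[ruby xml] }

encoded = [Marshal.dump(record)].pack('m0')   # Base64 without line breaks
# ... place `encoded` in the XML node, then read it back after parsing ...
restored = Marshal.load(encoded.unpack1('m0'))

puts restored == record
```

The Base64 alphabet contains no markup characters, so the encoded text passes through XML escaping unchanged.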

Escaping XML in Ruby

2009-07-28T06:45:09-07:00

Looked around and found this post from Sam Ruby, who wrote the XML-escaping code that was included in Builder. Here is a short class I wrote to abstract out the XML escaping functionality, and be sure it is a string before calling to_xs on it....
