If you accept data from various sources, and want to produce XML that can be consumed, one thing you need to be careful about is character set issues.
On the input side, people often lie or make mistakes. Many don’t specify an encoding, and while XML’s default is utf-8, it is common to find iso-8859-1 or even win-1252 data.
On the output side, if you want to produce something that can be consumed, then it behooves you to be aware that the quality of XML parsers out there varies widely. Many of the initial feed aggregators were no better than regular expressions, simply ignoring character set issues and slapping descriptions into HTML. While there has been much improvement on this front, many still fall back to such behaviors when they encounter other, unrelated, problems.
Carrying forward the experience I gained with my existing Python implementation of my weblog, I’ve come up with xchar.rb: some data, two small methods, and six tests.
One potential use of this would be in Ruby’s XML Builder:
class Builder::XmlMarkup < Builder::XmlBase
  def _escape(text)
    text.to_xs
  end
end
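For readers who want to see the general shape of such an escaping routine, here is a minimal sketch. It is not the xchar.rb source: the module name `XmlEscapeSketch`, its method names, and the deliberately partial win-1252 table are assumptions made for illustration, and it is written for current Ruby versions. The idea is to try the input as utf-8 first, fall back to one byte per character (iso-8859-1 or win-1252) when that fails, and emit only ASCII with numeric character references.

```ruby
# Hypothetical sketch of the approach, not the actual xchar.rb code.
module XmlEscapeSketch
  # A few win-1252 bytes in the 128..159 range mapped to Unicode code points
  # (partial table, for illustration only).
  CP1252 = {
    128 => 8364, # euro sign
    133 => 8230, # horizontal ellipsis
    145 => 8216, # left single quotation mark
    146 => 8217, # right single quotation mark
    147 => 8220, # left double quotation mark
    148 => 8221, # right double quotation mark
    150 => 8211, # en dash
    151 => 8212  # em dash
  }.freeze

  # Characters that must always be escaped in XML text.
  PREDEFINED = { 38 => '&amp;', 60 => '&lt;', 62 => '&gt;' }.freeze

  # Code points allowed by the XML 1.0 Char production.
  VALID = [0x9, 0xA, 0xD, 0x20..0xD7FF, 0xE000..0xFFFD, 0x10000..0x10FFFF].freeze

  # Turn one code point into ASCII-only XML text.
  def self.escape_codepoint(n)
    n = CP1252.fetch(n, n)
    return '*' unless VALID.any? { |v| v === n } # not representable in XML
    PREDEFINED[n] || (n < 128 ? n.chr : "&##{n};")
  end

  def self.escape(text)
    bytes = text.dup.force_encoding(Encoding::BINARY)
    codepoints =
      begin
        bytes.unpack('U*') # raises ArgumentError on malformed utf-8
      rescue ArgumentError
        bytes.unpack('C*') # treat each byte as one character
      end
    codepoints.map { |n| escape_codepoint(n) }.join
  end
end

puts XmlEscapeSketch.escape("\xC2\xA9 2006")   # utf-8 input
puts XmlEscapeSketch.escape("\xA9 2006".b)     # iso-8859-1 input
puts XmlEscapeSketch.escape("a < b & c")
```

Because everything outside ASCII becomes a numeric character reference, the output stays readable by consumers that ignore the declared encoding, at the cost of a bulkier document.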
“May don’t specify an encoding”
That’s ‘Many’, not ‘May’, right?
one thing you need to be careful about is character set issues.
That should be character encoding issues, right?
Sooo, this is why all my new Rails apps produce unreadable XML with the new Builder...
Sam, can you guess in three steps why this was a very bad solution (as I figure you coined it)?
Sam, can you guess in three steps why this was a very bad solution
Encoding other than utf-8?
I can point to Rails apps that produced unreadable XML with the old builder.
If you provide more details, I can construct a test case and a fix.
Just to be clear, it is not exactly raw. There are a number of characters that must be escaped: < and &, to name but two.
I just want to make sure that your issue is a cosmetic one, not a functional one. I guessed before what your issue might be, and apparently I guessed wrong. I’d like to not guess any more.
If it turns out that your issue is just the cosmetic one, I am quite prepared to make a patch that assumes that people that require 'jcode' and set $KCODE='u' are prepared to handle utf-8. Everyone else will get a bulkier, but safer, result.
Well, for me the issue is mostly cosmetic, yes - the feed becomes basically impossible to read in plain text. Enable escaping on all characters and try to examine your feed - that’s what I see. Besides, there are problems with your “escape-everything” approach:
1. I can’t handle feeds I might download using any simple text search tools (we had a discussion on that already)
2. I can’t grab these feeds and save them elsewhere without firing up a parser
3. Browsers have problems with this kind of escaping when it is used in the form context (for pre-filled values) when Builder is used to output HTML.
4. Every feed I now generate has grown in raw size, which means longer download times and (if escalated) a bigger bandwidth bill. Many of the people who have to read my feeds are on dialup.
5. RSS readers get confused when fed this stuff inside an HTML containing block within an entry; they don’t know what to display.
In short - unless you have an XML parser on the receiving side (which might not be the case) the “escape-all” approach is a no-no.
I would love to keep the encoding of basic entities like ampersands, but to be exempt from this... uhm... precaution. An xml.escape_utf = false or such would be perfect. Don’t expect UTF8 users in Ruby to be running with jcode all the time (I am not, for some time).
Browsers have problems with this kind of escaping when it is used in the form context (for pre-filled values) when Builder is used to output HTML.
Oh, really? What browser do you use? Can it handle this?
The easiest way to verify a string is UTF-8 is to send it unpack("U*")
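As a minimal sketch of that check (the helper name `utf8?` is made up for illustration): `unpack("U*")` raises `ArgumentError` when it hits a malformed UTF-8 sequence, so a rescue is enough to turn it into a yes/no test. On current Ruby versions, `String#valid_encoding?` on a string tagged as UTF-8 is a more direct way to ask the same question.

```ruby
# Hypothetical helper: true if the raw bytes decode cleanly as UTF-8.
def utf8?(bytes)
  bytes.b.unpack('U*') # raises ArgumentError on malformed UTF-8
  true
rescue ArgumentError
  false
end

puts utf8?("\xC2\xA9 2006")  # copyright sign encoded as UTF-8
puts utf8?("\xA9 2006")      # same character as a single iso-8859-1 byte
```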
If you were to check the documentation, you would see that Jim has already provided means of inserting strings verbatim via the shift and symbol operators. Those that use these operators need to take extra care. I’m interested in helping everybody else out.
I said I would write a patch to allow those that use jcode and pass in correct utf-8 to not have high bit characters escaped. I have now done so.
I’m merely a person submitting patches. It is Jim that you need to convince.
Julik, something you might want to try first:
require 'builder'

b = Builder::XmlMarkup.new
b.rights "\xC2\xA9 2006"
puts "before: " + b.target!

class Builder::XmlMarkup
  def target!
    @target.gsub(/&#(\d+);/) {[$1.to_i].pack('U*')}
  end
end

puts "after: " + b.target!
I’ve written a C implementation of your code here:
It extends the String class by providing the fast_xs method to it (equivalent to to_xs) and is roughly 70 times faster. Hooked into Builder::XmlMarkup, this provides roughly a ten-fold increase on some RSS feeds I’m testing with Rails (1.2.3).
I’ll tell Jim and _why (Hpricot) about it, too.
Hi Sam,
I’m using Builder and attempting to send serialized ruby into an xml node (Marshal.dump) — this doesn’t seem to be working though when I call Marshal.load after parsing the xml. Is this ruby’s Marshal object or a limitation of the encoding scheme?
Thanks,
Matt