I’ve known that str.split had an idiom which split strings into characters (by default, bytes in Ruby 1.8) if passed a regular expression that matched a zero-width string, thus:
string.split(//)
What I didn’t know was that this idiom was sensitive to either the encoding of the regular expression (example: //u) or the value of KCODE.
But that’s not the most important part. The interesting thing is how this function deals with strings that are not correctly encoded UTF-8. It doesn’t complain; it simply splits them too, leaving the bad bytes in pieces of their own. This makes cleanup (either simple removal or replacement with substitution characters) a breeze.
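As a rough sketch of that kind of cleanup on a newer Ruby (an assumption: Ruby 1.9 or later, where String#each_char and String#valid_encoding? exist, rather than the 1.8 split(//u) idiom discussed here), one can walk the string character by character and drop or replace whatever fails validation. The helper name clean_utf8 is hypothetical.

```ruby
# Hypothetical helper: remove or replace bytes that are not valid UTF-8.
# Assumes Ruby 1.9+; this is not the Ruby 1.8 split(//u) idiom itself.
def clean_utf8(str, replacement = "")
  # dup so the caller's string is untouched, then treat the bytes as UTF-8
  utf8 = str.dup.force_encoding(Encoding::UTF_8)
  # malformed bytes come out as "characters" that fail valid_encoding?
  utf8.each_char.map { |ch| ch.valid_encoding? ? ch : replacement }.join
end

clean_utf8("\xF0elephant")       # simple removal of the stray byte
clean_utf8("\xFFelephant", "?")  # replacement with a substitution character
```

Note that a newer Ruby isolates only the stray lead byte, so the ASCII letters after it survive.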
Learning about this was certainly the highlight of my holiday weekend.
I guess that shows what an exciting life I lead ...
Thanks for the correspondence on the subject.
Reverse engineering information (originally produced by Jacques):
$ irb
irb(main):001:0> "\xF0elephant".split(//u)
=> ["\360ele", "p", "h", "a", "n", "t"]
irb(main):002:0> "\xEFelephant".split(//u)
=> ["\357el", "e", "p", "h", "a", "n", "t"]
irb(main):003:0> "\xC2elephant".split(//u)
=> ["\302e", "l", "e", "p", "h", "a", "n", "t"]
irb(main):004:0> "\xFFelephant".split(//u)
=> ["\377", "e", "l", "e", "p", "h", "a", "n", "t"]
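The four results are consistent with a splitter that trusts the length announced by the lead byte and never checks the bytes that follow. Here is a minimal emulation of that reading, written for Ruby 2.0 or later. The names declared_length and naive_split are hypothetical, and the handling of lead bytes other than the four tested above is an assumption, not something the session shows.

```ruby
# Hypothetical: number of bytes a UTF-8 lead byte claims for its sequence.
def declared_length(byte)
  case byte
  when 0xC0..0xDF then 2  # e.g. \xC2 in the session above
  when 0xE0..0xEF then 3  # e.g. \xEF
  when 0xF0..0xF7 then 4  # e.g. \xF0
  else 1                  # ASCII, plus (assumed) stray bytes such as \xFF
  end
end

# Split into chunks by declared length only, without validating the
# continuation bytes. An emulation of the observed output, not Ruby's code.
def naive_split(str)
  bytes = str.bytes
  chunks = []
  i = 0
  while i < bytes.length
    n = declared_length(bytes[i])
    chunks << bytes[i, n].pack("C*")  # binary string holding this chunk
    i += n
  end
  chunks
end

naive_split("\xF0elephant")  # first chunk swallows "ele", as in the session
```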
Could be marginally improved: in UTF-8, no byte in a multi-byte sequence is below \x80, and every byte above \x7F is intended to be part of a multi-byte sequence. I guess if it were important, visible ASCII bytes could be reclaimed using something like "\360ele".gsub(/[^\x09\x0A\x0D\x20-\x7F]/,'').
Indeed, one can employ a more aggressive algorithm for “recovering” from bad utf-8. That’s, in fact, what browsers do. And it’s the discrepancy between the way browsers recover and the way Ruby does (as seen above) that is the basis of the XSS exploit Rails 2.3.4 is supposed to fix.
If you develop web apps in Ruby (not just Rails), and accept user input, it’s very important to clean the user’s utf-8 input before sanitizing it. Otherwise, when you try to sanitize the user’s input using something that compares characters, you run afoul of the fact that Ruby has a different notion of how to divide a malformed utf-8 string into characters than browsers do. So you may think you’ve sanitized the input (and have, as far as Ruby’s concerned), but you actually haven’t (as far as the browser’s concerned).
Instiki does the right thing on all user input. (The change is that it now applies String::purify, rather than merely rejecting bad utf-8.)
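To illustrate the ordering argued for above (clean first, then sanitize), here is a minimal sketch assuming Ruby 2.1 or later, where String#scrub is available. It is neither Instiki’s String::purify nor the Rails fix. The helper name clean_then_escape is hypothetical, and plain HTML escaping stands in for whatever sanitizer an application really uses.

```ruby
# Hypothetical helper: repair malformed UTF-8 first, escape second.
def clean_then_escape(input)
  escapes = { "&" => "&amp;", "<" => "&lt;", ">" => "&gt;", '"' => "&quot;" }
  # Step 1: replace every malformed byte sequence with U+FFFD, so that
  # Ruby and the browser agree on where the characters are.
  cleaned = input.dup.force_encoding(Encoding::UTF_8).scrub("\uFFFD")
  # Step 2: only now run the character-based sanitizing step.
  cleaned.gsub(/[&<>"]/, escapes)
end

clean_then_escape("\xF0<script>")  # the "<" can no longer hide behind \xF0
```

Done in the opposite order on Ruby 1.8, a stray \xF0 could swallow the following "<" into one "character", just as it swallowed "ele" in the irb session.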