Sam Ruby

Dealing with broken utf-8 with Ruby

2009-09-07T19:14:30-04:00

I’ve known that str.split had an idiom which split strings into characters (by default, bytes in Ruby 1.8) if passed a regular expression that matched a zero-width string, thus:

string.split(//)

What I didn’t know was that this idiom was sensitive to either the encoding of the regular expression (example: //u) or the value of KCODE.

But that’s not the most important part. The interesting thing is how this function deals with strings that are not correctly encoded UTF-8. It doesn’t complain; it simply splits out the malformed bytes as elements of their own. This makes cleanup (either simple removal or replacement with substitution characters) a breeze.
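As a rough illustration of that cleanup step, here is a minimal sketch for Ruby 1.9 or later, where strings carry an encoding and the //u and KCODE machinery no longer applies. It assumes that each_char hands back a stray byte as a one-byte "character" that fails valid_encoding?. The helper name clean_utf8 and its replacement parameter are hypothetical, not something from Rails or Instiki.

```ruby
# Hypothetical helper (not from the post): remove or replace every
# piece of a string that is not valid UTF-8.
def clean_utf8(str, replacement = "")
  # Work on a copy tagged as UTF-8 so the caller's string is untouched.
  utf8 = str.dup.force_encoding(Encoding::UTF_8)
  # Keep well-formed characters; swap anything malformed for the replacement.
  utf8.each_char.map { |ch| ch.valid_encoding? ? ch : replacement }.join
end

puts clean_utf8("\xF0elephant")       # simple removal
puts clean_utf8("\xF0elephant", "?")  # substitution character
```

On Ruby 2.1 and later the built-in String#scrub serves a similar purpose and is probably the better choice in real code.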

Changes: rails, instiki.

Dealing with broken utf-8 with Ruby

2009-09-07T21:53:36-04:00

Learning about this was certainly the highlight of my holiday weekend.

I guess that shows what an exciting life I lead ...

Thanks for the correspondence on the subject.

Dealing with broken utf-8 with Ruby

2009-09-08T04:42:23-04:00

Reverse engineering information (originally produced by Jacques):

$ irb
irb(main):001:0> "\xF0elephant".split(//u)
=> ["\360ele", "p", "h", "a", "n", "t"]
irb(main):002:0> "\xEFelephant".split(//u)
=> ["\357el", "e", "p", "h", "a", "n", "t"]
irb(main):003:0> "\xC2elephant".split(//u)
=> ["\302e", "l", "e", "p", "h", "a", "n", "t"]
irb(main):004:0> "\xFFelephant".split(//u)
=> ["\377", "e", "l", "e", "p", "h", "a", "n", "t"]

Could be marginally improved: in UTF-8, every byte of a multi-byte sequence is above \x7F, and every byte above \x7F is intended to be part of a multi-byte sequence. I guess if it were important, the visible ASCII bytes could be reclaimed using something like "\360ele".gsub(/[^\x09\x0A\x0D\x20-\x7F]/,'').
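A byte-level sketch of the same idea, written so that it does not depend on how a given Ruby version matches regular expressions against malformed strings. The helper name reclaim_ascii is hypothetical. Note that it keeps exactly the byte set from the character class above and therefore throws away every non-ASCII byte, including those belonging to well-formed multi-byte characters.

```ruby
# Hypothetical helper: keep only tab, LF, CR and the bytes 0x20..0x7F,
# mirroring the character class [\x09\x0A\x0D\x20-\x7F].
def reclaim_ascii(str)
  kept = str.bytes.select do |b|
    b == 0x09 || b == 0x0A || b == 0x0D || (0x20..0x7F).cover?(b)
  end
  # Rebuild a string from the surviving bytes; all of them are ASCII.
  kept.pack("C*").force_encoding(Encoding::US_ASCII)
end

puts reclaim_ascii("\xF0ele")
```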

Dealing with broken utf-8 with Ruby

2009-09-08T17:26:49-04:00

Indeed, one can employ a more aggressive algorithm for “recovering” from bad utf-8. That’s, in fact, what browsers do. And it’s the discrepancy between the way browsers recover and the way Ruby does (as seen above) that is the basis of the XSS exploit Rails 2.3.4 is supposed to fix.

If you develop web apps in Ruby (not just Rails), and accept user input, it’s very important to clean the user’s utf-8 input before sanitizing it. Otherwise, when you try to sanitize the user’s input using something that compares characters, you run afoul of the fact that Ruby has a different notion of how to divide a malformed utf-8 string into characters than browsers do. So you may think you’ve sanitized the input (and have, as far as Ruby’s concerned), but you actually haven’t (as far as the browser’s concerned).
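The ordering described here (clean first, then sanitize) might be sketched as follows for Ruby 2.1 or later. This is only an illustration under stated assumptions: clean_then_escape and HTML_ESCAPES are hypothetical names, the escaping step is a bare-bones stand-in for whatever sanitizer an application really uses, and String#scrub is swapped in for the cleanup; it is not the code Rails or Instiki ship.

```ruby
# Hypothetical escape table; a real application would use its framework's sanitizer.
HTML_ESCAPES = {
  "&" => "&amp;", "<" => "&lt;", ">" => "&gt;",
  '"' => "&quot;", "'" => "&#39;"
}.freeze

# Hypothetical helper showing the order of operations only.
def clean_then_escape(input)
  # Step 1: repair the encoding. scrub replaces each malformed sequence
  # with U+FFFD, so no stray lead byte can be grouped with the bytes after it.
  cleaned = input.dup.force_encoding(Encoding::UTF_8).scrub
  # Step 2: only now run the character-based escaping.
  cleaned.gsub(/[&<>"']/, HTML_ESCAPES)
end

puts clean_then_escape("\xF0<script>")
```

The input in the usage line puts a stray \xF0 in front of a tag, the same kind of byte that the irb session earlier in the thread shows Ruby 1.8 grouping with the ASCII bytes that follow it.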

Instiki does the right thing on all user input. (The change is that it now applies String::purify, rather than merely rejecting bad utf-8.)