Dealing with broken UTF-8 with Ruby

2009-09-07T23:14:30Z

I’ve known that str.split has an idiom that splits a string into characters (by default, bytes in Ruby 1.8) when passed a regular expression that matches a zero-width string, thus:

string.split(//)

What I didn’t know was that this idiom is sensitive to either the encoding of the regular expression (example: //u) or the value of $KCODE.
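To make the byte versus character distinction concrete, here is a minimal sketch. It assumes Ruby 2.0 or later, where each string carries its own encoding and that encoding (rather than the regular expression flag or $KCODE) decides how the zero-width split behaves, so it is an analogue of the Ruby 1.8 behavior described above and not a reproduction of it. The variable names are only illustrative.

```ruby
# Assumption: Ruby 2.0+, where the string's own encoding drives the split.
word = "h\u00e9llo"              # "héllo" in UTF-8; the é occupies two bytes

as_chars = word.split(//)        # UTF-8 string: split on character boundaries
as_bytes = word.b.split(//)      # String#b returns a binary copy: split per byte

puts as_chars.length             # number of characters
puts as_bytes.length             # number of bytes
```

The two counts differ by one because the accented letter is a single character but two bytes.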

But that’s not the most important part. The interesting thing is how this function deals with strings that are not correctly encoded UTF-8. It doesn’t complain; the invalid bytes simply get split out too. This makes cleanup (either simple removal or replacement with substitution characters) a breeze.
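As a rough sketch of that cleanup, the snippet below splits a string into characters and swaps out every piece that is not valid UTF-8. It assumes Ruby 1.9 or later and uses String#chars and String#valid_encoding? in place of the split idiom, since those are the calls I am sure behave this way on current Rubies. The helper name clean_utf8 and its replacement parameter are hypothetical, not something from rails or instiki.

```ruby
# Hypothetical helper, for illustration only.
# Splits the string into characters and replaces each piece that is not
# valid UTF-8 with `replacement` (pass "" to simply remove it).
def clean_utf8(string, replacement = "?")
  utf8 = string.dup.force_encoding(Encoding::UTF_8)
  utf8.chars.map { |ch| ch.valid_encoding? ? ch : replacement }.join
end

# "café" followed by a stray 0xFF byte, which is never valid in UTF-8.
broken = "caf\xC3\xA9 \xFFbar"

puts clean_utf8(broken)          # stray byte replaced with "?"
puts clean_utf8(broken, "")      # stray byte removed
```

On Ruby 2.1 and later, String#scrub covers the same ground in a single call; the explicit loop is shown here because it mirrors the split-then-filter approach.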

Changes: rails, instiki.