Dealing with broken utf-8 with Ruby
I’ve known for a while that str.split has an idiom for splitting a string into characters (by default, bytes in Ruby 1.8): pass it a regular expression that matches a zero-width string, thus:
string.split(//)
What I didn’t know was that this idiom is sensitive to either the encoding of the regular expression (example: //u) or the value of KCODE.
But that’s not the most important part. The interesting thing is how this function deals with strings that are not correctly encoded UTF-8. It doesn’t complain; it simply splits the invalid bytes out as elements of their own. This makes cleanup (either simple removal or replacement with substitution characters) a breeze.
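Here is a rough sketch of that cleanup idea. It makes some assumptions that are not from the notes above: it targets Ruby 1.9 or later, where strings carry their own encoding, and it uses String#chars plus String#valid_encoding? in place of split(//u), since I would not count on regexp-based splitting treating invalid bytes the same way there. The helper name clean_utf8 and its replacement parameter are made up for illustration.

```ruby
# clean_utf8 is a hypothetical helper, not part of Ruby or of the original post.
# It treats the input as UTF-8, walks it character by character, and swaps
# every invalidly encoded piece for the given replacement string.
def clean_utf8(bytes, replacement = "?")
  # dup so the caller's string is not relabelled in place
  str = bytes.dup.force_encoding(Encoding::UTF_8)
  # String#chars hands back invalid bytes as elements of their own,
  # so each one can be tested and replaced individually
  str.chars.map { |ch| ch.valid_encoding? ? ch : replacement }.join
end

# A valid two-byte "é" (C3 A9) followed later by a stray 0xFF byte
broken = "caf\xC3\xA9 \xFF ok".b

cleaned = clean_utf8(broken)      # substitute a placeholder character
removed = clean_utf8(broken, "")  # or simply drop the bad bytes

puts cleaned
puts removed
```

Passing an empty replacement gives the "simple removal" variant, while any other string acts as the substitution character. On Ruby 2.1 and later the built-in String#scrub does much the same job.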