intertwingly

It’s just data

Dealing with broken utf-8 with Ruby


I’ve long known that str.split has an idiom that splits a string into characters (by default, bytes in Ruby 1.8) when passed a regular expression that matches a zero-width string, thus:

string.split(//)

What I didn’t know was that this idiom was sensitive to either the encoding of the regular expression (example: //u) or the value of KCODE.
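As a rough illustration only: on Ruby 1.9 or later, KCODE is gone and each string carries its own encoding, so the encoding of the string itself decides what the idiom produces. That version assumption is mine and is not what this post was written against. The sample word is also made up for the example.

```ruby
word = "caf\u00e9"          # UTF-8 string; the final character occupies two bytes

chars = word.split(//)      # UTF-8 string: one element per character
bytes = word.b.split(//)    # String#b gives a binary (ASCII-8BIT) copy: one element per byte

puts chars.length
puts bytes.length
```

The character count and the byte count differ because of that one two-byte character.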

But that’s not the most important part. The interesting thing is how this function deals with strings that are not correctly encoded UTF-8. It doesn’t complain; it simply splits the invalid bytes out too. This makes cleanup (either simple removal or replacement with substitution characters) a breeze.
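Here is a minimal sketch of that cleanup, assuming Ruby 1.9 or later. It uses String#chars and String#valid_encoding? in place of split(//), since I have not checked how current Rubies treat a regular expression applied to an invalid string. The helper name clean_utf8 and the "?" substitution character are hypothetical, chosen only for this example.

```ruby
# Hypothetical helper: walk the string one character at a time and swap
# every piece that is not valid UTF-8 for a replacement string.
def clean_utf8(str, replacement = "?")
  # Work on a copy tagged as UTF-8 so the caller's string is left untouched.
  utf8 = str.dup.force_encoding(Encoding::UTF_8)

  # String#chars yields stray invalid bytes as elements of their own,
  # which makes them easy to spot with valid_encoding?.
  utf8.chars.map { |ch| ch.valid_encoding? ? ch : replacement }.join
end

broken = "caf\xC3\xA9 \xFFok"   # \xFF can never appear in well-formed UTF-8

puts clean_utf8(broken)         # prints "café ?ok"
puts clean_utf8(broken, "")     # simple removal: prints "café ok"
```

On Ruby 2.1 or later, the built-in String#scrub covers the same ground in a single call.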

Changes: rails, instiki.