I’ve got portions of HTML5lib working on Ruby 1.9, enough to pass Mars's unit tests. My initial reaction to Ruby 1.9’s support isn’t favorable. I definitely like Python 3K's Unicode support better. This feels closer to Python 2.5. In fact, I think I prefer Ruby 1.8’s non-support for Unicode over Ruby 1.9’s “support”.
The problem is one that is all too familiar to Python programmers: you can have a fully unit-tested library, and the moment somebody passes you a bad string, you will fall over.
An example that fails with Ruby 1.9:
[0x2639].pack('U') + "\u2639"
The error that is produced is ArgumentError: character encodings differ. The left hand side specifies packing as UTF-8. The right hand side is expressed as Unicode, which Ruby represents as #<Encoding:UTF-8>. The problem is that the left hand side is actually stored as #<Encoding:ASCII-8BIT>, which is a misnomer. In many ways this mirrors Python 2.x’s <type 'str'> vs <type 'unicode'>, except that with Ruby 1.9 both Strings are the same type.
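The mismatch can still be reproduced on a current Ruby, though the details have shifted: pack('U') now returns a UTF-8 string, so the ASCII-8BIT tagging that 1.9.0 applied has to be forced explicitly here, and the exception class is Encoding::CompatibilityError rather than ArgumentError. A minimal sketch under those assumptions:

```ruby
# On Ruby 1.9.0, pack('U') itself returned an ASCII-8BIT string; on
# current Rubies it returns UTF-8, so we force the old tagging here.
left  = [0x2639].pack('U').force_encoding('ASCII-8BIT')
right = "\u2639"                      # literal tagged UTF-8

p left.encoding.name                  # "ASCII-8BIT"
p right.encoding.name                 # "UTF-8"

begin
  left + right                        # incompatible: neither side is 7-bit clean
rescue Encoding::CompatibilityError => e
  p e.class                           # the modern analogue of the ArgumentError above
end
```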
Ruby 1.9 both mitigates and compounds the problem by providing a number of implicit conversions. Sometimes. Take a look at this code which produces this output. Specifically, look at rows 2 and 4, where two Strings, of the same type, encoding, length, and value produce different results when concatenated with UTF-8 strings. This type of magic destroys any confidence I have in unit testing as a viable strategy.
Update: no magic, just a bug.
My preference would be that #<Encoding:ASCII-8BIT> be abolished, in favor of #<Encoding:ASCII-7BIT> and a separate Bytes class. Generally, programmers would only see objects of class Bytes if they do “binary” file I/O, explicitly create constants of that type, or invoke methods such as String#bytes.
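For what it’s worth, String#bytes already exposes the byte-level view, albeit as a sequence of integers rather than a distinct Bytes class:

```ruby
s = "\u2639"                 # U+2639, three bytes in UTF-8
p s.length                   # 1  (characters)
p s.bytes.to_a               # [226, 152, 185]  (the UTF-8 bytes E2 98 B9)
```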
Other suggestions:
Array#pack('U') should behave like .map {|n| n.chr('UTF-8')}.join
If Ruby is going to support the specification of the default encoding on the command line, it should support Locale environment variables too.
If REXML is going to remain in the core libraries for Ruby, it should have a thorough audit. As XML is defined in terms of Unicode, REXML should never return binary strings. It also needs to be checked to prevent things like this from showing through:
rexml/element.rb:555: warning: Hash#index is deprecated; use Hash#key
Frankly, I’m a bit concerned that REXML is essentially unmaintained at this point: the mailing list is unresponsive, bug reports appear to be addressed only sporadically, and new releases all too often seem to produce regressions.
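As an aside, on current Rubies the Array#pack('U') suggestion above already holds: pack('U') was changed after 1.9.0 to return UTF-8 strings, making it equivalent to the map/chr/join form. A quick check:

```ruby
codepoints = [0x2639, 0x263A]
a = codepoints.pack('U*')                          # pack as UTF-8 characters
b = codepoints.map { |n| n.chr('UTF-8') }.join     # the suggested equivalent

p a == b                     # true
p a.encoding.name            # "UTF-8" on current Rubies (ASCII-8BIT on 1.9.0)
```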
Sam,
It sounds like your complaint is with Array.pack and the rexml library, not with all of Unicode in Ruby 1.9.
Given that the point of Array.pack is to serialize data into byte strings, I think its behavior is probably correct as it is. Admittedly confusing, though. A documentation clarification is probably in order. (Though pack() has always been a confusing method!)
Instead of using pack to convert Unicode codepoints to strings, try the Integer#chr method, with the desired encoding as an argument. (Your comment system won’t allow me to enter an example: it must think that I’m embedding JS or something).
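The chr form mentioned here looks like this (sketched in, since the comment system ate the original example):

```ruby
smiley = 0x2639.chr('UTF-8')   # Integer#chr with an explicit encoding
p smiley                       # "☹"
p smiley.encoding.name         # "UTF-8"
```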
I don’t know anything about the rexml library. But the 1.9.0 is not really expected to be stable yet, and I suspect that there are a number of libraries that haven’t been carefully ported yet.
Like so much of Ruby, I think you’ve got to give the Unicode support a chance to grow on you. I don’t understand why Matz made some of the choices he did, but they seem to work okay. Keep in mind, too, that the goal was not just to support Unicode but also to support Japanese encodings as well. So some of the design decisions might make a lot more sense to programmers who have to work with SJIS and EUC every day.
Finally, Ruby does inherit the default external encoding from the locale if you don’t specify an encoding with -K, -E or --encoding. This is the encoding assumed when you read from a file and do not specify a different encoding. (It is not used when you write to a file or read or write from a socket or pipe, however.) It respects the standard LC_CTYPE, LC_ALL, and LANG variables. Encoding.default_external returns the value. Encoding.locale_encoding didn’t make it into 1.9.0, but it is in the current sources and returns the default encoding for the locale even if -K, -E, or --encoding is specified.
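Sketched on a current Ruby (the exact values depend on the environment’s locale, so the outputs below are only examples):

```ruby
# Locale-derived defaults; output varies with LANG/LC_ALL/LC_CTYPE.
p Encoding.default_external        # e.g. #<Encoding:UTF-8>
p Encoding.locale_charmap          # e.g. "UTF-8" (the raw charmap name)
```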
(I attempt to explain all this in The Ruby Programming Language which should be in bookstores in about a month. I’m making the last-minute changes today.)
David Flanagan
I clearly am aware of Fixnum#chr, as evidenced by my first example.
For a pack-free example of inconsistent behavior, compare test1.rb with test2.rb. The character encodings only differ when UTF-8 is explicitly specified???!!!
Respecting LANG is good news for data files. Based on the example above, I take it that it doesn’t work for program files. Sorry for being unclear, that’s what I was referring to.
Sam,
Sorry that I didn’t read your post more carefully to see that you were already using chr. Given that chr exists to convert codepoints to characters, pack seems like a hacky way to attempt the same thing.
If a string literal contains a Unicode \u escape, then it will have utf-8 encoding.
Otherwise, if a string literal only has 7-bit ASCII characters, then it will have ASCII-8BIT encoding--essentially the legacy encoding from Ruby 1.8.
Strings that are not 7-bit clean take their encoding from the source encoding of the file. The source encoding is specified with the coding comment you have in your test1.rb. Files that do not have a coding comment like that take their source encoding from the -K, -E, or --encoding command-line option. And if none of those are specified, then they are assumed to be ASCII-encoded. So if you run test2.rb with -Ku it ought to work the same as test1.rb.
The fact that the meaning of a string literal is dependent on the source encoding means that it is really important to start your Ruby programs with a coding comment. And it also helps to explain why the source encoding of a file is not derived from the locale--changing the locale could break the program.
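The source-encoding dependence described above can be sketched by evaluating the same literal under two forced source encodings. This uses eval purely as a stand-in for two differently-encoded source files, and note that on Rubies since 2.0 the default source encoding is UTF-8, and 7-bit-clean literals get US-ASCII rather than the ASCII-8BIT of 1.9.0:

```ruby
# The same 7-bit-clean literal takes its encoding from the source encoding:
ascii_src = %q{"abc".encoding.name}.force_encoding('US-ASCII')
utf8_src  = %q{"abc".encoding.name}.force_encoding('UTF-8')
p eval(ascii_src)   # "US-ASCII"
p eval(utf8_src)    # "UTF-8"

# ...but a \u escape forces UTF-8 regardless of the source encoding:
p eval('"\u2639".encoding.name'.force_encoding('US-ASCII'))  # "UTF-8"
```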
Does this clarify anything? I’m not sure whether I’m actually addressing your point here or not.
Sam,
I was wrong in my first comment about Encoding.locale_encoding. That is a newly-added internal method, not exposed by the API. You can use Encoding.locale_charmap to obtain the encoding name (as a string) for the current locale, if you need, for some reason, to distinguish it from Encoding.default_external.
David
The fact that the meaning of a string literal is dependent on the source encoding means that it is really important to start your Ruby programs with a coding comment.
Did you actually try test1.rb? That code throws an exception if the coding comment is present, and works when it is not present. It took me quite a while to figure out why REXML (which uses pack, by the way) worked when the exact same code copied into my source file (which uses the recommended coding comment) did not.
I am still at a loss why the second row and fourth row differ (but, again, only if the really important coding comment is present). And if the coding comment is not present, you get a completely different set of results.
One thing I like about Python and Ruby is that they are both approachable. But for the life of me, Ruby 1.9’s behavior in this area is virtually unpredictable.