Ruby 1.9 Strings — Updated
My confusion from yesterday was due to a bug, which was promptly fixed — test case, fix.
Now that I understand what is intended, the situation is a lot clearer. In Python 3.0, there are two types of strings, Bytes and Unicode, and the determination of the type is static. With Ruby 1.9, there is one type of string, and the associated encoding is mutable. The internal state of a given sequence of bytes with respect to the current encoding is one of UNKNOWN, 7BIT, VALID, or BROKEN. UNKNOWN is a mechanism to delay the binding. The combination of the bug and the delayed binding made the situation confusing, as the correctness of the result depended on the order in which the operations were performed.
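A minimal sketch of the one-string-type, mutable-encoding model, assuming a current Ruby interpreter. The helper names encoding_state and relabel are hypothetical and not part of Ruby. The sketch relies only on the public probes String#ascii_only? and String#valid_encoding?, so it can report 7BIT, VALID, or BROKEN, but it cannot observe UNKNOWN, since probing is what resolves that state.

```ruby
# Hypothetical helper: classify a string using only public probes.
# UNKNOWN never shows up here because probing forces the state to be computed.
def encoding_state(str)
  return :seven_bit if str.ascii_only?
  str.valid_encoding? ? :valid : :broken
end

# Hypothetical helper: same bytes, different encoding label.
# force_encoding changes only the label; no bytes are converted.
def relabel(str, encoding)
  str.dup.force_encoding(encoding)
end

# "caf" followed by the two bytes that encode e-acute in UTF-8.
raw = [0x63, 0x61, 0x66, 0xC3, 0xA9].pack('C*')

as_utf8  = relabel(raw, Encoding::UTF_8)
as_ascii = relabel(raw, Encoding::US_ASCII)

p encoding_state(as_utf8)                           # => :valid
p encoding_state(as_ascii)                          # => :broken
p encoding_state(relabel("cafe", Encoding::UTF_8))  # => :seven_bit
p as_utf8.bytes == as_ascii.bytes                   # => true
```

The same five bytes are VALID under one label and BROKEN under another, which is the sense in which the encoding is a mutable property of the string rather than a static type.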
The bug affected gsub!, but not sub, sub! or gsub. In the released 1.9.0 version of Ruby, gsub! did not update the state of the resulting string. Oops. Now that this is corrected, everything works as expected, for some values of expected. Things I was not previously aware of:
- Array#pack does not set the encoding. For some cases, it is arguable that the encoding could be inferred, including the common idiom of [Fixnum].pack('U'), but Ruby 1.9 makes no attempt to do so. Fixnum.chr(Encoding) is the preferred alternative.
- String#ascii_only? and String#valid_encoding? may be used to probe the internal state of a given string. Once probed, the state is no longer UNKNOWN.
- The meaning of "\xXX" depends on the encoding declared in the source file. This may turn out to be handy.
- Locale environment variables only affect the interpretation of data files, not source files. This policy seems defensible.
- In addition to "\uXXXX", Unicode strings may be expressed as "\u{X}", where X may be a space-separated sequence of hex strings of any length. "\u{10346}" is a Faihu character and "\u{a3 a5 20ac}" produces the pound, yen, and euro characters.
The net result of all this is that any sequence of operations that produces a runtime exception in Ruby 1.9 would also produce a runtime exception in Python 3.0. Some use cases that are entirely safe will not produce an exception in Ruby 1.9 when they would in Python 3.0. Such an approach is entirely consistent with a dynamic language.
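One way to see the runtime check in action is string concatenation, sketched here under the assumption of a current Ruby interpreter. The helpers try_concat and latin1 are hypothetical names. Treating the ASCII-only case as an example of an "entirely safe" use is my reading of the paragraph above, not something the post spells out.

```ruby
# Hypothetical helper: report the encoding of left + right, or the
# exception class if Ruby refuses to combine the two strings.
def try_concat(left, right)
  (left + right).encoding
rescue Encoding::CompatibilityError => error
  error.class
end

# Hypothetical helper: build a string from raw bytes labelled ISO-8859-1.
def latin1(bytes)
  bytes.pack('C*').force_encoding(Encoding::ISO_8859_1)
end

utf8_cafe   = "caf\u00e9"                       # UTF-8, contains a non-ASCII character
latin1_cafe = latin1([0x63, 0x61, 0x66, 0xE9])  # ISO-8859-1, contains a non-ASCII byte
latin1_abc  = latin1([0x61, 0x62, 0x63])        # ISO-8859-1, but 7-bit only

# Two non-ASCII strings with different encodings: rejected at runtime.
p try_concat(utf8_cafe, latin1_cafe) == Encoding::CompatibilityError  # => true

# A 7-bit string is compatible with the other operand, so no exception.
p try_concat(utf8_cafe, latin1_abc) == Encoding::UTF_8                # => true
```

The check depends on the content of the strings as well as their labels, which is why the failure can only be detected at runtime.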
Sam Ruby: Ruby 1.9 Strings - Updated
Sam Ruby: Ruby 1.9 Strings - Updated. A follow up to yesterday’s post: Sam’s principal complaints about Ruby 1.9’s character encoding support were down to a bug which has now been fixed....Excerpt from Simon Willison's Weblog at
Sam Ruby: Ruby 1.9 Strings — Updated
A useful explanation of some of the details of how Ruby 1.9 handles unicode...Excerpt from del.icio.us/tag/ruby at
Sam, I feel like you’ve given us only the barest taste of the detail about Ruby Unicode. Where do we get the whole enchilada? For example, what happens if I concatenate strings with different encodings? How fast is character addressing? What version of Unicode is supported? What is the relationship to non-Unicode character sets? What character sets and encodings are supported? Is there a document that answers these kinds of questions?
Posted by Paul Prescod at
Where do we get the whole enchilada?
If I knew that, I would simply have pointed to it.
Posted by Sam Ruby at
links for 2007-12-30
Sam Ruby: Ruby 1.9 Strings — Updated A useful explanation of some of the details of how Ruby 1.9 handles unicode (tags: ruby strings unicode) Recommend this post:...Excerpt from a work on process at
Ruby 1.9: Not For Rails
Do NOT install or upgrade to Ruby 1.9 if you’re using Ruby for Rails development. There, that warning ought to suffice. On Dec. 25 Matz announced that a development release of Ruby 1.9 was available in which the Ruby 1.9 spec has been frozen:...Excerpt from Binary Code at
Unicode Strings and byte buffers
Prior to Unicode there was ASCII or ISO 8859-1 (except for Microsoft that used their own encoding to lock-in users) and string manipulation was not hard. Now, Unicode is the future since everyone wants an easy solution to integrate all the...Excerpt from edpeur public mind dump at
sjs on 3 + 1 = 2: I think I prefer Ruby 1.8’s non-support for Unicode over Ruby 1.9’s “support”.
Sam Ruby has also updated the article, and has a [new post]([link]) about Ruby 1.9 strings. It was a bug in `gsub!`....Excerpt from reddit.com: what's new online at