It’s just data

ASCII, ISO-8859-1, UCS, and Erlang

Tony Garnock-Jones: Erlang represents strings as lists of (ASCII, or possibly iso8859-1) codepoints. In this regard, it’s weakly typed - there’s no hard distinction between a string, “ABC”, and a list of small integers, [65,66,67].

It is important to realize that Erlang was invented (in 1987) before UTF-8 was (in 1992).

Codepoints

Let’s explore the relationship between ASCII, ISO-8859-1, and UCS (a.k.a. Unicode), by way of example.

First, let’s look at U+0043: Latin capital letter C.  The codepoint for this character in UCS is 67 decimal.  The codepoint for this character in ISO-8859-1 is 67 decimal.  The codepoint for this character in ASCII is 67 decimal.

Next, let’s take a look at U+00C7: Latin capital letter C with cedilla.  The codepoint for this character in UCS is 199 decimal.  The codepoint for this character in ISO-8859-1 is 199 decimal.  This character can’t be represented in ASCII.

Finally, let’s look at U+0421: Cyrillic capital letter Es.  The codepoint for this character is 1057 decimal.  This character can’t be represented in either ASCII or ISO-8859-1.

Given no other information, I would suggest that a string in Erlang be treated as a list of UCS codepoints, where UCS is a proper superset of ISO-8859-1, which in turn is a proper superset of ASCII.
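To make the "list of codepoints" view concrete, here is a minimal sketch. The module name codepoints_demo and the function examples/0 are made up for illustration; the integer values are the ones given in the text above.

```erlang
-module(codepoints_demo).
-export([examples/0]).

%% Returns a list of {Label, Boolean} pairs; every Boolean is true.
examples() ->
    [%% A string literal is just a list of integers.
     {string_is_list,       "ABC" =:= [65, 66, 67]},
     %% $C is the character literal for Latin capital letter C.
     {char_literal,         $C =:= 67},
     %% U+00C7 and U+0421 written in hexadecimal and in decimal.
     {c_cedilla_is_199,     16#C7 =:= 199},
     {cyrillic_es_is_1057,  16#421 =:= 1057},
     %% One list can hold ASCII, ISO-8859-1, and other UCS codepoints together.
     {mixed_list_length_3,  length([$C, 16#C7, 16#421]) =:= 3}].
```

Nothing in the list itself records which character set was intended, which is why some convention, such as "treat it as UCS", has to be chosen.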

Binary

As of Unicode 5.0.0, 102,012 code points are defined.  This number is a bit larger than 256, which is the number of possible values that can be stored in a byte.  So, in general, UCS codepoints will require more than one byte to be represented.

ASCII is simple.  Everything is stored in one byte.  A bit incomplete, but simple.

ISO-8859-1 is simple.  Everything is stored in one byte.  A bit incomplete (but not as incomplete as ASCII), but still simple.

UTF-32 is simple.  Everything gets 32 bits.  A bit wasteful, but simple.

UTF-16 is nearly as simple.  Code points less than 65,536 are stored as two bytes.  Everything else is stored as four.  This works because the range of UCS isn’t contiguous; in particular, the range of U+D800 to U+DFFF is reserved for “surrogate characters”.

Be forewarned that there actually are two versions of UTF-32 and UTF-16, depending on whether your machine is big endian or little endian.
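The fixed and nearly fixed widths, and the two byte orders, can be seen with the utf32 and utf16 segment types of Erlang's bit syntax. This is only a sketch: it assumes an Erlang/OTP release recent enough to support those segment types, the module name utf_widths_demo is invented, and U+1D11E is simply one codepoint above 65,535 picked for the example.

```erlang
-module(utf_widths_demo).
-export([utf32/1, utf16_big/1, utf16_little/1]).

%% Encode a single codepoint. Big endian is the default byte order
%% for the utf16 and utf32 segment types.
utf32(CP) -> <<CP/utf32>>.

utf16_big(CP) -> <<CP/utf16>>.

%% Same encoding, but with the bytes of each 16-bit unit swapped.
utf16_little(CP) -> <<CP/utf16-little>>.
```

With this module, utf32($C) is <<0,0,0,67>>, utf16_big(16#421) is <<4,33>>, and utf16_little(16#421) is <<33,4>>. For a codepoint above 65,535, utf16_big(16#1D11E) is the four bytes <<16#D8,16#34,16#DD,16#1E>>, a pair of values taken from the reserved surrogate range.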

UTF-8 is simple for those characters which it shares with ASCII.  Those characters require only one byte.  Everything else requires more than one byte.  So a Latin capital letter C is 0x43 in UTF-8.  A Latin capital letter C with cedilla is the two bytes 0xC3 0x87.  A Cyrillic capital letter Es is the two bytes 0xD0 0xA1.  One important aspect of UTF-8 is that a sequence of bytes written in some other encoding, such as ISO-8859-1, and containing at least one non-ASCII character will only rarely happen to be valid UTF-8.
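The three byte sequences can be checked with the utf8 segment type of the bit syntax. As before, this is a sketch that assumes an Erlang/OTP release with Unicode segment types; utf8_bytes_demo is a made-up module name.

```erlang
-module(utf8_bytes_demo).
-export([utf8/1]).

%% Encode one codepoint as UTF-8 (one to four bytes).
utf8(CP) -> <<CP/utf8>>.
```

Here utf8($C) is <<16#43>>, utf8(16#C7) is <<16#C3,16#87>>, and utf8(16#421) is <<16#D0,16#A1>>.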

For this reason, I would suggest that an RFC 4627 JSON codec for Erlang that chooses to represent strings as binaries make the assumption that binary sequences are UTF-8; and furthermore that those bytes that cannot be interpreted as UTF-8 be treated as ISO-8859-1.  That sounds complicated, but that’s exactly what this patch does, i.e., if the next two, three, or four bytes match one of the UTF-8 patterns, then those bytes are treated as a single character; otherwise that one byte is treated as a single character.

If this approach is taken, all ASCII binary streams will be interpreted correctly, as will all UTF-8 binary streams.  As a bonus: nearly all ISO-8859-1 binary streams will be too.
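The fallback rule can be sketched in a few clauses. This is not the patch itself: the module name lenient_utf8 is hypothetical, and instead of hand-written bit patterns it leans on the utf8 segment type, which assumes a reasonably recent Erlang/OTP and is strict about what counts as well-formed UTF-8.

```erlang
-module(lenient_utf8).
-export([decode/1]).

%% decode(Binary) -> [Codepoint]
%% Reads the binary as UTF-8 where possible. Any byte that does not
%% begin a well-formed UTF-8 sequence is taken as one ISO-8859-1 character.
decode(Bin) when is_binary(Bin) ->
    decode(Bin, []).

decode(<<>>, Acc) ->
    lists:reverse(Acc);
decode(<<CP/utf8, Rest/binary>>, Acc) ->
    %% One to four bytes formed a valid UTF-8 sequence.
    decode(Rest, [CP | Acc]);
decode(<<Byte, Rest/binary>>, Acc) ->
    %% Fallback: an ISO-8859-1 codepoint equals its byte value.
    decode(Rest, [Byte | Acc]).
```

With this sketch, decode(<<16#C3,16#87>>) gives [199], and the ISO-8859-1 bytes <<16#C7,$a>> give [199,97], because 16#C7 followed by an ASCII byte is not valid UTF-8. The "nearly all" caveat shows up when ISO-8859-1 text happens to contain a byte pair such as 16#C3 followed by 16#87: that pair is read as one character rather than two.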

Converting a string to binary in Erlang

Converting a UCS string to binary can be done with list_to_binary(xmerl_ucs:to_utf8(Value)).  This pair of function calls will work for all positive integers which represent valid Unicode codepoints, including all codepoints that may be defined in the foreseeable future (and, yes, from time to time, new characters are added).
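As a small sketch of that pair of calls wrapped in a function (the module and function names are invented for the example, and xmerl, which ships with Erlang/OTP, is assumed to be on the code path):

```erlang
-module(ucs_to_binary_demo).
-export([to_utf8_binary/1]).

%% Convert a list of UCS codepoints to a UTF-8 encoded binary,
%% using the two calls named in the text.
to_utf8_binary(Codepoints) ->
    list_to_binary(xmerl_ucs:to_utf8(Codepoints)).
```

For the three example characters, to_utf8_binary([67, 199, 1057]) yields <<67,195,135,208,161>>: one byte for C, two each for the others. In Erlang/OTP releases that include the unicode module, unicode:characters_to_binary/1 should produce the same binary for such a list.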

Converting an ISO-8859-1 string to binary can be done with list_to_binary(Value).  This function call will fail if it encounters an element of the list which is greater than 255.  This call will result in the same binary representation as the previous call for all codepoints less than 128.  It will result in a different binary representation than the previous call for all codepoints greater than 127.
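A sketch of both points, the failure on out-of-range elements and the agreement with UTF-8 only for codepoints below 128, might look like this. The module and function names are hypothetical, and wrapping the failure in {error, badarg} is just one way to make it visible.

```erlang
-module(latin1_to_binary_demo).
-export([to_latin1_binary/1, same_as_utf8/1]).

%% Convert a list of ISO-8859-1 codepoints (0..255) to a binary.
%% Returns {error, badarg} instead of crashing on an out-of-range element.
to_latin1_binary(Codepoints) ->
    try list_to_binary(Codepoints) of
        Bin -> {ok, Bin}
    catch
        error:badarg -> {error, badarg}
    end.

%% True when the ISO-8859-1 bytes and the UTF-8 bytes are identical.
%% Expects every codepoint to be in the range 0..255.
same_as_utf8(Codepoints) ->
    list_to_binary(Codepoints) =:= list_to_binary(xmerl_ucs:to_utf8(Codepoints)).
```

Here to_latin1_binary([199]) returns {ok, <<199>>}, while to_latin1_binary([1057]) returns {error, badarg}. same_as_utf8("ABC") is true, and same_as_utf8([199]) is false, since the UTF-8 form of 199 is two bytes.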

Converting an ASCII string to binary can be done with either of the above two methods.

Footnotes

For completeness, there are two other things that may be worth exploring.  Neither requires much in the way of code, merely a few additional patterns to be matched.

Tony, if you are interested in pursuing any of these ideas in rfc4627.erl, I can provide test cases and/or patches.  Let me know.


People who want a little more depth on all these UTF-whatevers may find [link] useful.

Posted by Tim Bray at

I would suggest that Erlang grows a real string type. Dealing with strings represented by something else is just a kludge.

Posted by Manuzhai at

WF I: Erlang Ho!

This is the first progress report from the Wide Finder Project . Erlang is the obvious candidate for a Wide Finder implementation. It may be decades old but it’s the new hotness, it’s got a PragBook ( @Amazon ), I hear heavy breathing from serious...

Excerpt from ongoing at

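[erlang] UCS, UTF-8, and so on.

Sam Ruby: ASCII, ISO-8859-1, UCS, and Erlang [link] Erlang Notes [link] I suppose the important thing is making regular expressions and UTF-8 easy to use. As for regular expressions, a fast implementation was introduced in R12B-2...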

Excerpt from Twisted Mind at

links for 2008-04-26

defmacro - Erlang Style Concurrency Another good explanation of what makes Erlang’s concurrency / scalability design so compelling. (tags: advocacy concurrency erlang programming threads java ) ASCII, ISO-8859-1, UCS, and Erlang (Sam Ruby) (tags:...

Excerpt from have browser, will travel at

[TIP] Handling Korean, UTF-8, or non-ASCII text in Erlang (list_to_binary)

To save an XML file, I wrote the following function. save_xml(Path, RootEl) ->     {ok,IOF}=file:open(Path,[write]),     Export=xmerl:export_simple([RootEl], xmerl_xml),     io:format(IOF,"~s~n", [lists:flatten(Export)]). It raised an error. 83>...

Excerpt from 개발, 검색, 함수 at

joshix: @DeepSpawn This may be interesting, tho not really an "answer": http://intertwingly.net/blog/2007/09/14/ASCII-ISO-8859-1-UCS-and-Erlang

joshix’s status on Sunday, 25-Oct-09 21:34:18 UTC...

Excerpt from frankenspock and friends at
