Aurélio, Küng, Stärk, Uña, Łuksza: these are but a few of the names of contributors to the ASF. Names which contain non-ASCII characters. Characters that Subversion doesn’t deal with consistently between Mac and other platforms.
I can map these names (as well as a few others) to Subversion-safe file names, albeit in a lossy manner, using the following JavaScript:
But at this point, it occurs to me that such a set of mappings must have been done before. The Wikipedia entry for Umlaut indicates that there is such a set of rules for Telex devices, but I have been unable to locate these rules. Anybody have a pointer?
I doubt that there is a worldwide authoritative mapping. A cursory search of the ITU-T website doesn’t turn up anything in this regard either. Transcription of umlauts and other non-ASCII characters depends on language, culture, nationality and time. There are too many different systems to account for them all. The German “rules” as described in the Duden are “ä” → “ae”, “ö” → “oe”, “ü” → “ue”, “ß” → “ss” and, where there is potential for misunderstanding, “ß” → “sz”. Other languages have different rules, since in those languages the German umlauts aren’t umlauts at all but diaereses or simply independent vowel sounds.
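The Duden rules quoted above are small enough to sketch as a lookup table. The following Python is only an illustration: the function name `transliterate_german` is made up, the uppercase entries are an assumption (the comment only lists the lowercase letters), and the context-dependent “ß” → “sz” variant is left out.

```python
import unicodedata

# Duden-style digraphs as quoted in the comment above.
# The uppercase entries are an assumption; the comment lists only lowercase.
GERMAN_DIGRAPHS = {
    "ä": "ae", "ö": "oe", "ü": "ue",
    "Ä": "Ae", "Ö": "Oe", "Ü": "Ue",
    "ß": "ss",
}

def transliterate_german(name: str) -> str:
    """Replace German umlauts and sharp s with ASCII digraphs (lossy)."""
    # Compose first so that decomposed input ("u" + U+0308) is also matched.
    composed = unicodedata.normalize("NFC", name)
    return "".join(GERMAN_DIGRAPHS.get(ch, ch) for ch in composed)

print(transliterate_german("Küng"))    # Kueng
print(transliterate_german("Tepaße"))  # Tepasse
```

Names from other languages (Uña, Łuksza) pass through unchanged, which is the point made above: this table is only right for German.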
Fun fact: in official German documents, transcribing people’s names is strongly discouraged because of the potential for ambiguity. This is one of the minor reasons that the capital sharp s was encoded in Unicode some time ago.
The politically correct thing to do would be to use the appropriate transcription rule from each person’s national alphabet, although asking everyone how their name should be written is tedious work. The sane thing to do is to fix Subversion’s Unicode problems. Until then, I’d escape non-ASCII characters in a way that lets them be safely transformed back. I like TeX’s notation for that since it’s still easily readable.
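As a rough sketch of what such a reversible, TeX-style escape could look like, the Python below handles only three accents (diaeresis, acute, tilde). The function names and the accent table are hypothetical, and the sketch assumes at most one combining mark per letter and no literal backslashes in the input.

```python
import re
import unicodedata

# Combining mark -> TeX accent character (deliberately incomplete).
TEX_ACCENTS = {"\u0308": '"', "\u0301": "'", "\u0303": "~"}
COMBINING = {tex: mark for mark, tex in TEX_ACCENTS.items()}
ESCAPED = re.compile(r"""\\(["'~])(.)""")

def tex_escape(name: str) -> str:
    """Rewrite accented letters as TeX accent commands, e.g. ü -> \\"u."""
    out = []
    for ch in unicodedata.normalize("NFD", name):
        if ch in TEX_ACCENTS and out:
            base = out.pop()
            out.append("\\" + TEX_ACCENTS[ch] + base)
        else:
            out.append(ch)
    return "".join(out)

def tex_unescape(escaped: str) -> str:
    """Invert tex_escape and return the name in composed (NFC) form."""
    restored = ESCAPED.sub(lambda m: m.group(2) + COMBINING[m.group(1)], escaped)
    return unicodedata.normalize("NFC", restored)

print(tex_escape("Küng"))                # K\"ung
print(tex_unescape(tex_escape("Küng")))  # Küng
```

Characters without a decomposition, such as Ł or ß, pass through untouched and would need their own entries (TeX uses \L and \ss for those).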
unidecode("Aurélio, Küng, Stärk, Uña, Łuksza") produces "Aurelio, Kung, Stark, Una, Luksza". I would have hoped for a mapping to digraphs for at least the characters containing umlauts. At least it does better with unidecode("Tepaße") => "Tepasse".
Meanwhile, clearly I need at least one more mapping:
There can be no single mapping that actually makes sense; it depends on the source language and culture. For example, my surname is Jägenstedt and I always write it in 7-bit ASCII as Jagenstedt. Seeing it as “Jaegenstedt” would surely make me at least a little annoyed. What I’d do is Unicode-normalize the name to NFD and then throw away all the combining characters (which would be the umlauts, diacritics, etc.). What you get is basically the same string “without all the squiggly stuff”. Of course I’d limit it to the characters for which there are compatibility issues.
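The NFD-then-strip approach described above takes only a few lines with Python’s standard `unicodedata` module. This is a minimal sketch; the function name `strip_combining` is invented for illustration.

```python
import unicodedata

def strip_combining(name: str) -> str:
    """Decompose to NFD, then drop every combining mark."""
    decomposed = unicodedata.normalize("NFD", name)
    # unicodedata.combining() returns 0 for base characters.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_combining("Jägenstedt"))  # Jagenstedt
print(strip_combining("Aurélio"))     # Aurelio
print(strip_combining("Łuksza"))      # Łuksza
```

Note that Ł survives: it has no canonical decomposition, so there is no combining mark to remove. The same holds for ß.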
Tim, that’s true, but the subversion bug seems to be specifically about HFS+ storing filenames in a decomposed form, so characters which can’t be decomposed should be safe. Also, you’ll never be able to automatically latinize e.g. Chinese, so touching as little as possible seems wise here.
Karl: to date, everybody has provided us with a name written in Latin (or possibly extended Latin) characters, so we would record Mr. Yamada.
Yes, svn handles utf-8. Unfortunately, that is not sufficient. Jägenstedt can be represented as either J\xc3\xa4genstedt or Ja\xcc\x88genstedt in utf-8. Windows and Linux use the former form, Mac uses the latter.
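The two byte sequences can be reproduced with Python’s `unicodedata` module: the first is the composed form (NFC), the second the decomposed form (NFD). The helper name `utf8_forms` is made up for this sketch.

```python
import unicodedata

def utf8_forms(name: str) -> tuple[bytes, bytes]:
    """Return the UTF-8 bytes of the composed (NFC) and decomposed (NFD) forms."""
    nfc = unicodedata.normalize("NFC", name)
    nfd = unicodedata.normalize("NFD", name)
    return nfc.encode("utf-8"), nfd.encode("utf-8")

composed, decomposed = utf8_forms("J\u00e4genstedt")
print(composed)    # b'J\xc3\xa4genstedt'
print(decomposed)  # b'Ja\xcc\x88genstedt'
```

Both byte strings are valid UTF-8 for the same name, but they do not compare equal byte for byte, which is how the same file name can look like two different files.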
This has been done many times before; there’s no need to maintain your own tables. iconv can transliterate based on locale, which is normally a misfeature but is just what you want in this case:
Make sure you have the required locales installed (on Debian via dpkg-reconfigure locales, on Ubuntu via locale-gen localename); otherwise you just get question marks.
ICU has a truly impressive number of conversions available:
Seeing it as “Jaegenstedt” would surely make me at least a little annoyed.
The same applies to Finnish names. (Even though on Finnish passports, the ASCII-ization for the optically readable part is transliterated the German way—presumably due to Germans having gotten their way in a standards committee somewhere.)