It’s just data

Telex Digraph Mappings

Aurélio, Küng, Stärk, Uña, Łuksza: these are but a few of the names of contributors to the ASF.  Names which contain non-ASCII characters.  Characters that Subversion doesn’t deal with consistently between Mac and other platforms.

I can map these names (as well as a few others), albeit in a lossy manner, to Subversion-safe file names using the following JavaScript:

name=name.replace(/\u00e4|a\u0308/g,'ae');
name=name.replace(/\u00e5|a\u030a/g,'aa');
name=name.replace(/\u00e7|c\u0327/g,'c');
name=name.replace(/\u00e9|e\u0301/g,'e');
name=name.replace(/\u00f1|n\u0303/g,'ny');
name=name.replace(/\u00f6|o\u0308/g,'oe');
name=name.replace(/\u00f8/g,'o');
name=name.replace(/\u00fc|u\u0308/g,'ue');
name=name.replace(/\u0141/g,'L');

But at this point, it occurs to me that such a set of mappings must have been done before.  The Wikipedia entry for Umlaut indicates that there is such a set of rules for Telex devices, but I have been unable to locate these rules.  Anybody have a pointer?


Might you happen to be looking for Text::Unidecode?

Posted by Aristotle Pagaltzis at

Or this?

Posted by Jacques Distler at

I doubt that there is a worldwide authoritative mapping. A cursory search on the ITU-T website also doesn’t show anything in this regard. Transcription of umlauts and other non-ASCII characters depends on language, culture, nationality and time. There are too many different systems to account for them all. The German “rules” as described in the Duden are “ä” → “ae”, “ö” → “oe”, “ü” → “ue”, “ß” → “ss” and, where that could cause a misunderstanding, “ß” → “sz”. Other languages have different rules, because in those languages the characters that look like German umlauts aren’t umlauts but diaereses or simply independent vowel sounds.

Fun fact: in official German documents, transcribing people’s names is strongly discouraged because of potential ambiguity. This is one of the minor reasons that the capital sharp s was encoded in Unicode some time ago.

The politically correct thing to do would be to use the appropriate rule from the person’s national alphabet for transcription, although asking everyone about their name is very tedious work. The sane thing to do is to fix Subversion’s Unicode problems. Until then, I’d escape non-ASCII characters so that they can safely be transformed back. I like TeX’s notation for that since it’s still easily readable.
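A minimal sketch of reversible escaping in JavaScript. Note that it swaps in percent-encoding instead of TeX notation, because the built-in encodeURIComponent and decodeURIComponent already round-trip. The helper names escapeName and unescapeName are hypothetical, and whether "%" is convenient in Subversion file names is an assumption that would need checking.

```javascript
// Hypothetical helpers: reversible, ASCII-only escaping of a name.
function escapeName(name) {
  // Normalize to NFC first so precomposed and decomposed input
  // produce the same escaped string.
  return encodeURIComponent(name.normalize('NFC'));
}

function unescapeName(escaped) {
  // Inverse of escapeName; returns the NFC form of the original name.
  return decodeURIComponent(escaped);
}

console.log(escapeName('K\u00fcng'));             // K%C3%BCng
console.log(escapeName('Ku\u0308ng'));            // K%C3%BCng (decomposed input)
console.log(unescapeName('K%C3%BCng') === 'K\u00fcng'); // true
```

Unlike the lossy replace() calls, nothing is thrown away here, at the cost of less readable file names.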

Posted by Tim Tepaße at

unidecode("Aurélio, Küng, Stärk, Uña, Łuksza") produces "Aurelio, Kung, Stark, Una, Luksza".  I would have hoped for a mapping to digraphs for at least the characters containing umlauts.  At least it does better with unidecode("Tepaße") => "Tepasse".

Meanwhile, clearly I need at least one more mapping:

name=name.replace(/\u00df/g,'ss');

Posted by Sam Ruby at

There can be no single mapping that actually makes sense; it depends on the source language and culture. For example, my surname is Jägenstedt and I always write it in 7-bit ASCII as Jagenstedt. Seeing it as “Jaegenstedt” would surely make me at least a little annoyed. What I’d do is Unicode-normalize it to NFD and then throw away all the combining characters (which would be umlauts, diacritics, etc). What you get is basically the same string “without all the squiggly stuff”. Of course I’d limit it to the characters for which there are compat issues.
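A minimal sketch of that idea in JavaScript, assuming an engine that supports String.prototype.normalize (ES2015 or later). The helper name stripDiacritics is hypothetical, and the sketch strips every mark in the Combining Diacritical Marks block rather than only the ones with compat issues.

```javascript
// Hypothetical helper: decompose, then drop the combining marks.
function stripDiacritics(name) {
  // NFD splits "ä" (U+00E4) into "a" followed by U+0308.
  // U+0300..U+036F is the Combining Diacritical Marks block.
  return name.normalize('NFD').replace(/[\u0300-\u036f]/g, '');
}

console.log(stripDiacritics('J\u00e4genstedt'));  // Jagenstedt (precomposed input)
console.log(stripDiacritics('Ja\u0308genstedt')); // Jagenstedt (decomposed input)
```

Because the input is normalized first, the Mac and non-Mac spellings of a name end up as the same ASCII string.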

Posted by Philip Jägenstedt at

NFD normalization doesn’t get to those characters which can’t be decomposed, like “ß” or “Ł”.
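To illustrate in JavaScript: normalize('NFD') returns these letters unchanged, so a lossy mapping still needs an explicit table for them. This is only a sketch; the table UNDECOMPOSABLE and the function asciizeWithFallback are hypothetical names, and the table covers just a handful of letters.

```javascript
// Hypothetical fallback table for letters with no canonical decomposition.
const UNDECOMPOSABLE = {
  '\u00df': 'ss', // ß
  '\u0141': 'L',  // Ł
  '\u0142': 'l',  // ł
  '\u00f8': 'o',  // ø
  '\u00d8': 'O',  // Ø
};

function asciizeWithFallback(name) {
  return name
    .normalize('NFD')                  // split off combining marks where possible
    .replace(/[\u0300-\u036f]/g, '')   // drop the combining marks
    .replace(/[\u00df\u0141\u0142\u00f8\u00d8]/g, (ch) => UNDECOMPOSABLE[ch]);
}

console.log('\u00df'.normalize('NFD') === '\u00df'); // true: ß is left as-is
console.log('\u0141'.normalize('NFD') === '\u0141'); // true: Ł is left as-is
console.log(asciizeWithFallback('Tepa\u00dfe'));     // Tepasse
console.log(asciizeWithFallback('\u0141uksza'));     // Luksza
```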

Posted by Tim Tepaße at

Tim, that’s true, but the subversion bug seems to be specifically about HFS+ storing filenames in a decomposed form, so characters which can’t be decomposed should be safe. Also, you’ll never be able to automatically latinize e.g. Chinese, so touching as little as possible seems wise here.

Posted by Philip Jägenstedt at

Use punycode. Turning everything into gibberish is fun. It will also prevent people from spoofing someone else’s name with Cyrillic characters.

Posted by anonymous at

What do you do with 山田? (Mr Yamada). I had never tried to put Japanese names into the system. I thought svn was handling utf-8. :( oh well.

Posted by karl dubost at

Karl: to date, everybody has provided us with a name using Latin (or possibly extended Latin) characters, so we would record Mr. Yamada.

Yes, svn handles utf-8.  Unfortunately, that is not sufficient.  Jägenstedt can be represented as either J\xc3\xa4genstedt or Ja\xcc\x88genstedt in utf-8.  Windows and Linux use the former form, Mac uses the latter.
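The two byte sequences can be reproduced with a short JavaScript sketch, assuming a runtime that provides a global TextEncoder (modern browsers and recent Node.js releases). The helper name utf8Hex is hypothetical.

```javascript
// Hypothetical helper: hex dump of a string's UTF-8 encoding.
function utf8Hex(s) {
  return Array.from(new TextEncoder().encode(s))
    .map((b) => b.toString(16).padStart(2, '0'))
    .join(' ');
}

const nfc = 'J\u00e4genstedt'.normalize('NFC'); // precomposed form
const nfd = 'J\u00e4genstedt'.normalize('NFD'); // decomposed form

console.log(utf8Hex(nfc.slice(0, 2))); // 4a c3 a4
console.log(utf8Hex(nfd.slice(0, 3))); // 4a 61 cc 88
console.log(nfc === nfd);                                   // false
console.log(nfc.normalize('NFC') === nfd.normalize('NFC')); // true
```

The strings compare as different until both sides are normalized to the same form, which is the mismatch a file name comparison runs into.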

You can find more information in Unicode Normalization Forms.

Posted by Sam Ruby at

ObMark: Unicode Normalization Form C

Posted by Aristotle Pagaltzis at

asciize functions in: JavaScript, Ruby, Python (string), and Python (unicode); as well as a script to produce the latter forms from the former.

Posted by Sam Ruby at

Sam Ruby: Telex Digraph Mappings

I got stuck on the same thing about three months ago, so this makes me happy (well, not really). ...

Excerpt from yssk22のブックマーク at

This has been done many times before; there’s no need to maintain your own tables. iconv can transliterate based on locale, which is normally a misfeature but just what you want in this case:

% echo Küng Stärk | LANG=de_DE.UTF-8 iconv -t ASCII//TRANSLIT -f UTF-8
Kueng Staerk

Make sure you have the required locales installed (on Debian via dpkg-reconfigure locales, on Ubuntu via locale-gen localename); otherwise you just get question marks.

ICU has a truly impressive number of conversions available:

% uconv -L
Accents-Any Any-Accents Any-Publishing Arabic-Latin Armenian-Latin Bengali-Devanagari Bengali-Gujarati Bengali-Gurmukhi Bengali-Kannada Bengali-Latin Bengali-Malayalam Bengali-Oriya Bengali-Tamil Bengali-Telugu Cyrillic-Latin Devanagari-Bengali Devanagari-Gujarati Devanagari-Gurmukhi Devanagari-Kannada Devanagari-Latin Devanagari-Malayalam Devanagari-Oriya Devanagari-Tamil Devanagari-Telugu Digit-Tone Fullwidth-Halfwidth Georgian-Latin Greek-Latin Greek-Latin/UNGEGN Gujarati-Bengali Gujarati-Devanagari Gujarati-Gurmukhi Gujarati-Kannada Gujarati-Latin Gujarati-Malayalam Gujarati-Oriya Gujarati-Tamil Gujarati-Telugu Gurmukhi-Bengali Gurmukhi-Devanagari Gurmukhi-Gujarati Gurmukhi-Kannada Gurmukhi-Latin Gurmukhi-Malayalam Gurmukhi-Oriya Gurmukhi-Tamil Gurmukhi-Telugu Halfwidth-Fullwidth Han-Latin Hangul-Latin Hebrew-Latin Hiragana-Katakana Hiragana-Latin Jamo-Latin Kannada-Bengali Kannada-Devanagari Kannada-Gujarati Kannada-Gurmukhi Kannada-Latin Kannada-Malayalam Kannada-Oriya Kannada-Tamil Kannada-Telugu Katakana-Hiragana Katakana-Latin Latin-Arabic Latin-Armenian Latin-Bengali Latin-Cyrillic Latin-Devanagari Latin-Georgian Latin-Greek Latin-Greek/UNGEGN Latin-Gujarati Latin-Gurmukhi Latin-Han Latin-Hangul Latin-Hebrew Latin-Hiragana Latin-Jamo Latin-Kannada Latin-Katakana Latin-Malayalam Latin-NumericPinyin Latin-Oriya Latin-Syriac Latin-Tamil Latin-Telugu Latin-Thaana Latin-Thai Malayalam-Bengali Malayalam-Devanagari Malayalam-Gujarati Malayalam-Gurmukhi Malayalam-Kannada Malayalam-Latin Malayalam-Oriya Malayalam-Tamil Malayalam-Telugu NumericPinyin-Latin NumericPinyin-Pinyin Oriya-Bengali Oriya-Devanagari Oriya-Gujarati Oriya-Gurmukhi Oriya-Kannada Oriya-Latin Oriya-Malayalam Oriya-Tamil Oriya-Telugu Pinyin-NumericPinyin Publishing-Any Simplified-Traditional Syriac-Latin Tamil-Bengali Tamil-Devanagari Tamil-Gujarati Tamil-Gurmukhi Tamil-Kannada Tamil-Latin Tamil-Malayalam Tamil-Oriya Tamil-Telugu Telugu-Bengali Telugu-Devanagari Telugu-Gujarati Telugu-Gurmukhi 
Telugu-Kannada Telugu-Latin Telugu-Malayalam Telugu-Oriya Telugu-Tamil Thaana-Latin Thai-Latin Tone-Digit Traditional-Simplified Any-Null Any-Lower Any-Upper Any-Title Any-Name Name-Any Any-Remove Any-Hex/Unicode Any-Hex/Java Any-Hex/C Any-Hex/XML Any-Hex/XML10 Any-Hex/Perl Any-Hex Hex-Any/Unicode Hex-Any/Java Hex-Any/C Hex-Any/XML Hex-Any/XML10 Hex-Any/Perl Hex-Any Any-NFC Any-NFKC Any-NFD Any-NFKD Any-Latin Any-Syriac Any-Greek Any-Greek/UNGEGN Any-Telugu Any-Gurmukhi Any-Cyrillic Any-Hangul Any-Bengali Any-Katakana Any-Arabic Any-Thai Any-Gujarati Any-Malayalam Any-Hiragana Any-Armenian Any-Thaana Any-Han Any-Georgian Any-Oriya Any-Devanagari Any-Hebrew Any-Kannada Any-Tamil

so converting from Cyrillic is easy:

% echo доброй вечер | uconv -x Cyrillic-Latin
dobroj večer

However it doesn’t seem to transliterate the various Latin scripts to ASCII. For completeness, recode and another attempt.

[Why am I always new here, even when my details are remembered?]

Posted by James at

Markus Kuhn’s transtab is linked from [link] (search for "transtab").  It’s in a format suitable for iconv, but it’s easy enough to parse yourself.

Posted by Keith Wansbrough at

Seeing it as “Jaegenstedt” would surely make me at least a little annoyed.

The same applies to Finnish names. (Even though on Finnish passports, the ASCII-ization for the optically readable part is transliterated the German way—presumably due to Germans having gotten their way in a standards committee somewhere.)

Posted by Henri Sivonen at

In Spanish there is also an ü, and it should be mapped to u: Agüero → Aguero. The ü → ue mapping makes sense in German, but not at all in Spanish.
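One way to respect such differences is to select the table by language. The JavaScript below is only a sketch: the RULES object, the language codes used as keys, and the function name asciize are assumptions for illustration, and the tables contain just the lowercase rules quoted in this thread.

```javascript
// Hypothetical per-language tables (lowercase letters only).
const RULES = {
  de: { '\u00e4': 'ae', '\u00f6': 'oe', '\u00fc': 'ue', '\u00df': 'ss' },
  es: { '\u00fc': 'u' },
};

function asciize(name, lang) {
  const table = RULES[lang] || {};
  return name
    .normalize('NFC')                  // so decomposed input matches the table keys
    .replace(/[^\x00-\x7f]/g, (ch) => table[ch] || ch)
    .normalize('NFD')                  // fallback: decompose whatever is left
    .replace(/[\u0300-\u036f]/g, '');  // and drop the combining marks
}

console.log(asciize('K\u00fcng', 'de'));        // Kueng
console.log(asciize('Ag\u00fcero', 'es'));      // Aguero
console.log(asciize('J\u00e4genstedt', 'sv'));  // Jagenstedt (no table, marks stripped)
```

Languages without a table fall back to plain mark stripping, and letters that have neither a rule nor a decomposition pass through unchanged.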

Posted by Roberto Bonvallet at
