G+: Perl 6 has strong support for Unicode

David Coles
Perl 6 has strong support for Unicode. In particular, it's one of the few languages that presents Unicode strings as a sequence of graphemes rather than code points. There's also some funky mathematical notation which, while it may seem strange in the ASCII-centric parts of the world, could plausibly be used by those who use an IME on a day-to-day basis.

One fun fact I learnt: In Unicode, \r\n is a single grapheme.
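
To make the grapheme vs. code point distinction concrete, here is a minimal Python sketch. Python's `len()` counts code points, and the standard library has no grapheme segmentation, so `grapheme_clusters` below is a hypothetical helper that implements only two simplified rules (keep CR LF together, attach combining marks to the preceding character). It is an illustration under those assumptions, not the full Unicode segmentation algorithm and not what Perl 6 does internally.

```python
import unicodedata


def grapheme_clusters(text):
    """Split text into approximate grapheme clusters.

    Hypothetical helper covering only two simplified rules:
      - CR immediately followed by LF stays together as one cluster
      - a combining mark (nonzero canonical combining class) attaches to
        the preceding cluster, unless that cluster is a line break
    Real grapheme segmentation has many more rules than this.
    """
    line_breaks = ("\r", "\n", "\r\n")
    clusters = []
    i = 0
    while i < len(text):
        ch = text[i]
        # Rule 1: never split between CR and LF.
        if ch == "\r" and i + 1 < len(text) and text[i + 1] == "\n":
            clusters.append("\r\n")
            i += 2
            continue
        # Rule 2: glue combining marks onto the previous cluster.
        if clusters and unicodedata.combining(ch) and clusters[-1] not in line_breaks:
            clusters[-1] += ch
        else:
            clusters.append(ch)
        i += 1
    return clusters


# 'e' + COMBINING ACUTE ACCENT, followed by CR LF
decomposed = "e\u0301\r\n"
print(len(decomposed))                     # 4 code points
print(len(grapheme_clusters(decomposed)))  # 2 approximate graphemes
```

A string type that counts graphemes would report the second number as the length of this string.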

Day 7 — Unicode, Perl 6, and You


(+1's) 1
David Coles
Also good reading on the topic is +Matt Giuca's post on "The importance of language-level abstract Unicode strings": https://unspecified.wordpress.com/2012/04/19/the-importance-of-language-level-abstract-unicode-strings/

David Coles
Swift also has fairly modern Unicode support, where each Character in a string is a single Extended Grapheme Cluster.
https://developer.apple.com/library/prerelease/ios/documentation/Swift/Conceptual/Swift_Programming_Language/StringsAndCharacters.html#//apple_ref/doc/uid/TP40014097-CH7-ID293

Matt Giuca
Heh, that is cool. In my blog post I make a couple of arguments against counting graphemes (see "What about combining characters?"), although I wasn't arguing particularly strongly against it.

It's good that Perl 6 seems to count graphemes by default but also offer access to codepoints. That seems ideal. Still, I wonder about performance --- do you have constant time access to graphemes? (In which case, strings would need to be stored as a list of pointers to arrays of codepoints.)

David Coles
Taking a look at http://design.perl6.org/S15.html it appears that Perl 6 stores strings in a special form called NFG (Normalization Form Grapheme). It's like NFC, where graphemes are composed into their precomposed code points where those exist, but graphemes with no precomposed form are also mapped to an internal representation. That way you can look up graphemes quickly, but may need to decompose them if you want the actual code points.
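
As a rough illustration of that idea, here is a toy Python sketch of an NFG-like string. Everything in it is an assumption made for illustration: the class name `GraphemeString`, the use of negative integers as synthetic IDs, and the very simplified clustering rules (CR LF and combining marks only). It is not how any Perl 6 implementation actually stores strings.

```python
import unicodedata


class GraphemeString:
    """Toy NFG-like string: exactly one integer cell per grapheme.

    Non-negative cells are ordinary code points. Negative cells are
    synthetic IDs (an assumption of this sketch) that index a table of
    multi-code-point clusters.
    """

    def __init__(self, text):
        self._synthetics = []  # position -> tuple of code points
        self._index = {}       # tuple of code points -> synthetic ID
        self._cells = []
        # NFC first, so clusters with a precomposed form need no synthetic.
        for cluster in self._clusters(unicodedata.normalize("NFC", text)):
            if len(cluster) == 1:
                self._cells.append(ord(cluster))
            else:
                self._cells.append(self._intern(cluster))

    @staticmethod
    def _clusters(text):
        # Simplified clustering: CR LF stays together, combining marks
        # attach to the preceding non-line-break cluster.
        out, i = [], 0
        while i < len(text):
            ch = text[i]
            if ch == "\r" and text[i + 1:i + 2] == "\n":
                out.append("\r\n")
                i += 2
                continue
            if out and unicodedata.combining(ch) and out[-1] not in ("\r", "\n", "\r\n"):
                out[-1] += ch
            else:
                out.append(ch)
            i += 1
        return out

    def _intern(self, cluster):
        # Reuse an existing synthetic ID, or allocate -1, -2, -3, ...
        key = tuple(ord(c) for c in cluster)
        if key not in self._index:
            self._synthetics.append(key)
            self._index[key] = -len(self._synthetics)
        return self._index[key]

    def __len__(self):
        return len(self._cells)  # length in graphemes

    def grapheme(self, i):
        # Constant-time indexing, plus a table lookup for synthetics.
        cell = self._cells[i]
        if cell >= 0:
            return chr(cell)
        return "".join(chr(cp) for cp in self._synthetics[-cell - 1])

    def codepoints(self):
        # Expand synthetic IDs back into their real code points.
        result = []
        for cell in self._cells:
            result.extend([cell] if cell >= 0 else self._synthetics[-cell - 1])
        return result


s = GraphemeString("e\u0301x\u0301\r\n")
print(len(s))          # number of graphemes
print(s.codepoints())  # code points after NFC composition
```

In this sketch "e" plus a combining acute collapses to the precomposed U+00E9 and needs no table entry, while "x" plus a combining acute (which has no precomposed form) and CR LF each get a synthetic ID. Because a table entry is a tuple of any length, a grapheme made of many code points still occupies a single cell.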

Matt Giuca
I'll have a look later. The first thought that comes to mind (without reading) is that graphemes can be composed of arbitrarily many code points.