RedHanded » String#chars Blessed by Rails Papacy as Fingertips Work to Dip Ruby's Toes in the UTF-8 Waters

String#chars Blessed by Rails Papacy as Fingertips Work to Dip Ruby's Toes in the UTF-8 Waters

by why in cult

Cool, good news, the Fingertips’ ActiveSupport::Multibyte is checked into Rails. This equips every string with a chars method which offers a host of friendly Unicode-aware methods for the string. You know: reverse, length, slice and the like.

 name = 'Claus Müller'
 name.reverse #=> "rell??M sualC" 
 name.length #=> 13

 name.chars.reverse.to_s #=> "rellüM sualC" 
 name.chars.length #=> 12

This is valuable, essential work. Thanks to Thijs, Manfred, Julik and related contributors. Along with Nikolai’s work, we’re on our way.
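
The gap between 13 and 12 in the example above comes from "ü" taking two bytes in UTF-8. Here is a minimal sketch of that difference using only core `String#unpack` and `Array#pack`. It assumes the string holds valid UTF-8, and it is not how ActiveSupport::Multibyte does the job. The variable names are made up for illustration.

```ruby
# Sketch only: assumes the string holds valid UTF-8 bytes.
name = 'Claus Müller'

# 'C*' reads raw bytes, so the two-byte "ü" is counted twice.
byte_count = name.unpack('C*').length     # 13

# 'U*' decodes the UTF-8 bytes into integer code points.
codepoints = name.unpack('U*')
char_count = codepoints.length            # 12

# Reverse code points instead of bytes, then re-encode as UTF-8.
reversed = codepoints.reverse.pack('U*')  # "rellüM sualC"
```

Reversing by code point keeps "ü" in one piece here, but it would still scramble text that uses combining characters, so treat this as an illustration of the problem rather than a fix.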

09 Oct 2006 at 14:03 | 15 comments

Simen

said on 09 Oct 2006 at 15:37

So, is this a Rails-only thing? Cool nonetheless.

kode

said on 09 Oct 2006 at 15:40

fïñällÿ, sömë ümläüts?

FlashHater

said on 09 Oct 2006 at 16:18

It’s ActiveSupport, AKA this-should-be-in-stdlib-but-it’s-not. It’s how the Rails people say “Hey! Look at this! We need it!”, but with working examples to boot. Of course, I don’t like HashWithIndifferentAccess replacing Hash, if that’s still on.

FlashHater

said on 09 Oct 2006 at 16:19

Thugly

said on 09 Oct 2006 at 16:34

This is indeed fantastic. I respect Japanese sensibilities about Unicode, I really do, but seeing as it’s the only viable option for 90% of the world, Unicode is going to have to make its way into Ruby one way or another.

kmeyer

said on 09 Oct 2006 at 16:50

I like this just a tad bit better than a PHP-like str.mb_[name of other method here], although you could probably implement that through method_missing. Does this mean String#chars returns an instance of a “character string” object?

Atnan

said on 09 Oct 2006 at 20:50

kmeyer: Yep.

>> ''.chars
=> #<ActiveSupport::Multibyte::Chars:0x60ccd8 @string="">
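
For anyone wondering what a “character string” object like that might look like, here is a rough sketch of the wrapper idea. The names `CharsSketch` and `chars_sketch` are hypothetical, and this is not the ActiveSupport::Multibyte::Chars code. It just wraps a string and answers a few methods per code point, assuming valid UTF-8.

```ruby
# Hypothetical sketch, not the ActiveSupport::Multibyte implementation.
# Assumes the wrapped string holds valid UTF-8.
class CharsSketch
  def initialize(string)
    @string = string
  end

  # Count code points rather than bytes.
  def length
    @string.unpack('U*').length
  end

  # Reverse by code point and wrap the result so calls can be chained.
  def reverse
    CharsSketch.new(@string.unpack('U*').reverse.pack('U*'))
  end

  # Hand back the plain String.
  def to_s
    @string
  end
end

class String
  # Hypothetical accessor name, chosen so it does not replace String#chars.
  def chars_sketch
    CharsSketch.new(self)
  end
end

name = 'Claus Müller'
name.chars_sketch.length        # 12
name.chars_sketch.reverse.to_s  # "rellüM sualC"
```

Keeping the Unicode-aware methods on a separate object means String itself only gains one method.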

Thijs

said on 10 Oct 2006 at 02:21

Thugly, Unicode 5.0 and UTF-8 is actually a perfectly viable option for 99.9% of the world.

Please take some time to read http://tclab.kaist.ac.kr/~otfried/Mule/unihan.html and http://en.wikipedia.org/wiki/Han_unification

Dr Nic

said on 10 Oct 2006 at 02:23

Does it come with free foreign language lessons? That’d be neat.

Manfred

said on 10 Oct 2006 at 02:26

kmeyer, flooding the String class with mb_* methods or mystifying it with method_missing doesn’t seem like a good option to me.

Manfred

said on 10 Oct 2006 at 02:27

Nic, you can come by the office, we’ll teach you some Dutch (:

fio

said on 10 Oct 2006 at 04:10

Looks like there’s a way without method_missing

http://www.bigbold.com/snippets/posts/show/2786

WWWWolf

said on 10 Oct 2006 at 05:16

Nic: Yep, there are quite a few helpful language lessons included in the documentation, like 文字化け [“changed characters”] being the proper Japanese term for encoding difficulties, and ”Nyt on vuosi 2006 ja skandit ei vieläkään toimi, perkele” [“It’s year 2006 and Scandinavian letters still don’t work, goddamnit”] being basically the most common term for the same phenomenon in Finnish. =)

But all this is just transitional! Chapter 2 of the language lessons includes stuff that can be used after the transition, like ”Oho, kato, perkele, sehän toimii!” [“Oh, look, goddamn it, it actually works!”]

Learning the proper terminology from around the world helps, because there’s nothing, nothing in this world that unites the various nations like the fact that Characters Still Get Funnily Messed Up Every Now And Then, and if and when Unicode finally supports all obscure writing systems to full extent (where’s my Tengwar, darn it), the fact that people can finally communicate.

Transitioning to a new character set is always a great big pain, and I can understand why the Ruby devs were opposed to implementing this: it’s a trivial issue that’s nonetheless a major bore to code. Now that it’s happening, it’s a big relief.

JEG2

said on 11 Oct 2006 at 14:09

Please don’t say that Matz was averse to solving the multilingualization problem. He has, in fact, been working hard on it for years now. He’s in the percentage of the world whose problems are not completely solved by Unicode, so he’s very committed to creating the correct solution. Let’s not make light of his hard work.

WWWWolf

said on 16 Oct 2006 at 12:15

Well, I really appreciate how Matz has worked on this, and seeing UTF-8 finally happen in Ruby is very cool. I’m just saying I understand why they’ve been a little reluctant to get to the job and are kind of late compared to other languages: going to yet another character set when they’ve just got the previous ones working fine. It’s probably been an annoying problem. I’m glad it’s being resolved.

In Finland, we had tons of weird character sets: the “seagull wing scandinavics” based on 7-bit ASCII, IBM CP 850 (?) and 437 conflicts… and then it finally occurred that everyone wanted Latin-1! And there was much rejoicing. And then came the EU and gave us a new currency (€), and then the Language Planning Department says “hey, quit messing up things by using sh and zh, we now have š and ž!” And then the computer guys wake up and realise they’ve just got everything running nicely on Latin-1 and Latin-1 doesn’t have €, š, Š, ž, or Ž! The Boss says “Oh yeah? Well let’s use Latin-15.” The Smart People say “Let’s go to UTF-8 while we are at it.” And UTF-8 is nice, but looks funny when the app somehow expects to see Latin-1… it’s 2006 and the scands are still not working. (And people keep writing “5 euroa” and “shakki”. =)

So I’m just trying to say that I foresee a little bit of childhood diseases with this new cool stuff, that’s all.
