RedHanded » Futurism: Unicode In Ruby

Futurism: Unicode In Ruby

When asked about the future of Unicode in Ruby 1.9/2.0, Matz replied to Ruby-Core with the following laundry list of features he expects in Ruby’s multibyte character support:

characters are represented by single-character strings, so that "abc"[0] returns "a" instead of the Fixnum 97.
all string methods are aware of multibyte characters.
new method String#encoding gives character encoding name (e.g. "utf-8").
new method IO#encoding gives character encoding name for reading data.
new method IO#encoding= sets the character encoding for reading data.
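
As a rough illustration of the list above, here is a sketch that runs on a Ruby where these behaviours exist (1.9 or later). The helper names first_character, encoding_name, and read_as are made up for this example. The sketch also uses set_encoding on a StringIO as a stand-in for the proposed IO#encoding=, which is an assumption about how the idea might be spelled and not something taken from Matz's list.

```ruby
require "stringio"

# Hypothetical helper: the first character comes back as a
# one-character String rather than a number.
def first_character(str)
  str[0]
end

# Hypothetical helper: the encoding name of a string, as a String.
def encoding_name(str)
  str.encoding.name
end

# Hypothetical helper: read raw bytes through an IO-like object after
# telling it which encoding the data is in (stand-in for IO#encoding=).
def read_as(raw_bytes, encoding)
  io = StringIO.new(raw_bytes.dup)
  io.set_encoding(encoding)
  io.read
end

abc = "abc".encode("UTF-8")
puts first_character(abc)   # a
puts encoding_name(abc)     # UTF-8

# String methods count characters, not bytes.
word = "r\u00E9sum\u00E9"
puts word.length            # 6
puts word.bytesize          # 8

# Four characters stored in five bytes.
text = read_as("caf\xC3\xA9".b, "UTF-8")
puts text.length            # 4
```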

A library which emulates this could be built, based on Ruby’s current iconv lib. Anybody want to take a stab at it?

07 Jan 2005 at 13:54 | 10 comments

Simon

said on 07 Jan 2005 at 15:42

Do we have a time frame for the release of Ruby 2? Has Matz finished with the 1.8 release now, I wonder?

Simon

said on 07 Jan 2005 at 15:44

By the way, Why, how is chapter 6 of the poignant guide shaping up? Christmas has been and gone, you know…

why

said on 08 Jan 2005 at 13:11

What exactly can I say to placate you? I can’t have you in despair.

William Morgan

said on 08 Jan 2005 at 14:12

I would.

Simon

said on 09 Jan 2005 at 03:04

How’s about ‘chapter 6 is being uploaded now.’ That’d do it.

David Garamond

said on 09 Jan 2005 at 22:48

What about different charsets? Are Ruby strings going to be stored as Unicode (I assume not)? If not, will Ruby have a pluggable charset handler or some such? Will there be a String#charset? What about String#language? (I vaguely remember that each string instance in Parrot will be tagged with charset, encoding, and language.)

me

said on 10 Jan 2005 at 14:09

Isn’t charset == encoding? Could you elaborate on ruby-talk, maybe?

David Garamond

said on 14 Jan 2005 at 06:05

Charset and encoding are two different concepts. Unicode is a charset (supposedly the be-all and end-all of charsets). Unicode can be encoded in UTF-8, UTF-16, etc.
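
To make the distinction concrete, the sketch below takes a single Unicode code point and shows the bytes it turns into under a few encodings. encoded_bytes is a hypothetical helper written only for this illustration. It leans on Array#pack("U") and String#encode, so it assumes a Ruby with encoding support (1.9 or later).

```ruby
# Hypothetical helper: the bytes of one Unicode code point in the
# given encoding, formatted as space-separated hex.
def encoded_bytes(codepoint, encoding)
  char = [codepoint].pack("U")          # one character, as a UTF-8 string
  char.encode(encoding).bytes.map { |b| format("%02X", b) }.join(" ")
end

# Same character (U+00E9, e with acute accent), different encodings.
puts encoded_bytes(0x00E9, "UTF-8")     # C3 A9
puts encoded_bytes(0x00E9, "UTF-16BE")  # 00 E9
puts encoded_bytes(0x00E9, "UTF-16LE")  # E9 00
```

The code point stays the same in every call; only the byte sequence chosen by the encoding changes.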

Manfred

said on 17 Jan 2006 at 03:12

An encoding implies a character set, so that isn’t really necessary.

mj1531

said on 27 Jan 2006 at 11:07

Just a thought, how about a String#convert_char that takes a block with 3 parameters, the encoding of the original char, the desired target encoding, and the character itself?

The block converts the string one character at a time, each time returning the converted character in the target encoding. The results of this block are concatenated to form the target string. This could be used internally to support the UTF-8 encoding.

However, there’s still the issue of the byte order mark…
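
A minimal sketch of the String#convert_char idea above, written as a standalone method instead of a patch to String. The name convert_char and its argument order are assumptions made for this example, and it needs a Ruby with encoding support (1.9 or later). It sidesteps the byte order mark question by naming an explicit byte order in the target encoding.

```ruby
# Hypothetical sketch: convert a string one character at a time.
# The block receives the source encoding, the target encoding, and the
# character, and must return that character in the target encoding.
def convert_char(str, target)
  target = Encoding.find(target)
  result = String.new.force_encoding(target)
  str.each_char do |ch|
    result << yield(str.encoding, target, ch)
  end
  result
end

source = "h\u00E9"   # "h" followed by U+00E9, stored as UTF-8
converted = convert_char(source, "UTF-16BE") do |from, to, ch|
  ch.encode(to, from)
end

puts converted.encoding.name                             # UTF-16BE
puts converted.bytes.map { |b| format("%02X", b) }.join(" ")  # 00 68 00 E9
```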

Comments are closed for this entry.
