Futurism: Unicode In Ruby
When asked about the future of Unicode in Ruby 1.9/2.0, Matz replied to Ruby-Core with the following laundry list of features he expects in Ruby’s multibyte character support:
- characters are represented by single character strings, so that "abc"[0] returns "a" instead of fixnum 97.
- all string methods are aware of multibyte characters.
- new method String#encoding gives character encoding name (e.g. "utf-8").
- new method IO#encoding gives character encoding name for reading data.
- new method IO#encoding= sets the character encoding for reading data.
A library which emulates this could be built, based on Ruby’s current iconv lib. Anybody want to take a stab at it?
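For comparison, here is a minimal sketch of how the wish list above looks in Ruby 1.9 and later, where most of it eventually shipped. It is an illustration, not a transcript of Matz's design: the names differ slightly from the list (String#encoding returns an Encoding object rather than a bare name, and on IO objects the reading encoding is exposed as IO#external_encoding and IO#set_encoding rather than IO#encoding and IO#encoding=). The sample strings and the temporary file are assumptions made up for the example.

```ruby
require "tempfile"

# "\u00e9" is e-acute; the \u escape guarantees a UTF-8 string
# regardless of the source file's encoding.
word = "h\u00e9llo"

first_char  = "abc"[0]            # a one-character String, not an Integer
second_char = word[1]             # the whole two-byte character
char_count  = word.length         # counts characters
byte_count  = word.bytesize       # counts bytes
enc_name    = word.encoding.name  # name of the string's encoding

# Reading side: an IO carries an encoding that tags the data it reads.
io_enc_name = nil
latin1_text = nil
Tempfile.create("latin1-sample") do |f|
  f.binmode
  f.write("caf\xE9".b)            # "cafe" with e-acute, as ISO-8859-1 bytes
  f.flush
  File.open(f.path, "r") do |io|
    io.set_encoding("ISO-8859-1") # declare what the bytes on disk mean
    io_enc_name = io.external_encoding.name
    latin1_text = io.read
  end
end

# Transcode to UTF-8 so the text can be compared with other UTF-8 strings.
utf8_text = latin1_text.encode("UTF-8")
```

The character count and byte count differ for the sample word because e-acute takes two bytes in UTF-8, which is exactly the distinction the list asks string methods to respect.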
Simon
Do we have a time frame for the release of Ruby 2? Has Matz finished with the 1.8 release now, I wonder?
Simon
By the way, Why, how is chapter 6 of the poignant guide shaping up? Christmas has been and gone, you know…
why
What exactly can I say to placate you? I can’t have you in despair.
William Morgan
I would.
Simon
Hows about ‘chapter 6 is being uploaded now.’ That’d do it.
David Garamond
What about different charsets? Are Ruby strings going to be stored as Unicode (I assume not)? If not, then will Ruby have a pluggable charset handler or some such? Will there be a String#charset? What about String#language? (I vaguely remember that each string instance in Parrot will be tagged with charset, encoding, and language.)
me
Isn’t charset == encoding? Could you elaborate on ruby-talk, maybe?
David Garamond
Charset and encoding are two different concepts. Unicode is a charset (supposedly the be-all and end-all of charsets). Unicode can be encoded in UTF-8, UTF-16, etc.
Manfred
An encoding implies a character set, so that isn’t really necessary.
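The distinction drawn in this exchange can be illustrated with a short sketch in Ruby 1.9 or later (not the Ruby current when these comments were written). One Unicode code point keeps the same number in every encoding, while the bytes that represent it change. The sample character is an arbitrary choice for the example.

```ruby
# One character from the Unicode character set: e-acute, U+00E9.
ch = "\u00e9"

code_point  = ch.ord                          # position in the character set
utf8_bytes  = ch.encode("UTF-8").bytes        # byte sequence in UTF-8
utf16_bytes = ch.encode("UTF-16BE").bytes     # byte sequence in big-endian UTF-16

# The code point is unchanged by re-encoding; only the bytes differ.
same_code_point = ch.encode("UTF-16BE").ord == code_point
round_trips     = ch.encode("UTF-16BE").encode("UTF-8") == ch
```

This also shows Manfred's point: naming the encoding (UTF-8 or UTF-16BE) is enough to know the character set involved is Unicode.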
mj1531
Just a thought: how about a String#convert_char that takes a block with 3 parameters: the encoding of the original char, the desired target encoding, and the character itself?
The block converts the string one character at a time, each time returning the converted character in the target encoding. The results of this block are concatenated to form the target string. This could be used internally to support the UTF-8 encoding.
However, there’s still the issue of the byte order mark…
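A rough sketch of that idea, assuming Ruby 1.9 or later for the encoding-aware String methods. The convert_char helper below is hypothetical and written as a standalone method rather than a real String method; its name and argument order are taken from the comment above, not from any Ruby API. It does nothing about the byte order mark.

```ruby
# Hypothetical helper, not part of Ruby: walks the string one character
# at a time and hands the block the source encoding, the target encoding,
# and the character. The block's results are concatenated.
def convert_char(str, target_name)
  source = str.encoding
  target = Encoding.find(target_name)
  str.each_char.map { |ch| yield(source, target, ch) }.join
end

# "cafe" with e-acute as ISO-8859-1 bytes; dup so the literal is not mutated.
latin1 = "caf\xE9".dup.force_encoding("ISO-8859-1")

# The simplest possible block: delegate each character to String#encode.
utf8 = convert_char(latin1, "UTF-8") { |from, to, ch| ch.encode(to, from) }
```

A block like this one adds nothing over calling String#encode on the whole string; the per-character hook would only pay off when individual characters need custom handling, such as substituting unmappable ones.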