Nikolai's UTF-8 Lib is All Ready
Last week, in the comments, Nikolai Weibull brought up his UTF-8 lib, a lovely creature which meets my own needs much better than what’s already out there. I like it much better than my own efforts. Especially now that he’s had some time to flesh it out.
Namely: It’s small. It’s coded in C. It locks into Ruby’s existing string class. Therefore, it can be efficient with memory and use Ruby’s own regexps.
require 'encoding/character/utf-8'
str = u"hëllö"
str.length #=> 5
str.reverse.length #=> 5
str[/ël/] #=> "ël"
If you’d like to follow development, clone this (git-web.) I’ve also put up a gem (gem install character-encodings --source code.whytheluckystiff.net), but obviously it’s not an official release or anything.
FlashHater
Yey!! Oh wait, I’m an American, and thus have no use for international standards… >.<
Daniel Berger
I guess Tim Bray will have to update his slides for RubyConf 2006 now. :)
oliver
When installing the gem I get the following error on OS X. Any ideas?
Besides, this seems to be what I have been looking for for a long time. :)
13
oliver: Same here (using Ubuntu) :(
psmith
FlashHaterHater
Damn straight. Why do you think so many people in other countries speak English?
why
Okay, the gem is updated to turn off -Werror for now.
psmith
why not
Why not just return what the function should??
At the end of remove_all_combining_dot_above(), either: return decomp_len; or: return 0;
...as appropriate (I believe that with -Werror off, it’ll return 0). Probably the latter, but I’m having trouble parsing the function, so I can’t be sure.
nil
but everyone will speak ruby as the official language of the rubiverse! :)
sporkmonger
How is it at dealing with unicode normalization?
Manfred
Length returns the number of codepoints, not what we would think of as ‘characters’. Using NFC or NFKC doesn’t solve this either because there are a lot of characters which can’t be composed.
The solution for this is ‘grapheme clusters’, described in the unicode standard annex 29. The suggested implementation in the annex covers most of the characters used in everyday life.
It looks like Nicolai didn’t implement this.
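The gap between codepoints and what a reader would call characters can be sketched in present-day Ruby (2.5 or later), whose built-in String counts codepoints and can iterate grapheme clusters. This uses Ruby's core methods, not the interface of Nikolai's library, and the sample strings are assumptions picked for illustration.

```ruby
# "e" + U+0308 COMBINING DIAERESIS has a precomposed form (U+00EB),
# so NFC normalization folds it into a single codepoint.
composable = "e\u0308"

# "q" + U+0303 COMBINING TILDE has no precomposed form,
# so NFC leaves it as two codepoints.
uncomposable = "q\u0303"

composable_nfc_length   = composable.unicode_normalize(:nfc).length
uncomposable_nfc_length = uncomposable.unicode_normalize(:nfc).length

# Grapheme clusters (Unicode Standard Annex #29) treat each as one unit.
composable_clusters   = composable.grapheme_clusters.length
uncomposable_clusters = uncomposable.grapheme_clusters.length

puts "codepoints:     #{composable.length} / #{uncomposable.length}"
puts "after NFC:      #{composable_nfc_length} / #{uncomposable_nfc_length}"
puts "grapheme units: #{composable_clusters} / #{uncomposable_clusters}"
```

Normalization helps only where a precomposed codepoint exists; counting grapheme clusters gives one unit in both cases.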
nweibull
OK, to everyone on OS X, I’ve now updated the require to read require 'encoding/character/utf-8/utf8', so Ruby should be able to figure out the extension for itself.
Second, about the whole -Werror and -W stuff, sorry. For some reason, my compiler (gcc 3.4.6) didn’t report the “non-void without return” error. I’ll have to look into that. It’s good that I didn’t disable -Werror, however, as it was a horrible bug. The code has been fixed now. (Update, it’s the -std=c99 that does it. I have no explanation for this behavior – it seems that the c99 code generator will fill in the return instruction anyway – I’ll upgrade to 4.1 soon.)
About #size: it’s not overridden, so it’ll return the number of bytes in the string. I don’t know if that makes sense at all, but that’s how it currently works. Perhaps an addition of #byte_length makes more sense.
sporkmonger: Good, although there’s no Ruby interface for it yet. I’ll add a method for normalization tomorrow.
Manfred: Ni/k/olai, if you please. And you are definitely right about “grapheme clusters”. It’s certainly something that is worth supporting. However, normalization does cover a lot of the everyday cases, so it’s not like we can’t do without “grapheme clusters” either.
I guess next on the list is to create a rubyforge project so that we can have a mailing list instead of discussing everything here.
oliver
Nikolai: Thanks for doing the OS X bugfix.
I am looking forward to the rubyforge project.
Manfred
Nikolai: Oops, sorry I misspelled your name. I would love to see a mailing list to discuss some things (:
Manfred
Sorry, but I have to comment some more.
Normalizing a string is not enough to “cover a lot of the everyday cases”; there are a lot of characters which can’t be composed. Nikolai’s utf-8 library doesn’t expose normalization yet, so I’m using an alternative library for this example:
Even though everybody would agree that this is one character, slicing between these codepoints will leave us with a different character than we started with. In German for some words the difference between singular and plural form is an Umlaut, if we chop this accent off with a broken slice we significantly change the meaning of the text.
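The slicing hazard can be sketched in present-day Ruby (2.5 or later) with core String methods. This is not the example originally posted; the word pair Mutter/Mütter (mother/mothers) and the variable names are assumptions chosen to illustrate the singular/plural point.

```ruby
# "Mütter" written in decomposed form: "u" followed by U+0308 COMBINING DIAERESIS.
plural = "Mu\u0308tter"

# Slicing by codepoint cuts between the base letter and its accent,
# so the prefix now reads like the singular "Mutter".
naive_prefix = plural[0, 2]

# Slicing by grapheme cluster keeps the accent attached to its letter.
safe_prefix = plural.grapheme_clusters.first(2).join

puts naive_prefix
puts safe_prefix
```

NFC normalization would happen to rescue this particular word, because ü has a precomposed codepoint; sequences without one still need cluster-aware slicing.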
The other problem with this solution is that it modifies the string methods; every Ruby program expects the length method to return the length in bytes. Consider this:
I’ve run into this in the past myself and believe me, it wreaks havoc in Webrick.
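The byte-versus-codepoint mismatch is easy to reproduce in present-day Ruby, where String#length counts codepoints and String#bytesize counts bytes. The sketch below is not the example originally posted; both header helpers are hypothetical and only stand in for code that assumes length means bytes.

```ruby
# Hypothetical helper written under the old assumption that length == bytes.
def naive_content_length(body)
  "Content-Length: #{body.length}"
end

# Hypothetical helper that asks for the byte count explicitly.
def byte_content_length(body)
  "Content-Length: #{body.bytesize}"
end

# "hëllö" with precomposed ë (U+00EB) and ö (U+00F6): 5 codepoints, 7 UTF-8 bytes.
body = "h\u00EBll\u00F6"

naive_header = naive_content_length(body)
byte_header  = byte_content_length(body)

puts naive_header
puts byte_header
```

A header built from the codepoint count understates the body by two bytes here, which is the kind of breakage a server would trip over.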
nweibull
Manfred: There’s always #size, which remains unchanged. But please do come with a suggestion that makes both lengths easily accessible (and perhaps the third, the number of grapheme clusters…). I agree that overriding methods does have some negative consequences as well. Of course, the methods are overridden on a per-object basis, so in your example above, you, as a developer, should be aware that when you say that str is a UTF-8-encoded string, #length will not return the number of bytes in str, but rather the number of codepoints in str.
All: There’s now a Rubyforge project set up for the character-encodings library.
There’s a mailing list, called char-encodings-development, which isn’t active yet. Hopefully Tom will get it set up sooner rather than later :-).
will
nweibull, intuitively #size would return the length in bytes and #length should return the ‘character’ length for me.
Boris K
This is great work. Thank you Nikolai, thank you _why.
Object
MenTaLguY
Nikolai: The problem really isn’t that you’ve got to convince ruby developers that String#length doesn’t mean the same thing as it normally does, it’s that you’ve potentially got to convince every piece of Ruby code ever written.
I can see a change like this between, say, Ruby 1.8 and 1.9. I’ve got a harder time justifying it for a library. Does 1.9 at least have a similar change, by the way?
nweibull
MenTaLguY: Well, it all depends. I again stress that this is on a per-object level, so it won’t change anything unless you explicitly tell it to, which means that changing the meaning of String#length isn’t all that drastic after all. However, often what you mean when you say String#length is String#width, or perhaps you actually want to take grapheme clusters into account, so it’s all rather fuzzy as it is.
MonkeeSage
It seems to me that the Right Thing to Do is to mixin methods to the String class globally rather than on a per-object basis, adding a “u” prefix (“ulength”, “ureverse”, &c).
Pros:There may be other pros/cons, but those are the ones I could think of.
nweibull
MonkeeSage: I don’t think that’s an appropriate solution, as I want to keep it as true to the idea of Strings in Rite as possible, i.e., every string has an encoding. The encoding can be accessed through #encoding, and I do think that it would be worthwhile to be able to change the encoding on the fly for any given string. That way you can easily switch between treating a string as a sequence of bytes, and something more advanced such as UTF-8, UTF-16, or some such. That way I think one solves the problem without changing the interface per se.
Remember, the only reason #length returns the length of the string in bytes is because that’s the way it is currently implemented. Having to live with such a restriction for all eternity seems rather silly.
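For what it is worth, Ruby 1.9 and later did adopt a per-string encoding, and the on-the-fly switching described here can be sketched with the core String#encoding and String#force_encoding methods. The sample string is an assumption, and nothing below is the character-encodings library's own API.

```ruby
# A string literal with \u escapes is tagged as UTF-8.
text = "h\u00EBll\u00F6"
text_encoding = text.encoding        # Encoding::UTF_8
text_length   = text.length          # counted in codepoints

# Relabel a copy as raw bytes; the bytes themselves are untouched.
raw = text.dup.force_encoding(Encoding::BINARY)
raw_length = raw.length              # now counted in bytes

# Relabel again to go back to treating the same bytes as UTF-8.
round_trip = raw.dup.force_encoding(Encoding::UTF_8)

puts "#{text_encoding}: #{text_length}"
puts "#{raw.encoding}: #{raw_length}"
puts round_trip == text
```

force_encoding only changes the label attached to the bytes, which is why the round trip is lossless; converting to a different byte representation is the job of String#encode.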
MonkeeSage
Nikolai: I think aiming for Rite compatibility is probably a good idea. Personally though, I’m still not sure I like the idea of an encoding associated with every string by default, even if it is adopted by Matz for Ruby2 (not much I can do about that though!).
IMO, all strings should be thought of simply as groups of bytes on the basic class level (and offer byte-level access). The where and what of those groupings and so forth should come in at the level of manipulating them, i.e., through methods or subclasses or modules. It would be very easy to have, e.g., EncodedString < String, which you have to initialize with a method that takes an encoding and has methods for manipulating and translating encoded strings and adds an encoding attr.
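A minimal sketch of the subclass idea in present-day Ruby, assuming hypothetical names (EncodedString, declared_encoding, char_length, translate) that do not come from any existing library. The base string stays a plain byte string; the encoding only matters inside the added methods.

```ruby
# Hypothetical subclass: bytes underneath, encoding carried as an attribute.
class EncodedString < String
  attr_reader :declared_encoding

  def initialize(bytes, declared_encoding)
    super(bytes.b)                       # store a binary (byte-level) copy
    @declared_encoding = declared_encoding
  end

  # Length in codepoints, interpreting the bytes in the declared encoding.
  def char_length
    decoded.length
  end

  # Convert the content to another encoding and wrap the result.
  def translate(target)
    EncodedString.new(decoded.encode(target), target)
  end

  private

  # A plain String copy of the bytes, labelled with the declared encoding.
  def decoded
    String.new(self).force_encoding(@declared_encoding)
  end
end

es    = EncodedString.new("h\u00EBll\u00F6", Encoding::UTF_8)
utf16 = es.translate(Encoding::UTF_16LE)

puts "#{es.length} bytes, #{es.char_length} codepoints"
puts "#{utf16.length} bytes, #{utf16.char_length} codepoints"
```

In this sketch the inherited #length keeps its byte meaning, so code that expects bytes is undisturbed, while encoding-aware callers opt in through the extra methods.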
Anyhow, implementation gripes aside, I appreciate the work you’re doing, please keep it up!
why
MonkeeSage: It seems like the distinction between your approach and Nikolai’s is really very minor. Nikolai is storing a byte string underneath it all.
But, yeah, Nikolai’s storming right into the class and overriding all kinds of methods. I don’t know if Matz wants length and size to be different, but it makes pretty good sense to me.
MonkeeSage
I suppose I could always do something like:
:)
MonkeeSage
err… hëllö -> hëllö
MonkeeSage
one more time… hëllö -> hëllö
why
Yeah, see, that’s the spirit, MonkeeSage.
MonkeeSage
This is better…
MonkeeSage
Ok, so now I’m DRY, I’m meta-tacular, and GC friendly (all thanks to redhanded)...
...now if I could just figure out what the heck an eigen is, I could make matrices and vectors and classes out of it…it would be like figuring out the secret recipe of the fluff in the fluffernutter…but that’s for another day.
Nikolai: I’m having trouble building off the latest head (are there still heads and branches in git terminology?). I’m going to join the rubyforge list and post details there.