Mucking With Unicode for 1.8
The idea with this little project is to enhance strings in Ruby 1.8 to support encodings, following Matz's plan, without breaking extensions, and while still allowing raw strings.
For now, you have to specify the encoding when you create the string:
    >> str = utf8("色は匂へど 散りぬるを")
    >> str[1,2]
    => は匂
    >> str[/散(.{2})/u, 1]
    => りぬ
I can’t use wchar, since I’m adding onto the RString class, which stores the raw bytes in RSTRING(str)->ptr. And I’ve got to hook into Ruby’s regexps, can’t ignore that. So, instead, I’ve added an indexed array of character lengths. I’m not suggesting this is the answer, but consider that we have so little out there. When the string initially gets stored, it gets validated against the rules for the encoding and all the character sizes get stored.
    >> require 'wordy'
    >> utf8("ვეპხის ტყაოსანი").index_s
    => [3, 3, 3, 3, 3, 3, 1, 3, 3, 3, 3, 3, 3, 3, 3]
The index_s method gives a list of the byte sizes for each character. I only support UTF-8 presently.
The speed is pretty good. Creating new strings, appending strings, and dup’ing strings all end up generally just as fast as builtin strings. Substrings and slicing don’t compare, though. But not much additional memory is used: one 4-byte index is stored for every 16 characters, so it’s about 20-25% over the raw string.
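To make the layout concrete, here is a minimal Ruby sketch of that scheme; the names build_index and char_byte_offset are mine, not the extension’s, and the real work happens in C:

    # Illustration only: store each character's byte length, plus a byte-offset
    # checkpoint for every 16 characters, so finding character n never requires
    # scanning the whole string -- just a jump and at most 15 additions.
    def build_index(lengths, stride = 16)
      checkpoints = [0]
      lengths.each_slice(stride) do |group|
        checkpoints << checkpoints.last + group.inject(0) { |a, b| a + b }
      end
      checkpoints.pop  # drop the offset past the final group
      checkpoints
    end

    def char_byte_offset(lengths, checkpoints, n, stride = 16)
      offset = checkpoints[n / stride]                          # jump to the nearest checkpoint
      (n - n % stride).upto(n - 1) { |i| offset += lengths[i] } # then walk at most 15 lengths
      offset
    end

    lengths = [3, 3, 3, 3, 3, 3, 1, 3, 3, 3, 3, 3, 3, 3, 3]
    checkpoints = build_index(lengths)
    char_byte_offset(lengths, checkpoints, 7)  # => 19 (byte offset of the 8th character)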
The repository is here. I could use some help finding a replacement for bitcopy, which is like a memcpy with bit offsets. The one I’m using is fast but buggy.
Wait, uh: I’m going to hold off and watch what Nikolai is doing here.
Thijs
There’s already an initiative to get some Unicode support. It seems like they’re trying to achieve the same goal:
http://julik.textdriven.com/svn/tools/rails_plugins/unicode_hacks/
It will be put into Rails core for 1.2:
http://www.ruby-forum.com/topic/72893#new
Qerub
nornagon
+1, Qerub. Make this hack as simple to use as possible.
Julik
So basically you made the 15th Unicode string implementation for Ruby. Without normalisation.
Can’t call it time well spent, although the effort is noble.
Check out ICU4R for something four times as complete (and about 50 times as large).
murphy
interesting!
why
Thijs: This one’s in C.
Qerub: Cool!
Julik: This one’s in C.
murphy
@why: just a thought. I don’t know much about Unicode implementations.
If you store the index of each char, rather than its length, so that index_s in your above example would be:

    [0, 3, 6, 9, 12, 15, 18, 19, 22, 25, 28, 31, 34, 37, 40]

you can jump straight to any character; it seems more natural to me. The old linked-list-versus-pointer-array problem.
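(As an aside, the cumulative form murphy describes falls out of the length list in one line of Ruby; this is just an illustration, not code from either library:)

    lengths = [3, 3, 3, 3, 3, 3, 1, 3, 3, 3, 3, 3, 3, 3, 3]
    # Running totals: the byte offset where each character starts.
    offsets = lengths.inject([0]) { |offs, len| offs << offs.last + len }
    # => [0, 3, 6, 9, 12, 15, 18, 19, 22, 25, 28, 31, 34, 37, 40, 43]
    #    (the final entry is the total byte length)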
why
If you do that, though, then you have to store an unsigned long for every character. So you might as well just store a pointer to that character. I went that route at first, just while playing around, but you don’t really lose that much speed with the offsets. On big documents, sure, but wordy_charpos will also compute the offset from the end of the list (or from a supplied pointer from a previous search), whichever is most efficient.
nweibull
“The old linked-list-versus-pointer-array problem” is usually solved by using some form of tree structure with O(lg n) access times instead.
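As a sketch of that tree idea (my illustration, not from any of the libraries discussed): a Fenwick tree over per-character byte lengths answers “byte offset of character n” in O(lg n), and absorbs edits that change a character’s width in O(lg n) as well:

    # Fenwick (binary indexed) tree over per-character byte lengths.
    class OffsetTree
      def initialize(lengths)
        @n = lengths.size
        @tree = Array.new(@n + 1, 0)
        lengths.each_with_index { |len, i| add(i, len) }
      end

      # Add delta to the byte length of character i.
      def add(i, delta)
        i += 1
        while i <= @n
          @tree[i] += delta
          i += i & -i
        end
      end

      # Total bytes occupied by characters 0...n, i.e. where character n starts.
      def byte_offset(n)
        sum = 0
        while n > 0
          sum += @tree[n]
          n -= n & -n
        end
        sum
      end
    end

    tree = OffsetTree.new([3, 3, 3, 3, 3, 3, 1, 3])
    tree.byte_offset(7)  # => 19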
Either way, storing indexes is hardly the way to deal with UTF-8. Then you might as well use ICU4R, as turning the string into UTF-16 should be about as time-consuming in the first run as this index-calculating stuff.
Anyway, I have a library that’s still under heavy development, but it does do most of the commonly used stuff.
You can check it out at my git repository.
What it does is mix methods into the String class, using the u"..." notation _why previously demonstrated. I guess using +"..." is better, though, as it will work in all cases.
Anyway, it doesn’t do anything fancy. It just treats the contents of the String (or RString) as a sequence of bytes. The only thing I might like to add to RString would be a char_len field (or something similarly named) that keeps track of the number of “characters” in the string, not the number of bytes. This would help with index checking.
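In outline, the approach looks something like this (the UTFMethods name appears later in this thread; the two methods shown are hypothetical stand-ins for the real ones):

    # Sketch of the extend-a-module approach: no subclassing, no conversion.
    # The String keeps its raw UTF-8 bytes and merely gains character-aware
    # methods alongside the byte-oriented builtins.
    module UTFMethods
      def char_length
        scan(/./mu).size       # /u makes the regexp engine walk UTF-8 characters
      end

      def char_slice(index, count)
        scan(/./mu)[index, count].join
      end
    end

    def u(str)
      str.extend(UTFMethods)   # mix in per-object, leaving String itself alone
    end

    s = u("色は匂へど")
    s.char_length      # => 5
    s.char_slice(1, 2) # => "は匂"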
What’s important, however, is that it works.
To give credit where credit is due: the actual UTF-8 handling is code heavily based on/borrowed/stolen from the glib library.
The library is currently based on Unicode 4.1, but as Unicode 5.0 got out of beta today (!), there will hopefully be support for 5.0 soon.
why
For anyone who wants to try nweibull’s lib:

    git clone http://git.bitwi.se/unicode.git/

Still, you’re using the same general idea with the decomposition table. Hey, this is great stuff! I was really hoping someone would pop up with something better. Is there something special I need to do to get this to compile?
nweibull
The decomposition stuff is only used for normalization – or am I missing something?
To compile:
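(The exact commands are an assumption on my part; the standard recipe for a Ruby C extension of this era would be something like the following, and given the Rakefile mentioned below, rake may drive the build too:)

    ruby extconf.rb   # generate the Makefile for the extension (unicode.so)
    make              # compile the C extension
    make install      # install into Ruby's site directory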
Hopefully that’ll work. You’ll probably need gcc, or at least a compiler that understands C99. If someone wants to backport it to C89, that’s fine, but I really don’t think C is fun enough to limit myself to the dated rules of C89.
Also, this is taken from an earlier project where the code was used for a text editor I was writing, and since I’ve been in programming mode, not packaging & maintenance mode, the previous week, the library will install as ‘ned/unicode’. I’m thinking that simply calling the library ‘utf8’ or ‘now/utf8’ (I’m trying to put all my libraries in the ‘now’ “namespace”) will do fine.
Running rake will run the rspec-based tests (which are far too few in number).
Anyway, my long-term goal is to make the string processing as generic as possible so that other encodings can be supported (like Oniguruma does today). That way, most of the code that will be needed for Rite will already exist by the time that stuff is going to be implemented.
However, more hints on exactly how Matz wants Strings to work in Ruby 2.0 will be necessary.
Julik
Ok, so now we have about 4-5 implementations in C/Ruby for doing the same thing. Cool.
Julik
Just to count – the Unicode gem, utf8_proc, the two implementations mentioned in this entry, and ICU4R.
I’m all for diversity, but maybe you guys can join forces? ICU4R is in C, _why. And it will most likely be many times faster than other stuff people come up with in a hackish way.
why
nweibull: Okay, I’m getting an error on FreeBSD, some conflict with the index method definition in /usr/include/strings.h. I’ll play with it.
asdf
nweibull
Julik: Here’s a breakdown of the libraries that exist so far:
Yoshida Masato’s Unicode library only does normalization and case conversion.
UTF8Proc seems to do even less, by only doing normalization. I may be looking at an old (0.2) version, though.
The unicode_hacks plugin for RoR does quite a lot, but in a very inefficient way, as it is written in Ruby (a lot of allocation work is done, which can be avoided in C).
ICU4R uses ICU, which is a fantastic Unicode library (probably the most complete there is). The problem is, ICU uses UTF-16 internally, and this is definitely not always what you want.
I can’t say much about _why’s library, but it looks like it can do some nice stuff, even though it’s immature.
About my own library: It layers “flawlessly” over Ruby’s own String class, so there’s no conversion going on, there’s no extra allocation necessary, there’s no speed decrease (or increase, for that matter) for operations that don’t deal with UTF-8, and it does do all the stuff you would like to do. It isn’t a hack. It is, however, a work in progress, and some methods are still not implemented. Also, error checking is missing in many places, so feeding it illegal UTF-8 sequences may blow your Ruby session.
why: Ouch, typical – I use index as a variable name in quite a few places, and it seems like your compiler has issues. Using index as the name of a local variable should be acceptable, but I’ll try to come up with better names for my variables.
asdf: Seeing as how the number of possible code points is limited in Unicode, that day may yet come.
why
nweibull: Okay, after renaming index and strnstr, I’ve got it compiled. The specs don’t run for me, since there’s no Kernel#u. It looks like Kernel#u should be:

    def u(str); str.extend(UTFMethods); end

I really like what you’ve got here. I want to play with it.
sporkmonger
I don’t suppose anyone has an implementation of stringprep/nameprep in Ruby floating around?
nweibull
why: That’s weird. Are you sure unicode.rb is getting loaded? It, in turn, requires unicode.so and sets up the bindings. Kernel#u should be def u(str); str.extend(UTFMethods); end, and it is defined in unicode.rb. I wonder why this isn’t working for you…

Perhaps running the specifications without having the library installed works? Although, the $LOAD_PATH.unshift '..' should make sure that everything works correctly inside the specifications.

Anyway, I’m allocating tomorrow to clean up the file structure and make sure that compilation, testing, and installation work better than they do now.
FlashHater
Julik: _why is writing it, so it’ll be 10x greater than all the other libraries!!! :D
murphy
oh yes, that’s true. didn’t think about that.
I’m glad you are addressing two of the most popular cons about Ruby: bad Unicode support and lack of speed.
hgs
I’ve been reading Lean Software Development: An Agile Toolkit, and it advocates deferring decisions where practical, because things change. It also advocates having multiple implementations of something, because when the groups get together you can breed something wonderful from the diverse gene pool. This does require the groups to talk, though.
Julik
At the moment we want to be a) interoperable with others, b) without nasty subclassing (which is what you have), and c) without compiling stuff.
I think we nailed the compromises pretty well. When your extension is ready we will try to implement a handler for it.
As for the inefficiency – well, do better without C. Manfred will be eager to hear your suggestions. And do better when some strings you process might not be Unicode strings either.
Julik
In all fairness – if you want to flex your C muscle just pick up ICU4R and finish it.
Julik
Besides,
    git clone http://git.bitwi.se/unicode.git/
    Cannot get remote repository information.
    Perhaps git-update-server-info needs to be run there?

    git clone git://git.bitwi.se/unicode.git/
    fatal: unable to connect a socket (Connection refused)
    fetch-pack from 'git://git.bitwi.se/unicode.git/' failed.
nweibull
Julik: If you’re addressing me above: a) remains true, b) isn’t true, c) eh?
There’s no subclassing going on here. All that happens is that the string is extended with a module.
Again, I am not here to replace someone else’s solution or claim that my solution is the last word in character encoding handling.
I’ve already stated why ICU4R isn’t a solution for me. I don’t want my strings decoded, encoded, and re-encoded. I want them to remain a sequence of bytes.
Strange that cg-clone worked. I forgot to chmod post-update.
Anyway, there’s a new repository, which you can clone:
Julik
Nikolai, what I meant by “subclassing” is _why’s method of doing u"bla" (who’s going to do u"bla" on the CGI params and the database and the sockets…); I got somewhat confused. Especially considering I couldn’t look at your extension because of these bizarro git things I’m barely familiar with.
The fact that ICU4R uses separate strings and regexps is a hindrance, but ICU implements a lot of very, very fancy Unicode mechanics (locale-aware sentence boundaries? word boundaries? locale-independent grapheme clusters?). The problem is that making your own will oblige you to implement all (or some) of these by yourself. OTOH, ICU4R is abandoned now, and you can easily rig it into Ruby’s String (which will make it much more usable).
We did the mixing in of methods somewhat differently (making a wrapper around Strings with the same API, but preserving the original string). We also do native regexps (with character offsets) and gsub – basically everything that String does.
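Roughly, the wrapper shape Julik describes could look like this (a sketch with invented names; the real plugin’s API will differ):

    # Sketch of the wrapper approach: the original String is preserved
    # untouched, and a delegating facade answers the character-aware calls.
    require 'delegate'

    class UTF8Wrapper < SimpleDelegator
      def chars
        __getobj__.scan(/./mu)   # split the underlying bytes into UTF-8 characters
      end

      def length
        chars.size
      end

      def [](index, count)
        chars[index, count].join
      end
    end

    w = UTF8Wrapper.new("色は匂へど")
    w.length       # => 5
    w[1, 2]        # => "は匂"
    w.__getobj__   # the original, unmodified String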
All I hope for is we can make a handler for our plugin with your extension when it’s ready. And when it builds and runs on OS X without diffs, for that matter :-)
nweibull
Julik: Dealing with grapheme clusters and various boundaries is of course awesome, so hopefully I or someone else (hint, hint) will write code for that as well.
I don’t own a Mac, and I don’t run FreeBSD or Windows, so someone will have to help me with the OS porting.
Anyway, all I hope now is that I didn’t announce this library too soon.
The current task is to write up specifications so that we can ensure backwards compatibility while supporting Unicode as well.