Closing in on Unicode with Jcode
Patrick Hall has a great article on using the Jcode module for Ruby, which provides more natural support for hacking Unicode strings. He has a few simple unit tests that illustrate failings in the Jcode library and leaves them right there for us to glare at.
    def test_reverse
      s = "Καλημέρα κόσμε!"
      # Expected result when reversing character by character
      expected = "!εμσόκ αρέμηλαΚ"
      assert_equal(expected, s.reverse) # fails: bytes get reversed, not characters
    end

    def test_index
      # String#index isn't Unicode-aware, it's counting bytes
      # there are ways around this, but...
      s = "Καλημέρα κόσμε!"
      assert_equal(0, s.index('Κ')) # passes
      assert_equal(1, s.index('α')) # fails!
      assert_equal(2, s.index('α')) # passes; byte offset 2, i.e. the 3rd byte!
    end
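One of the "ways around this" can be sketched by comparing code points instead of bytes. The helper name `char_index` is made up for this illustration and is not part of Jcode; it only assumes that the strings are valid UTF-8, which `unpack('U*')` decodes into an array of code points.

```ruby
# Hypothetical helper (not part of Jcode): find a substring by character
# position instead of byte position. Assumes both strings are valid UTF-8.
def char_index(str, needle)
  haystack = str.unpack('U*')     # UTF-8 bytes -> array of code points
  target   = needle.unpack('U*')
  return nil if target.empty?
  # Slide a window over the code points and compare it with the target.
  (0..haystack.length - target.length).each do |i|
    return i if haystack[i, target.length] == target
  end
  nil                             # no match (or needle longer than str)
end

s = "Καλημέρα κόσμε!"
char_index(s, 'Κ')   # character position of the first letter
char_index(s, 'α')   # character position, not the byte offset
```

This is a linear scan over code points, so it treats an accented letter stored as a base letter plus combining mark as two characters; it is meant to show the byte/character distinction, not to be a complete answer.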
Sure, we’ll have all the answers in the future, but, for now, I’d say some patches to Jcode are in order. Or, to spirit up some Python mimicry:
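    class UString < String
      # Show u-prefix as in Python
      def inspect; "u#{ super }" end
      # Count multibyte characters (the /u flag makes . match a whole UTF-8 character)
      def length; self.scan(/./mu).length end
      # Reverse the string character by character; wrap the result so it stays a UString
      def reverse; UString.new(self.scan(/./mu).reverse.join) end
    end

    module Kernel
      # Build a UString, expanding U+XXXX escapes into the characters they name
      def u( str )
        UString.new str.gsub(/U\+([0-9a-fA-F]{4})/u){ [$1.hex].pack('U*') }
      end
    end

    str = u"Ruby-語"
    str.length   #=> 6
    str.reverse  #=> u"語-ybuR"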
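The `U+XXXX` expansion inside `u` can be tried on its own. Here is a minimal sketch of just that step, using a standalone method named `expand_codepoints` (a name invented for this example) and the same `pack('U*')` trick to turn a code point number into UTF-8 bytes.

```ruby
# Hypothetical standalone version of the escape expansion done in Kernel#u.
# Replaces every U+XXXX (exactly four hex digits) with the character it names.
def expand_codepoints(str)
  str.gsub(/U\+([0-9a-fA-F]{4})/) { [$1.hex].pack('U*') }
end

plain = expand_codepoints("Ruby-U+8A9E")
plain.unpack('U*').last   # the code point of the final character, 0x8A9E
```

Because the pattern takes exactly four hex digits, characters outside the Basic Multilingual Plane cannot be written this way; a longer escape would need a different pattern.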
Anyway, Patrick’s blog is a great tour through easily digestible tidbits about Unicode. (Thanks, Jonas!)
Patrick Hall
Why, hello Why! Thanks for the linkage.
Chad Fowler was kind enough to point out the cluelessosity of my test_reverse ... I could plead cut & paste idiocy, but instead I’ll just fix it:
Oh dear… I fear I’ve been escaped. Well anyway, it’s fixed in my post now.
PS. What ever happened to the timid foxfaced girl? I have lost sleep, I tell you.
I
Why: Your poignant guide is awesome! I accidentally learned Ruby while reading it, however…
Anyway, in Chap. 5, you conflate the class names WishScanner and MindScanner.
Also, I defeated Dwemthy’s array with 1 rabbit by doing
initially, and whenever I needed a health boost :-D. Yes, you can eat lettuce and poop on yourself for fun and profit! :-D
Rimantas
How about downcase/upcase?
julik
If you are still curious, I am slowly hacking my way through this here:
http://julik.textdriven.com/svn/tools/rails_plugins/unicode_hacks/lib/unicode_hacks.rb
Primarily because I don’t care when that stuff is going to be in the Ruby core. I am developing UTF apps now and I need it to work now. And subclassing and flagging is all broken, because then every programmer on earth who doesn’t speak some non-Latin language will just skim over it and use the Usual String Of Bytes, and instead of producing bad text yourself you will be delegating it to others.
You need a gem for this to work because otherwise I would end up storing the whole Han table in pure Ruby. That’s a lot of ’em characters.