hoodwink.d enhanced
RSS
2.0
XHTML
1.0

RedHanded

Tell Me the Hashing Dangers #

by why in bits
   class Symbol
     def hash; to_s.hash; end
   end

Someone explain why not.

said on 27 Feb 2007 at 12:15

Because one of the points of using a Symbol is that you don’t unnecessarily construct Strings.

said on 27 Feb 2007 at 12:21

I read the title wrong. I thought you wanted someone to tell you the Dashing Hangers… whatever that means…

Then, to make things worse, i thought by ‘hash’ you meant data-structure w/ key-value pairs

Thank goodness irb can help me see what you’re talking about!

But I have no explanation as to why not. I presume that any answer must explain how that number is created in the first place.

said on 27 Feb 2007 at 12:49

Because Symbol already has a hash method (which returns the same as object_id, granted)?

said on 27 Feb 2007 at 12:51

Seems like the problem is that “a”.hash == :a.hash but “a” and :a are not necessarily equivalent.

said on 27 Feb 2007 at 12:53

The current symbol hashing system is crazy and produces an awful distribution but looking up the string and hashing that is pretty expensive. Can we get a better distribution through some magic shuffling like a hash but on a number without having to resort to getting a string for it?

said on 27 Feb 2007 at 12:57

hm, I can only think of collisions of “a” and :a being certain :-(

said on 27 Feb 2007 at 12:58

Evan: Okay. So let’s assume that this is ported to C and no extra string is made.

Peter Cooper: I’m thinking in terms of HashWithIndifferentAccess.

Gaffers:

 >> a = {'c' => 1}
 => {"c"=>1}
 >> b = {'c' => 1}
 => {"c"=>1}
 >> a == b
 => true
 >> a.hash == b.hash
 => false
said on 27 Feb 2007 at 13:10

Creating extra strings… well that could be avoided in the C code possibly adding some table which caches the internal string’s hash value.

So what this really does it ask whether or not we should use the string hashing algorithm or some symbol id based one (from my POV ). Do hash collisions between strings and symbols really matter? What other uses are there for the hash value? Is there a better way to solve sub-optimal hashing of symbols?

Of course, at the end of all this thought, I still would be asking the same question as _why?

said on 27 Feb 2007 at 13:23

Nice. Why not as in Why shouldn’t we do it? Or why not as in Why doesn’t it work already. Need to dump that pesky collision resolver, but Ruby#hash doesn’t get called by [] it seems. DeepInTheC#hash does.

>> class Symbol
>>   def hash; puts "hey!"; self.to_s.hash; end
>> end
=> nil
>> h[:big]
=> nil
>> :big.hash
hey!
=> 842031597

Maybe I’m all off-base.

said on 27 Feb 2007 at 13:54

Given that Symbols are unique by definition, why can’t the object ID be the hash?

said on 27 Feb 2007 at 14:39

As noted earlier, using the object ID results in a poor distribution of hashes—but, this isn’t something to address only at Symbol!

Object#hash itself needs to be fixed: it should compute a hash of the object ID using some hash function, rather than just passing the object ID on through.

said on 27 Feb 2007 at 14:53

Kevin: I wondered that, but then I realized different architectures (or even machines) might give different answers.. and.. they do. On OS X I get :x.hash 180578 .. on Linux I get :x.hash 159938. This could cause problems in inter-machine situations.

said on 27 Feb 2007 at 14:55

Oh, yeah, and not just Object#hash needs to be fixed…


static int
rb_any_hash(a)
    VALUE a;
{
    VALUE hval;

    switch (TYPE(a)) {
      case T_FIXNUM:
      case T_SYMBOL:
        return (int)a;
        break;

      case T_STRING:
        return rb_str_hash(a);
        break;

      default:
        hval = rb_funcall(a, id_hash, 0);
        if (!FIXNUM_P(hval)) {
            hval = rb_funcall(hval, '%', 1, INT2FIX(536870923));
        }
        return (int)FIX2LONG(hval);
    }
}

Poorly distributed hashes make baby Knuth cry.

said on 27 Feb 2007 at 14:59

why not do the opposite?

class String def hash; intern.hash; end end

said on 27 Feb 2007 at 15:06

I thought _why wanted a HashWithIndifferentAccess that ran closer to the metal. But everyone went off talking about distribution optimizations.

WHO IS RIGHT .

said on 27 Feb 2007 at 15:12

it seems like using the object ID would be subject to the same if not more convoluted error as Gaffers mentioned above as in :a and 12412 would hash to the same location.

said on 27 Feb 2007 at 15:22

spork: colliding hashes is acceptable, as long as it’s not a frequent occurrence. The rule is equal objects have the same hash, not hash determines equality.

said on 27 Feb 2007 at 15:59

evan: I don’t know, once we noticed that Object#hash, Fixnum#hash, Symbol#hash, and rb_any_hash actually really, really suck, then I guess those became a slightly more pressing issue.

said on 27 Feb 2007 at 17:17

thanks for pointing out the error in my train of thought kevin.

To follow-up I’m not sure why people are saying that there would then be a poor distribution of hash codes? You are expecting a lot of collisions on :name and ‘name’? It seems like object id’s would be pretty nice for giving an even distribution of hash codes. What am I missing?

1, 2, 3, 4, ... (seem like good hash codes)

said on 27 Feb 2007 at 18:58

spork: The bucket is selected by hash % number_of_buckets; if there is any non-randomness in the distribution of hashes at an interval with common factors to the number of buckets, then you will get increased collisions as hashes map are more likely to map to certain buckets.

Many hash table implementations, Ruby’s included, round their table sizes up to prime numbers in order to minimize this effect, but it can’t be entirely eliminated unless the hashes are random (i.e. they are produced via a good hash function).

said on 28 Feb 2007 at 07:30

Why not? Because it won’t work.

If one wanted HashWithIndifferentAccess at the language level, you would also need :a == ‘a’, because a hash is only a bucket, not an identity. So Gaffers is right.

said on 28 Feb 2007 at 08:54

Thanks rogojin for pointing that out.

And the who makes them eql? will get some slapping from me.

Symbols are symbols and strings are strings, it’s not that hard, is it?

said on 28 Feb 2007 at 17:37

Sebastian Delmont: Why not the opposite?

class String def hash; intern.hash; end end

Because this introduces a space leak as a side effect of calling String#hash.

said on 28 Feb 2007 at 19:14

rogojin: If hashing is the same for both, wouldn’t it make HashWithIndifferentAccess more efficient? Of course, you would need other code to go along with it. It might be missing something here, but that’s how I see it.

said on 01 Mar 2007 at 14:36

rcorsaro: What do you mean by ‘hashing is the same for both’?

said on 01 Mar 2007 at 19:51

gaffers, rogojin:

two objects that have the same hash code don’t have to be equal.. your just getting a hash collision and lookup is slower

said on 02 Mar 2007 at 04:56

If you want symbols to have a better hash algorithm and to also cache them, that’s fine. But don’t make them match strings, because they’re not strings.

said on 02 Mar 2007 at 09:04

Why,

No reason not to , unless you mind the .02% collision rate of String.hash .

Unless there’s something wrong with this code, Symbol.hash dont collide.

<typo:code lang="ruby">
def random_string
  # http://snippets.dzone.com/posts/show/2111
    str = ''
    (rand(10) + 1).times do
       str <<  (i = Kernel.rand(62); i += ((i < 10) ? 48 : ((i < 36) ? 55 : 61 ))).chr 
    end
    str
end

# String 
string_collisions = 0; symbol_collisions = 0;
string_bucket={}; symbol_bucket = {}; 
1000000.times do 
  str = random_string
  if string_bucket[str.hash] && string_bucket[str.hash] != str  
    string_collisions += 1; 
    p "string collision at #{str} with #{string_bucket[str.hash]}";
  end
  if symbol_bucket[str.to_sym.hash] && symbol_bucket[str.to_sym.hash] != str  
    symbol_collisions += 1;
    p "symbol collision at #{str} with #{symbol_bucket[str.to_sym.hash]}"; 
  end
  string_bucket [str.hash] = str
  symbol_bucket[str.to_sym.hash] = str
end

p "String Collisions: #{string_collisions}",  "Symbol Collisions: #{symbol_collisions}" 
p "Combined string and Symbol Collisions: #{(string_bucket.keys &  symbol_bucket.keys).size}" 

(string_bucket.keys &  symbol_bucket.keys).each do |key|
  p "Key #{key} is the same for string #{string_bucket[key]} and symbol #{symbol_bucket[key]}" 
end

</typo:code>
said on 02 Mar 2007 at 13:34

Blake, I should point out that hash collisions are also a function of the size of the hash table. So you should be using the generated keys modulus 1000000 (or the next prime number up, etc) in this case.

said on 02 Mar 2007 at 20:13

MenTaLguY: Do you have a better implementation of rb_any_hash? Improving performance of large hashes is only going to get more important as time goes on.

Peter Cooper:
$ ruby -e 'puts :x.hash'
205628
$ ruby -e 'puts :x.hash'
205628
$ ruby -e ':y; puts :x.hash'
205788
$ ruby -e ':y; puts :x.hash; File.open("dump.out", "w") {|f| f.write(Marshal.dump(:x)) }'
205788
$ ruby -e 'puts Marshal.load(File.read("dump.out")).hash'
205628
said on 03 Mar 2007 at 19:42

class symbol
def hash; (x + to_s + x).hash; end
end

# where x is an obscure unicode character
# in many circumstances a bounded memoization
# of this would be an effective optimization

# i'll have a lot of free time in about a week
# i might try to put together a gem or something
said on 03 Mar 2007 at 19:43

class Symbol
def hash; (x + to_s + x).hash; end
end

# where x is an obscure unicode character
# in many circumstances a bounded memoization
# of this would be an effective optimization

# i'll have a lot of free time in about a week
# i might try to put together a gem or something
said on 05 Mar 2007 at 11:34

Carter, if you’re going to override Symbol#hash, you’d be much, MUCH better off simply running Symbol#object_id through a hash function. But it’s futile anyway, since the hash “computation” for Symbols is hard-coded in rb_any_hash, in C.

Looking at g_int_hash, however, in glib, I’m noticing that they no longer bother to run ints through a has function anymore (IIRC they did in older versions). I wonder whether it was an issue of not being able to find a suitable, yet still reasonably efficient, hash function? That’d kind of surprise me, though…

said on 06 Mar 2007 at 03:44

some hash function algs are here http://www.partow.net/downloads/GeneralHashFunctions_-_Ruby.zip

said on 06 Mar 2007 at 09:31

Thanks MenTaLguY,

I’m afraid I don’t get it. where should I use the generated keys modulus ?

I was going from this collision probability tutorial

It was an eye opener for me that collisions re acceptable. That finally makes sense to me.

I didn’t do any performace tests, but it’s possible that Why’s code, with a small percentage of collisions, could be more efficent than the Symbol.hash method that gets no collisions.

said on 06 Mar 2007 at 10:26

where should I use the generated keys modulus?

Look at the “mod N” in the collision probability tutorial you linked to. N is the number of buckets in the hash table, which is determined from the number of ( key, value ) pairs stored in it. Ruby’s implementation in particular rounds the number of buckets up to the next prime (from a list of primes).

said on 06 Mar 2007 at 10:38

Surprisingly, having just measured it, it looks like in many practical cases, not messing with int/symbol hashes works pretty well. I guess that’s the reason behind the present g_int_hash and rb_any_hash.

said on 16 Mar 2007 at 15:50

This is so weird, there’s another spork around. I wonder if I’m allowed to be the evil twin.

11 Jul 2010 at 20:46

* do fancy stuff in your comment.

PREVIEW PANE