Because one of the points of using a Symbol is that you don’t unnecessarily construct Strings.
roberthahn
said on
27 Feb 2007 at 12:21
I read the title wrong. I thought you wanted someone to tell you the Dashing Hangers… whatever that means…
Then, to make things worse, i thought by ‘hash’ you meant data-structure w/ key-value pairs
Thank goodness irb can help me see what you’re talking about!
But I have no explanation as to why not. I presume that any answer must explain how that number is created in the first place.
Peter Cooper
said on
27 Feb 2007 at 12:49
Because Symbol already has a hash method (which returns the same as object_id, granted)?
Gaffers
said on
27 Feb 2007 at 12:51
Seems like the problem is that “a”.hash == :a.hash but “a” and :a are not necessarily equivalent.
Tim
said on
27 Feb 2007 at 12:53
The current symbol hashing system is crazy and produces an awful distribution but looking up the string and hashing that is pretty expensive. Can we get a better distribution through some magic shuffling like a hash but on a number without having to resort to getting a string for it?
muro
said on
27 Feb 2007 at 12:57
hm, I can only think of collisions of “a” and :a being certain :-(
why
said on
27 Feb 2007 at 12:58
Evan: Okay. So let’s assume that this is ported to C and no extra string is made.
Peter Cooper: I’m thinking in terms of HashWithIndifferentAccess.
Gaffers:
>> a = {'c' => 1}
=> {"c"=>1}
>> b = {'c' => 1}
=> {"c"=>1}
>> a == b
=> true
>> a.hash == b.hash
=> false
24yranib
said on
27 Feb 2007 at 13:10
Creating extra strings… well that could be avoided in the C code possibly adding some table which caches the internal string’s hash value.
So what this really does it ask whether or not we should use the string hashing algorithm or some symbol id based one (from my POV ). Do hash collisions between strings and symbols really matter? What other uses are there for the hash value? Is there a better way to solve sub-optimal hashing of symbols?
Of course, at the end of all this thought, I still would be asking the same question as _why?
evan
said on
27 Feb 2007 at 13:23
Nice. Why not as in Why shouldn’t we do it? Or why not as in Why doesn’t it work already. Need to dump that pesky collision resolver, but Ruby#hash doesn’t get called by [] it seems. DeepInTheC#hash does.
>> class Symbol
>> def hash; puts "hey!"; self.to_s.hash; end
>> end
=> nil
>> h[:big]
=> nil
>> :big.hash
hey!
=> 842031597
Maybe I’m all off-base.
Kevin Ballard
said on
27 Feb 2007 at 13:54
Given that Symbols are unique by definition, why can’t the object ID be the hash?
MenTaLguY
said on
27 Feb 2007 at 14:39
As noted earlier, using the object ID results in a poor distribution of hashes—but, this isn’t something to address only at Symbol!
Object#hash itself needs to be fixed: it should compute a hash of the object ID using some hash function, rather than just passing the object ID on through.
Peter Cooper
said on
27 Feb 2007 at 14:53
Kevin: I wondered that, but then I realized different architectures (or even machines) might give different answers.. and.. they do. On OS X I get :x.hash 180578 .. on Linux I get :x.hash 159938. This could cause problems in inter-machine situations.
MenTaLguY
said on
27 Feb 2007 at 14:55
Oh, yeah, and not justObject#hash needs to be fixed…
static int
rb_any_hash(a)
VALUE a;
{
VALUE hval;
switch (TYPE(a)) {
case T_FIXNUM:
case T_SYMBOL:
return (int)a;
break;
case T_STRING:
return rb_str_hash(a);
break;
default:
hval = rb_funcall(a, id_hash, 0);
if (!FIXNUM_P(hval)) {
hval = rb_funcall(hval, '%', 1, INT2FIX(536870923));
}
return (int)FIX2LONG(hval);
}
}
Poorly distributed hashes make baby Knuth cry.
Sebastian Delmont
said on
27 Feb 2007 at 14:59
why not do the opposite?
class String
def hash; intern.hash; end
end
evan
said on
27 Feb 2007 at 15:06
I thought _why wanted a HashWithIndifferentAccess that ran closer to the metal. But everyone went off talking about distribution optimizations.
WHO IS RIGHT .
spork
said on
27 Feb 2007 at 15:12
it seems like using the object ID would be subject to the same if not more convoluted error as Gaffers mentioned above as in :a and 12412 would hash to the same location.
Kevin Ballard
said on
27 Feb 2007 at 15:22
spork: colliding hashes is acceptable, as long as it’s not a frequent occurrence. The rule is equal objects have the same hash, not hash determines equality.
MenTaLguY
said on
27 Feb 2007 at 15:59
evan: I don’t know, once we noticed that Object#hash, Fixnum#hash, Symbol#hash, and rb_any_hash actually really, really suck, then I guess those became a slightly more pressing issue.
spork
said on
27 Feb 2007 at 17:17
thanks for pointing out the error in my train of thought kevin.
To follow-up I’m not sure why people are saying that there would then be a poor distribution of hash codes? You are expecting a lot of collisions on :name and ‘name’? It seems like object id’s would be pretty nice for giving an even distribution of hash codes. What am I missing?
1, 2, 3, 4, ... (seem like good hash codes)
MenTaLguY
said on
27 Feb 2007 at 18:58
spork: The bucket is selected by hash % number_of_buckets; if there is any non-randomness in the distribution of hashes at an interval with common factors to the number of buckets, then you will get increased collisions as hashes map are more likely to map to certain buckets.
Many hash table implementations, Ruby’s included, round their table sizes up to prime numbers in order to minimize this effect, but it can’t be entirely eliminated unless the hashes are random (i.e. they are produced via a good hash function).
rogojin
said on
28 Feb 2007 at 07:30
Why not? Because it won’t work.
If one wanted HashWithIndifferentAccess at the language level, you would also need :a == ‘a’, because a hash is only a bucket, not an identity. So Gaffers is right.
chris2
said on
28 Feb 2007 at 08:54
Thanks rogojin for pointing that out.
And the who makes them eql? will get some slapping from me.
Symbols are symbols and strings are strings, it’s not that hard, is it?
John Whitley
said on
28 Feb 2007 at 17:37
Sebastian Delmont: Why not the opposite?
class String def hash; intern.hash; end end
Because this introduces a space leak as a side effect of calling String#hash.
rcorsaro
said on
28 Feb 2007 at 19:14
rogojin: If hashing is the same for both, wouldn’t it make HashWithIndifferentAccess more efficient? Of course, you would need other code to go along with it. It might be missing something here, but that’s how I see it.
rogojin
said on
01 Mar 2007 at 14:36
rcorsaro: What do you mean by ‘hashing is the same for both’?
gabe
said on
01 Mar 2007 at 19:51
gaffers, rogojin:
two objects that have the same hash code don’t have to be equal.. your just getting a hash collision and lookup is slower
headius
said on
02 Mar 2007 at 04:56
If you want symbols to have a better hash algorithm and to also cache them, that’s fine. But don’t make them match strings, because they’re not strings.
Blake
said on
02 Mar 2007 at 09:04
Why,
No reason not to , unless you mind the .02% collision rate of String.hash .
Unless there’s something wrong with this code, Symbol.hash dont collide.
<typo:code lang="ruby">
def random_string
# http://snippets.dzone.com/posts/show/2111
str = ''
(rand(10) + 1).times do
str << (i = Kernel.rand(62); i += ((i < 10) ? 48 : ((i < 36) ? 55 : 61 ))).chr
end
str
end
# String
string_collisions = 0; symbol_collisions = 0;
string_bucket={}; symbol_bucket = {};
1000000.times do
str = random_string
if string_bucket[str.hash] && string_bucket[str.hash] != str
string_collisions += 1;
p "string collision at #{str} with #{string_bucket[str.hash]}";
end
if symbol_bucket[str.to_sym.hash] && symbol_bucket[str.to_sym.hash] != str
symbol_collisions += 1;
p "symbol collision at #{str} with #{symbol_bucket[str.to_sym.hash]}";
end
string_bucket [str.hash] = str
symbol_bucket[str.to_sym.hash] = str
end
p "String Collisions: #{string_collisions}", "Symbol Collisions: #{symbol_collisions}"
p "Combined string and Symbol Collisions: #{(string_bucket.keys & symbol_bucket.keys).size}"
(string_bucket.keys & symbol_bucket.keys).each do |key|
p "Key #{key} is the same for string #{string_bucket[key]} and symbol #{symbol_bucket[key]}"
end
</typo:code>
MenTaLguY
said on
02 Mar 2007 at 13:34
Blake, I should point out that hash collisions are also a function of the size of the hash table. So you should be using the generated keys modulus 1000000 (or the next prime number up, etc) in this case.
djur
said on
02 Mar 2007 at 20:13
MenTaLguY: Do you have a better implementation of rb_any_hash? Improving performance of large hashes is only going to get more important as time goes on.
class symbol
def hash; (x + to_s + x).hash; end
end
# where x is an obscure unicode character
# in many circumstances a bounded memoization
# of this would be an effective optimization
# i'll have a lot of free time in about a week
# i might try to put together a gem or something
Carter
said on
03 Mar 2007 at 19:43
class Symbol
def hash; (x + to_s + x).hash; end
end
# where x is an obscure unicode character
# in many circumstances a bounded memoization
# of this would be an effective optimization
# i'll have a lot of free time in about a week
# i might try to put together a gem or something
MenTaLguY
said on
05 Mar 2007 at 11:34
Carter, if you’re going to override Symbol#hash, you’d be much, MUCH better off simply running Symbol#object_id through a hash function. But it’s futile anyway, since the hash “computation” for Symbols is hard-coded in rb_any_hash, in C.
Looking at g_int_hash, however, in glib, I’m noticing that they no longer bother to run ints through a has function anymore (IIRC they did in older versions). I wonder whether it was an issue of not being able to find a suitable, yet still reasonably efficient, hash function? That’d kind of surprise me, though…
teq
said on
06 Mar 2007 at 03:44
some hash function algs are here http://www.partow.net/downloads/GeneralHashFunctions_-_Ruby.zip
Blake
said on
06 Mar 2007 at 09:31
Thanks MenTaLguY,
I’m afraid I don’t get it. where should I use the generated keys modulus ?
It was an eye opener for me that collisions re acceptable. That finally makes sense to me.
I didn’t do any performace tests, but it’s possible that Why’s code, with a small percentage of collisions, could be more efficent than the Symbol.hash method that gets no collisions.
MenTaLguY
said on
06 Mar 2007 at 10:26
where should I use the generated keys modulus?
Look at the “mod N” in the collision probability tutorial you linked to. N is the number of buckets in the hash table, which is determined from the number of ( key, value ) pairs stored in it. Ruby’s implementation in particular rounds the number of buckets up to the next prime (from a list of primes).
MenTaLguY
said on
06 Mar 2007 at 10:38
Surprisingly, having just measured it, it looks like in many practical cases, not messing with int/symbol hashes works pretty well. I guess that’s the reason behind the present g_int_hash and rb_any_hash.
sporkmonger
said on
16 Mar 2007 at 15:50
This is so weird, there’s another spork around. I wonder if I’m allowed to be the evil twin.
Evan
Because one of the points of using a Symbol is that you don’t unnecessarily construct Strings.
roberthahn
I read the title wrong. I thought you wanted someone to tell you the Dashing Hangers… whatever that means…
Then, to make things worse, i thought by ‘hash’ you meant data-structure w/ key-value pairs
Thank goodness irb can help me see what you’re talking about!
But I have no explanation as to why not. I presume that any answer must explain how that number is created in the first place.
Peter Cooper
Because Symbol already has a hash method (which returns the same as object_id, granted)?
Gaffers
Seems like the problem is that “a”.hash == :a.hash but “a” and :a are not necessarily equivalent.
Tim
The current symbol hashing system is crazy and produces an awful distribution but looking up the string and hashing that is pretty expensive. Can we get a better distribution through some magic shuffling like a hash but on a number without having to resort to getting a string for it?
muro
hm, I can only think of collisions of “a” and :a being certain :-(
why
Evan: Okay. So let’s assume that this is ported to C and no extra string is made.
Peter Cooper: I’m thinking in terms of HashWithIndifferentAccess.
Gaffers:
24yranib
Creating extra strings… well that could be avoided in the C code possibly adding some table which caches the internal string’s hash value.
So what this really does it ask whether or not we should use the string hashing algorithm or some symbol id based one (from my POV ). Do hash collisions between strings and symbols really matter? What other uses are there for the hash value? Is there a better way to solve sub-optimal hashing of symbols?
Of course, at the end of all this thought, I still would be asking the same question as _why?
evan
Nice. Why not as in Why shouldn’t we do it? Or why not as in Why doesn’t it work already. Need to dump that pesky collision resolver, but Ruby#hash doesn’t get called by [] it seems. DeepInTheC#hash does.
Maybe I’m all off-base.
Kevin Ballard
Given that Symbols are unique by definition, why can’t the object ID be the hash?
MenTaLguY
As noted earlier, using the object ID results in a poor distribution of hashes—but, this isn’t something to address only at
Symbol
!Object#hash
itself needs to be fixed: it should compute a hash of the object ID using some hash function, rather than just passing the object ID on through.Peter Cooper
Kevin: I wondered that, but then I realized different architectures (or even machines) might give different answers.. and.. they do. On OS X I get :x.hash 180578 .. on Linux I get :x.hash 159938. This could cause problems in inter-machine situations.
MenTaLguY
Oh, yeah, and not just
Object#hash
needs to be fixed…Poorly distributed hashes make baby Knuth cry.
Sebastian Delmont
why not do the opposite?
class String def hash; intern.hash; end end
evan
I thought _why wanted a HashWithIndifferentAccess that ran closer to the metal. But everyone went off talking about distribution optimizations.
WHO IS RIGHT .
spork
it seems like using the object ID would be subject to the same if not more convoluted error as Gaffers mentioned above as in :a and 12412 would hash to the same location.
Kevin Ballard
spork: colliding hashes is acceptable, as long as it’s not a frequent occurrence. The rule is equal objects have the same hash, not hash determines equality.
MenTaLguY
evan: I don’t know, once we noticed that
Object#hash
,Fixnum#hash
,Symbol#hash
, andrb_any_hash
actually really, really suck, then I guess those became a slightly more pressing issue.spork
thanks for pointing out the error in my train of thought kevin.
To follow-up I’m not sure why people are saying that there would then be a poor distribution of hash codes? You are expecting a lot of collisions on :name and ‘name’? It seems like object id’s would be pretty nice for giving an even distribution of hash codes. What am I missing?
1, 2, 3, 4, ... (seem like good hash codes)
MenTaLguY
spork: The bucket is selected by
hash % number_of_buckets
; if there is any non-randomness in the distribution of hashes at an interval with common factors to the number of buckets, then you will get increased collisions as hashes map are more likely to map to certain buckets.Many hash table implementations, Ruby’s included, round their table sizes up to prime numbers in order to minimize this effect, but it can’t be entirely eliminated unless the hashes are random (i.e. they are produced via a good hash function).
rogojin
Why not? Because it won’t work.
If one wanted HashWithIndifferentAccess at the language level, you would also need :a == ‘a’, because a hash is only a bucket, not an identity. So Gaffers is right.
chris2
Thanks rogojin for pointing that out.
And the who makes them eql? will get some slapping from me.
Symbols are symbols and strings are strings, it’s not that hard, is it?
John Whitley
Sebastian Delmont: Why not the opposite?
class String def hash; intern.hash; end end
Because this introduces a space leak as a side effect of calling String#hash.
rcorsaro
rogojin: If hashing is the same for both, wouldn’t it make HashWithIndifferentAccess more efficient? Of course, you would need other code to go along with it. It might be missing something here, but that’s how I see it.
rogojin
rcorsaro: What do you mean by ‘hashing is the same for both’?
gabe
gaffers, rogojin:
two objects that have the same hash code don’t have to be equal.. your just getting a hash collision and lookup is slower
headius
If you want symbols to have a better hash algorithm and to also cache them, that’s fine. But don’t make them match strings, because they’re not strings.
Blake
Why,
No reason not to , unless you mind the .02% collision rate of String.hash .
Unless there’s something wrong with this code, Symbol.hash dont collide.
MenTaLguY
Blake, I should point out that hash collisions are also a function of the size of the hash table. So you should be using the generated keys modulus 1000000 (or the next prime number up, etc) in this case.
djur
MenTaLguY: Do you have a better implementation of rb_any_hash? Improving performance of large hashes is only going to get more important as time goes on.
Peter Cooper:Carter
Carter
MenTaLguY
Carter, if you’re going to override Symbol#hash, you’d be much, MUCH better off simply running Symbol#object_id through a hash function. But it’s futile anyway, since the hash “computation” for Symbols is hard-coded in rb_any_hash, in C.
Looking at g_int_hash, however, in glib, I’m noticing that they no longer bother to run ints through a has function anymore (IIRC they did in older versions). I wonder whether it was an issue of not being able to find a suitable, yet still reasonably efficient, hash function? That’d kind of surprise me, though…
teq
some hash function algs are here http://www.partow.net/downloads/GeneralHashFunctions_-_Ruby.zip
Blake
Thanks MenTaLguY,
I’m afraid I don’t get it. where should I use the generated keys modulus ?
I was going from this collision probability tutorial
It was an eye opener for me that collisions re acceptable. That finally makes sense to me.
I didn’t do any performace tests, but it’s possible that Why’s code, with a small percentage of collisions, could be more efficent than the Symbol.hash method that gets no collisions.
MenTaLguY
Look at the “mod N” in the collision probability tutorial you linked to. N is the number of buckets in the hash table, which is determined from the number of
( key, value )
pairs stored in it. Ruby’s implementation in particular rounds the number of buckets up to the next prime (from a list of primes).MenTaLguY
Surprisingly, having just measured it, it looks like in many practical cases, not messing with int/symbol hashes works pretty well. I guess that’s the reason behind the present
g_int_hash
andrb_any_hash
.sporkmonger
This is so weird, there’s another spork around. I wonder if I’m allowed to be the evil twin.