What is the simplest hash technique you would use for
storing file handles in a hash table?
Is this a rhetorical question?
-----Original Message-----
From: xxxxx@lists.osr.com
[mailto:xxxxx@lists.osr.com] On Behalf Of Rufoo
Sent: Tuesday, January 03, 2006 8:52 PM
To: Windows File Systems Devs Interest List
Subject: [ntfsd] Hash for file handles
Generally, the goal with a good hash function is that you want good
“scatter”. You can probably get away with a reasonably simple hash
algorithm given that handles are uniformly distributed. The low-order
bits are not used (alignment), so shift to get rid of them and then do
modulo arithmetic on what remains.
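For instance, here is a minimal sketch of that shift-and-modulo idea
(the bucket count of 61 and the names are mine, purely for illustration):

#include <stdint.h>

#define BUCKET_COUNT 61   /* a small prime */

/* Hash a handle: drop the low alignment bits, then take the remainder. */
static unsigned int hash_handle(void *handle)
{
    uintptr_t h = (uintptr_t)handle;

    h >>= 2;   /* the low two bits carry no information (alignment) */
    return (unsigned int)(h % BUCKET_COUNT);
}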
From a mathematical standpoint, prime numbers are normally preferable
(the integers modulo a prime form a field, which has nice properties
that produce good “scatter” and work well in linear hashing
configurations), but from a computational standpoint remainder division
has traditionally been viewed as expensive. If you expect to do 10k
lookups per second, then you’ll be concerned about that and may choose
to forgo working over a prime field (and the nice properties that go
along with it).
In the past, in constructing general-purpose hash tables, one approach we
used was to maintain a pre-computed list chosen from the first few
hundred prime numbers. We started with the number of hash buckets set to
the smallest prime in that list. When the number of entries exceeded the
number of hash buckets by a significant factor (e.g., 2x) we’d rebuild
the hash table using the next bigger prime number. Thus, if the number
of entries was very small, the table would be small. As the number of
entries grew, the size of the table grew (although obviously the size of
the table grew logarithmically relative to the number of entries - think
of it as mimicking the scatter of prime numbers as one wanders off to
infinity.) The hash value was actually computed in two parts: the first
part computed a fixed-size value (e.g., 32 bits) from the data itself
and was normally supplied as a function pointer when the hash table was
first initialized. The second part took this fixed-size value and
performed the modulo arithmetic on it to find the actual hash bucket.
Once the correct bucket was identified, we’d walk a doubly linked list
of entries to do full matching.
The advantage of this approach is that it approximates linear hashing,
but doesn’t involve a rebuild of the table on each insert/remove.
That’s important in MY experience because often entries are relatively
short-lived, so you don’t want to be changing the table size frequently.
We only rebuilt the table when it became inefficient in terms of size
(too big) or collisions (too small).
Schemes of this type provide very good scalability, although at the cost
of considerable implementation complexity. However, if you are trying
to do content addressing, this is a good scheme for achieving it.
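To make the shape of that scheme concrete, here is a rough user-mode
sketch; the specific primes, the exact 2x threshold, the lack of error
handling, and all of the names are illustrative, not taken from any
real implementation:

#include <stdint.h>
#include <stdlib.h>

typedef uint32_t (*hash_fn)(const void *key);

struct entry {
    struct entry *next, *prev;   /* doubly linked chain within a bucket */
    const void   *key;
};

struct hash_table {
    struct entry **buckets;
    size_t         bucket_count; /* always one of the primes below */
    size_t         entry_count;
    size_t         prime_index;
    hash_fn        hash;         /* caller-supplied 32-bit hash routine */
};

/* Pre-computed primes to use as table sizes; extend as needed. */
static const size_t primes[] = { 17, 37, 79, 163, 331, 673, 1361, 2729 };

static void hash_init(struct hash_table *t, hash_fn fn)
{
    t->prime_index = 0;
    t->bucket_count = primes[0];
    t->buckets = calloc(t->bucket_count, sizeof(*t->buckets));
    t->entry_count = 0;
    t->hash = fn;
}

/* Second stage of the hash: modulo a prime to pick a bucket. */
static size_t bucket_for(const struct hash_table *t, const void *key)
{
    return t->hash(key) % t->bucket_count;
}

/* Grow to the next prime and re-chain every entry. */
static void rebuild(struct hash_table *t)
{
    size_t old_count = t->bucket_count;
    struct entry **old = t->buckets;
    size_t i;

    t->prime_index++;
    t->bucket_count = primes[t->prime_index];
    t->buckets = calloc(t->bucket_count, sizeof(*t->buckets));

    for (i = 0; i < old_count; i++) {
        struct entry *e = old[i];
        while (e != NULL) {
            struct entry *next = e->next;
            size_t b = bucket_for(t, e->key);
            e->prev = NULL;
            e->next = t->buckets[b];
            if (t->buckets[b] != NULL)
                t->buckets[b]->prev = e;
            t->buckets[b] = e;
            e = next;
        }
    }
    free(old);
}

void hash_insert(struct hash_table *t, struct entry *e)
{
    size_t b;

    /* Rebuild when entries outnumber buckets ~2:1 and a bigger prime exists. */
    if (t->entry_count > 2 * t->bucket_count &&
        t->prime_index + 1 < sizeof(primes) / sizeof(primes[0]))
        rebuild(t);

    b = bucket_for(t, e->key);
    e->prev = NULL;
    e->next = t->buckets[b];
    if (t->buckets[b] != NULL)
        t->buckets[b]->prev = e;
    t->buckets[b] = e;
    t->entry_count++;
}

Lookup hashes the key the same way, walks the chain in the chosen
bucket, and does a full compare on each entry.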
The last time I implemented a hash table I was more concerned about
performance and knew the size of my data set was limited to a few
thousand entries - I didn’t want us to go rebuilding hash tables in the
middle of a benchmark (it would drag down the numbers), so I chose a
fixed-size hash table, with the restriction that its size had to be a
power of two.
I started with 256 entries.
*I* was hashing pointer values in memory and concluded that the upper
and lower bytes were the pieces with the LEAST variable data. So I took
the 2nd and 3rd bytes, XORed them together, and used the result to
identify the bucket. While not perfect, it is very fast to compute and
does a plausible job (I’ve looked, and it gives a good scatter, which is
all I can ask for.)
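In code, that pointer hash is about as small as it gets (names are
illustrative):

#include <stdint.h>

#define BUCKET_COUNT 256

/* XOR the 2nd and 3rd bytes of the pointer to pick one of 256 buckets. */
static unsigned int hash_pointer(const void *p)
{
    uintptr_t v = (uintptr_t)p;
    uint8_t b1 = (uint8_t)(v >> 8);    /* 2nd byte */
    uint8_t b2 = (uint8_t)(v >> 16);   /* 3rd byte */

    return (unsigned int)(b1 ^ b2);    /* already in 0..BUCKET_COUNT-1 */
}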
But you say you are hashing handles, and by that I assume you mean
“object handles”. They have tremendous regularity in their scatter
(they are offsets in a table and linearly consumed, with reuse) and thus
a fancy hash function is not really going to buy you much. If you go
for a variable-sized table, right-shift two places (strip the low-order
two bits), do modulo arithmetic, and you should be fine. Power-of-two
modulo is a bit mask operation. Otherwise, you’re talking about integer
division. The perf difference is tiny unless you are doing a LOT of
these - but maybe it matters in your environment, in which case you’ll
need to pay attention to how you solve this problem.
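The difference reduces to a mask versus a genuine remainder, e.g.
(illustrative names):

#define POW2_BUCKETS  1024   /* power of two: modulo becomes a bit mask */
#define PRIME_BUCKETS 1031   /* prime: modulo needs an integer division */

static unsigned int bucket_pow2(unsigned long h)
{
    return (unsigned int)(h & (POW2_BUCKETS - 1));
}

static unsigned int bucket_prime(unsigned long h)
{
    return (unsigned int)(h % PRIME_BUCKETS);
}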
So, let’s assume you believe you’ll need to store 100k handles in this
table. If you use a 1024-entry hash table, each bucket should then be
organized as a b-tree (or splay tree - there’s a splay tree package
built into Windows, and they have great locality characteristics.)
That’s simple AND efficient.
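As a rough illustration of that “bucket of trees” shape, here is an
untested kernel-mode sketch using the generic table package
(RtlInitializeGenericTable and friends, which in kernel mode is built on
a splay tree); every name here (HANDLE_ENTRY, HtInitialize, HtInsert,
HtLookup, the pool tag, the bucket count) is mine, not something from
Windows or from the original post:

#include <ntddk.h>

#define BUCKET_COUNT 1024   /* power of two, so the bucket index is a mask */

typedef struct _HANDLE_ENTRY {
    HANDLE Handle;          /* the key */
    PVOID  Context;         /* whatever you associate with the handle */
} HANDLE_ENTRY, *PHANDLE_ENTRY;

static RTL_GENERIC_TABLE Buckets[BUCKET_COUNT];

static RTL_GENERIC_COMPARE_RESULTS NTAPI
HandleCompare(PRTL_GENERIC_TABLE Table, PVOID First, PVOID Second)
{
    PHANDLE_ENTRY a = First;
    PHANDLE_ENTRY b = Second;
    UNREFERENCED_PARAMETER(Table);
    if ((ULONG_PTR)a->Handle < (ULONG_PTR)b->Handle) return GenericLessThan;
    if ((ULONG_PTR)a->Handle > (ULONG_PTR)b->Handle) return GenericGreaterThan;
    return GenericEqual;
}

static PVOID NTAPI
HandleAllocate(PRTL_GENERIC_TABLE Table, CLONG ByteSize)
{
    UNREFERENCED_PARAMETER(Table);
    return ExAllocatePoolWithTag(NonPagedPool, ByteSize, 'bTsH');
}

static VOID NTAPI
HandleFree(PRTL_GENERIC_TABLE Table, PVOID Buffer)
{
    UNREFERENCED_PARAMETER(Table);
    ExFreePoolWithTag(Buffer, 'bTsH');
}

/* Strip the two low (alignment) bits, then mask down to a bucket index. */
static ULONG BucketFor(HANDLE h)
{
    return (ULONG)(((ULONG_PTR)h >> 2) & (BUCKET_COUNT - 1));
}

VOID HtInitialize(VOID)
{
    ULONG i;
    for (i = 0; i < BUCKET_COUNT; i++) {
        RtlInitializeGenericTable(&Buckets[i], HandleCompare,
                                  HandleAllocate, HandleFree, NULL);
    }
}

/* Insert: the generic table copies the entry into its own storage. */
PHANDLE_ENTRY HtInsert(HANDLE h, PVOID context)
{
    HANDLE_ENTRY entry;
    BOOLEAN isNew;
    entry.Handle = h;
    entry.Context = context;
    return RtlInsertElementGenericTable(&Buckets[BucketFor(h)], &entry,
                                        sizeof(entry), &isNew);
}

/* Hash to a bucket, then let the tree in that bucket do the full matching. */
PHANDLE_ENTRY HtLookup(HANDLE h)
{
    HANDLE_ENTRY key;
    key.Handle = h;
    key.Context = NULL;
    return RtlLookupElementGenericTable(&Buckets[BucketFor(h)], &key);
}

The point is simply the two levels: a cheap hash to pick a bucket, then
an O(log n) structure within the bucket.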
If you assume you are hashing up to 1024 handles, choose 1031 (the next
prime - see http://www.walter-fendt.de/m14e/primes.htm for a nice prime
number table that goes up as high as you are likely to need in your
lifetime.) Then read Knuth about hash tables (pick the first entry,
then if that’s not your entry, look at the 2nd in the series, etc.)
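In case it helps, here is a bare-bones open-addressing sketch of that
“first entry, then the next in the series” probing, with a fixed
1031-slot table; the names and the lack of deletion handling are my
simplifications:

#include <stddef.h>
#include <stdint.h>

#define TABLE_SIZE 1031            /* the next prime above 1024 */

static void *slots[TABLE_SIZE];    /* NULL means the slot is empty */

static size_t first_probe(void *handle)
{
    return (size_t)(((uintptr_t)handle >> 2) % TABLE_SIZE);
}

int table_insert(void *handle)
{
    size_t i, start = first_probe(handle);

    for (i = 0; i < TABLE_SIZE; i++) {
        size_t s = (start + i) % TABLE_SIZE;
        if (slots[s] == NULL || slots[s] == handle) {
            slots[s] = handle;
            return 1;              /* inserted (or already present) */
        }
    }
    return 0;                      /* table is full */
}

int table_contains(void *handle)
{
    size_t i, start = first_probe(handle);

    for (i = 0; i < TABLE_SIZE; i++) {
        size_t s = (start + i) % TABLE_SIZE;
        if (slots[s] == NULL)
            return 0;              /* an empty slot ends the probe series */
        if (slots[s] == handle)
            return 1;
    }
    return 0;
}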
Bottom line: the simplicity of the hash table implementation you use
depends more on the size of the data set being hashed than anything
else. If it is a small set, use a flat array. As it gets bigger, use a
flat array of linked lists. As it grows beyond that, use the
O(log N) properties of b-trees combined with that flat array. If you
just want a simple hash computation, use modulo arithmetic and choose a
prime number. That prime number corresponds to the number of entries in
your array.
A Google search turns up tons of information about this. For example:
http://www.concentric.net/~Ttwang/tech/sorthash.htm talks about linear
hashing (and it doesn’t get simpler.) Another says much the same
thing: http://www.gmonline.demon.co.uk/cscene/CS5/CS5-02.html (“To
effectively use a hash table you need to know approximately how many
data items you will have.”)
Hashing seems “simple” conceptually but it is often difficult to balance
between common cases (small hash tables) and uncommon cases (large hash
tables). My experience is that those uncommon cases are often the more
performance-sensitive ones. The more you know about the size and layout of
your data set, the better the implementation will perform.
Even the most complex hashing implementation shouldn’t take you more
than a few hours to put together - hashing just isn’t tough to
implement. Choosing the right table size and layout is FAR more
important than anything else.
Good luck. This is likely to be one of the simplest issues with which
you’ll deal in file systems land.
Regards,
Tony
Tony Mason
Consulting Partner
OSR Open Systems Resources, Inc.
http://www.osr.com
Looking forward to seeing you at the next OSR File Systems class in
Boston, MA April 24-27, 2006.
And here is someone who claims to have the best hash algorithm; whether
or not that claim is true, the article is good in that it discusses how
to evaluate hash functions and their efficiency:
http://burtleburtle.net/bob/hash/doobs.html
Note that the author is particularly concerned about processor
utilization. Again, this isn’t an issue unless you really are doing a
lot of hashing.
Regards,
Tony
Tony Mason
Consulting Partner
OSR Open Systems Resources, Inc.
http://www.osr.com
Looking forward to seeing you at the next OSR File Systems class in
Boston, MA April 24-27, 2006.
>organized as a b-tree (or splay tree - there’s a splay tree package
>built into Windows, and they have great locality characteristics.)
XP and later also have an AVL tree package in them.
Maxim Shatskih, Windows DDK MVP
StorageCraft Corporation
xxxxx@storagecraft.com
http://www.storagecraft.com