Ticket #61 (new RFC)

Opened 6 years ago

Last modified 4 years ago

Investigate using an external hash library

Reported by: coke Owned by: cotto
Priority: minor Milestone:
Component: library Version:
Severity: low Keywords: hash
Cc: Language:
Patch status: Platform: all

Description

We spend a lot of cycles on our hand-rolled hashing, both in terms of developer effort and CPU.

It would be nice if we could use an off the shelf hash package.

Any package under consideration would have to have a compatible license, and work on our minimal core platforms.

Change History

Changed 5 years ago by jkeenan

  • component changed from none to library

Changed 4 years ago by whiteknight

  • owner set to whiteknight

I set this ticket up as a task for GCI 2010. I'm hoping some intrepid young student can come up with a good list of information and recommendations for us to follow. If we can't make some kind of decision after that, we need to close this old ticket.

Changed 4 years ago by cotto

  • keywords gci added

Changed 4 years ago by whiteknight

  • keywords hash added; gci removed
  • platform set to all
  • owner changed from whiteknight to cotto
  • priority changed from normal to minor

From GCI student Atanas:

I checked some hashing libraries and I chose the following: LIBHASHISH - Build-in key support for char arrays (strings) and uint_{8,16,32} types and support for own (possible complex) key data types - Support rbtree's as collision strategy instead of bucked lists (avoid worst case O(n) scenario) - Dynamic or manual adjustable table size (reordering if proportion entries/table_size is unfavorable) - Build as an static or dynamic library (.a and .so) - Iterator support (equivalent to ruby's hash.each() method) - Thread clean - fine-grained lock mechanisms (mutex locks, reader writer locks, ...) - Bloom filter implementation - Many built-in hash algorithms (from trivial algorithms till cryptographic ones) - Architecture clean - runs on 32bit machines as well as 64bit machines - As lightweight as possible - no bloated code - Makefile test target plus benchmark applications for comparing the different hashing algorithm - Dual licensed under the GNU General Public and BSD License Website is  http://libhashish.sourceforge.net/. I think this is the best hashing library for your project.

So there's that input. I'm going to reassign this RFC to cotto while we figure out if we want to move to a new hashing library and, if so, which one.

Changed 4 years ago by nwellnhof

LSHKIT is for locality sensitive hashing, CMPH for minimal perfect hashing and Mhash for cryptographic hash functions. That's not what we need.

libghthash  http://www.ipd.bth.se/ska/sim_home/libghthash.html would be an example for a useful hash table library.

But I don't think we could gain much by using an external library. If we want to optimize our current implementation, I'd suggest a dead simple hash table with open addressing and linear probing. That would give us at most one dcache miss for practically all reads if we make sure that the fill factor stays low enough by rehashing.

Such a hash table would only require 2 words per entry vs. 4 words per entry for our current chained implementation. So even if we limit the fill factor to 50%, we wouldn't need more memory than now.

We'd need two additional bits to mark occupied and deleted entries. For pointer keys or values we could store them in the low bits of the pointers. For integer keys and values, we'd need an additional array for those flags.

Note: See TracTickets for help on using tickets.