libketama - a consistent hashing algo for memcache clients

RSS
Partager

10 avr. 2007, 17h36m

We wrote ketama to replace how our memcached clients mapped keys to servers. Previously, clients mapped keys->servers like this:

server = serverlist[hash(key)%serverlist.length];

This meant that whenever we added or removed servers from the pool, everything hashed to different servers, which effectively wiped the entire cache. We add (and sometimes remove) servers from the memcached pool often enough to warrant writing this - if your memcached pool never changes, you can probably stop reading now :)

Ketama is an implementation of a consistent hashing algorithm, meaning you can add or remove servers from the memcached pool without causing a complete remap of all keys.

Here's how it works:

* Take your list of servers (eg: 1.2.3.4:11211, 5.6.7.8:11211, 9.8.7.6:11211)
* Hash each server string to several (100-200) unsigned ints
* Conceptually, these numbers are placed on a circle called the continuum. (imagine a clock face that goes from 0 to 2^32)
* Each number links to the server it was hashed from, so servers appear at several points on the continuum, by each of the numbers they hashed to.
* To map a key->server, hash your key to a single unsigned int, and find the next biggest number on the continuum. The server linked to that number is the correct server for that key.
* If you hash your key to a value near 2^32 and there are no points on the continuum greater than your hash, return the first server in the continuum.

If you then add or remove a server from the list, only a small proportion of keys end up mapping to different servers.


The majority of the code is a C library (libketama) and a php4 extension that wraps it. I've also included a class from our Java client. (Java Collections makes it rather easy). We use a single-server memcache client wrapped with a native php class to make it multi-server capable, so we just replaced the hashing method with a ketama_find_server call. (should be easy enough to plug this into libmemcache if need be)

http://static.last.fm/rj/ketama.tar.bz2
http://static.last.fm/ketama/ketama-0.1.1.tar.bz2
svn://svn.audioscrobbler.net/misc/ketama/

We've been using this in production for all our php installs and java services at Last.fm for around 10 days now. We deployed it just in time to smooth over moving loads of webservers between datacenters.

Commentaires

  • Russ

    FWIW, the references for this are: http://www.akamai.com/dl/technical_publications/ConsistenHashingandRandomTreesDistributedCachingprotocolsforrelievingHotSpotsontheworldwideweb.pdf and http://www8.org/w8-papers/2a-webserver/caching/paper2.html

    10 avr. 2007, 18h18m
  • RJ

    I should also reference some perl code (that i think came from Brad) which i've lost all trace of :/

    10 avr. 2007, 18h28m
  • kryton02

    I've never heard of consistent hashing before. What a fantastic idea

    11 avr. 2007, 0h24m
  • muesli

    Please note: I just updated the tarball, so first of all please redownload it from here: http://static.last.fm/ketama/ketama-0.1.1.tar.bz2 Installation ============ * libketama (the general purpose C library) $ cd libketama $ make $ su -c make install This will compile libketama and install it to the default prefix /usr/local. You can change the prefix by editing the PREFIX variable in 'Makefile'. * php_ketama (PHP4 extension that wraps libketama and therefore depends on it) $ cd php-4.4.x/ext $ ln -s /your/ketama/php_ketama ketama $ cd .. $ rm -Rf autom4te.cache $ ./buildconf --force $ ./configure --all_your_configure_options --with-ketama[=/your/ketama/prefix] $ make $ su -c make install Don't forget you might have to restart your httpd! cheers, muesli

    11 avr. 2007, 5h05m
  • sickdm

    Kickass, you hackers! :) -Anthony @ hype machine

    15 avr. 2007, 23h40m
  • rschuil

    First of all: great work! I've been playing with Chord and Kademlia in the past and loved to see your library for memcached. I suggest that you make it a project on SF or so, so people can join to continue it's development. Ketama is also nice to use in combination with MySQL slaves. Sure, you can do load balancing using e.g. LVS, but as the number of slaves grow, your query cache efficiency drops (query cache entries are not shared between mysql slaves. It gets less and less likely that a specific query was already in that slave's cache.) You could hash the SELECT query to find your preferred MySQL slave to connect to. All requests for the same query would be routed to the same MySQL slave, thus resulting in a more efficient use of the query cache. If a slave dies, it will pick the next greater value from the continuum and find another slave to connect to. Just an idea :)

    20 jui. 2007, 9h42m
  • RJ

    Yup - you could use a consistent hash like ketama to improve various things. We also considered using it to balance internal webservice requests. I will stick it on SF when i get a chance. In the meantime, i will happily review and commit patches against our public SVN of the project: svn://svn.audioscrobbler.net/misc/ketama/

    20 jui. 2007, 11h06m
  • rschuil

    Just wondering: it looks to me that the extension currently returns only a single node. What if the node is unreachable? How can you failover to another node? I guess that you'd have to increment server->point by one and call ketama_get_server() again to find the next node on the continuum?

    23 jui. 2007, 9h40m
  • RJ

    Yeah that's the right approach, but not implemented - we don't do failover between nodes because of the possibility of stale cached data, we would just hit the DB if a node is unresponsive. We have a monitoring daemon that checks all the nodes and pushes out updated ketama.servers files to all the webnodes should a memcache server go down.

    23 jui. 2007, 10h30m
  • rschuil

    we don't do failover between nodes because of the possibility of stale cached data If you push out an updated ketama.servers, it would have the same effect and thus the same problem, wouldn't it? It would be nice if ketama_get_servers would return e.g. three candidate nodes instead of one. If one doesn't respond, simply try the next one and the next one. If all three nodes fail to connect, hit the db.

    23 jui. 2007, 10h59m
  • RJ

    If you are centrally controlling when to add and remove servers from the pool, you can do a flush_all on servers that you re-add, to avoid the stale data issue. It should be easy enough to pass get_servers a parameter indicating the number of candidates you want, and have it return an array.

    23 jui. 2007, 11h13m
  • ludvigericson

    Hello RJ, just would like you to know this, I've got a fix for the bug you mention in ketama/TODO: * Invalid sever definition file causes ketama_roll to crash. And we're developing a Python extension wrapped around libketama. You'll be hearing from us when it's ready for release, along with aforementioned bug-fix. Thank you for a great library.

    25 jui. 2007, 15h56m
  • RJ

    Great - if you have patches against the svn feel free to send them over. When your python lib is done i will gladly add it to the ketama page http://www.audioscrobbler.net/development/ketama/ Currently top of my ketama wishlist is adding ketama support to http://pecl.php.net/package/memcache - hopefully i'll get a chance to look at that in the next few weeks.

    25 jui. 2007, 16h01m
  • rschuil

    Hi RJ, I have trouble doing a checkout from your subversion repository. Using svn:// my client claims that your server does not respond, http:// doesn't seem to be supported by your server. Could you please verify that it is working externally?

    27 jui. 2007, 8h16m
  • rschuil

    Grmbl. From a different location the checkout does work - port 3690 must be blocked by the firewall here.

    27 jui. 2007, 8h21m
  • rschuil

    FYI: Just compiled libketama with FLV-1 hash instead of MD5. It is about twice as fast now. I'll contribute my patch to the repository as soon as I'm done testing

    27 jui. 2007, 13h49m
  • rschuil

    I analyzed the output of ketama_test for both hash functions: Number of duplicate keys MD5: 141 FNV-1a: 4 Execution time MD5: 0m2.990s FNV-1a: 0m1.290s Distribution among servers is the same, if not slightly better.

    27 jui. 2007, 15h03m
  • RJ

    Cool :) We would have a hard time changing, as we'd need it working and tested in java too.. It could certainly be made an option in libketama tho.

    27 jui. 2007, 15h21m
  • ludvigericson

    I'd say LOOKUP3 looks like a more proper algorithm to use, MD5 is of course a bit shaky since it's made for _secure_ hashing in mind, not diversity & speed. Anyway, I'd say it's better if the Java class could utilize libketama.so as well, so as to tie it all down to one place.

    29 jui. 2007, 18h09m
  • RJ

    Yeah i just used md5 because it was convenient.. I've put the patch from rschuil into a patches folder in SVN (svn.audioscrobbler.net) for now. My initial testing shows it's quite a bit faster - a 3.7s test reduced to 2.0s when using FNV. Distribution looks about the same. Will post an update once i have tidied up the lib and merged in a couple more changes i have lying around.

    1 août 2007, 11h28m
  • RJ

    public svn updated, license changed to BSD, some style and other patches from blogg.se folks - python patch imminent, still testing fnv one. off to the pub ;)

    3 août 2007, 19h16m
  • psz

    How does libketama relate to consistent hashing introduced recently in pecl/memcache?

    17 oct. 2007, 12h25m
  • RJ

    similar, but ketama uses md5 instead of crc32. I've not yet had a chance to test the new branch of pecl memcache cvs. there is a python and java implementation of the ketama algorithm in our SVN. So compatibility with non-php languages is the only reason to use it over the pecl branch really - at least until it is integrated into a memcache extension.

    17 oct. 2007, 15h36m
  • ludvigericson

    Though, I don't see why you'd use CRC32 for something that isn't data integrity checking. CRC32 is designed to give distant checksums only for two *similar* inputs, I'll try to explain better: When f is CRC32, f(AAA) = 9839781 (not true, example) f(AAB) = 2387523 ( ^ ) But, f(ZZZ) = 9839782 (same) What I'm trying to say is that CRCs are designed to detect small changes. MD5 isn't ideal either, but it's designed to have good distribution all-over. CRC32 isn't. Using a checksum as hashing is just backwards, IMHO. Again, MD5 isn't optimal either, but it's a better choice.

    17 mai 2008, 11h44m
  • mikhailp

    Hey guys, everything compiles smoothly on my machine and works with php5, but the "make test" command for libketama is broken OURHASH=`LD_LIBRARY_PATH=. ./ketama_test ${SERVERFILE} | md5sum | cut -b0-32` should read OURHASH=`LD_LIBRARY_PATH=. ./ketama_test ${SERVERFILE} | md5sum | cut -b1-32` The test then completes as expected and everything seems to work. Also, any update on the fnv patch? Cheers.

    19 juin 2008, 22h01m
Voir les 35 commentaires
Ajouter un commentaire. Connectez-vous à Last.fm ou inscrivez-vous (c'est gratuit).