Client connection timeout or disconnect errors with Apache ZooKeeper

Several ZooKeeper users have reported clients disconnecting and sessions expiring. Most of them run ZooKeeper with its out-of-the-box settings: tickTime=2000 and a timeout of 8000ms.
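To make those numbers concrete, here is a minimal sketch of how the server bounds the timeout a client asks for. It assumes the server keeps ZooKeeper's default limits (minSessionTimeout of 2 x tickTime and maxSessionTimeout of 20 x tickTime) and that the 8000ms figure is the session timeout the client requests. The function name is ours, for illustration only, and is not part of any ZooKeeper API.

```python
def negotiated_session_timeout(requested_ms, tick_time_ms=2000,
                               min_ms=None, max_ms=None):
    """Clamp a requested session timeout to the server's allowed range.

    Assumption: when minSessionTimeout / maxSessionTimeout are not set,
    the server uses 2 * tickTime and 20 * tickTime respectively.
    """
    if min_ms is None:
        min_ms = 2 * tick_time_ms
    if max_ms is None:
        max_ms = 20 * tick_time_ms
    return max(min_ms, min(requested_ms, max_ms))


if __name__ == "__main__":
    # With tickTime=2000 the allowed range is 4000ms to 40000ms,
    # so a request for 8000ms is accepted unchanged.
    print(negotiated_session_timeout(8000))
```

With these settings, a client that cannot reach the ensemble for longer than roughly 8 seconds is at risk of having its session expired, which is why a stall of only a few seconds can show up as a disconnect.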

Some users have reported that the cause is an oversubscribed service. This is especially common in virtual environments such as EC2. In essence, the node loses network connectivity for several seconds, and that causes the connection to fail.

Occasionally users have reported problems caused by garbage collection, but this is less likely. It is good practice to ensure that garbage collection logging is in place and to reduce how often garbage collection occurs.

So how do you fix this?

After a long while we discovered that the client is (unfortunately) flawed because it is written in PHP. The way it handles threading is problematic, as PHP is not known for being a thread-safe language. We have evaluated rewriting the client in a different language such as Python, but have not yet gone down that road.

The route we chose was to build in a success/failure mechanism that retries after x minutes if a subscription has not occurred. This can be done with a cron job that runs either a validation script or, if the subscription is missing, the key subscription event itself.
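A rough sketch of such a validation script is shown below. It is not our actual client code: is_subscribed and subscribe are hypothetical callables standing in for whatever check and subscription call your own client provides, and the attempt count and delay are arbitrary.

```python
import time


def ensure_subscribed(is_subscribed, subscribe, attempts=3,
                      delay_seconds=1.0, sleep=time.sleep):
    """Check for the subscription and try to re-create it if missing.

    is_subscribed and subscribe are hypothetical callables supplied by
    the caller; they stand in for the real client's check and
    subscription calls. Returns True if the subscription is in place
    by the end of the run.
    """
    for attempt in range(1, attempts + 1):
        if is_subscribed():
            return True
        try:
            subscribe()
        except Exception as exc:
            # A failed attempt is logged and retried, not fatal.
            print(f"attempt {attempt}: subscribe failed: {exc}")
        if attempt < attempts:
            sleep(delay_seconds)
    # Final check after the last attempt.
    return is_subscribed()


if __name__ == "__main__":
    # Stand-in state for demonstration only; a real script would
    # query the ZooKeeper client here.
    state = {"subscribed": False}
    ok = ensure_subscribed(lambda: state["subscribed"],
                           lambda: state.update(subscribed=True),
                           delay_seconds=0)
    raise SystemExit(0 if ok else 1)
```

Run from cron every few minutes (for example a crontab entry such as `*/5 * * * * python3 /path/to/check_subscription.py`, where the path and interval are placeholders), the script exits non-zero when the subscription could not be restored, so the failure can be picked up by whatever alerting is attached to the cron job.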
