Diagnosing cc-control crashes (aka "Cluster host connection failure")

If you receive the following message:
Code: [Select]
Cluster host connection failure for Local server: Connection refused (111)
...then the cc-control process has exited for some reason.

UPDATE - December 2013 - IMPORTANT!

We're finding that a lot of clients are skipping the initial steps in this article and jumping straight into debugging with gdb or enabling core dumps.  This is counterproductive.

The very first steps you should take when diagnosing this problem are as follows:

  • Ensure that cc-control hasn't simply been shut down.  Try running:
Code: [Select]
/etc/init.d/centovacast start
    If cc-control starts up and the problem does not recur, then there is nothing to diagnose and you can stop here.
    If cc-control gives an error at this point, then send us the error.  No need to troubleshoot beyond this point.

  • Normally, Centova Cast's cron job automatically restarts cc-control within 60 seconds if it exits for ANY reason. If cc-control remains down for more than a minute, it's likely that there is a problem with your cron job or that your firewall is blocking access to localhost on port 2198. Check your cron logs (/var/log/cron or /var/log/messages) to determine why the "/etc/init.d/centovacast check" cron job is not running correctly. This cron job should be configured in /etc/cron.d/centovacast.

  • If you believe cc-control is crashing and wish to troubleshoot cc-control to determine why, continue reading, but please read carefully.  As explained below, after enabling core dumps, you MUST wait until the next time cc-control crashes.
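The cron job and port checks from the second item above can be scripted roughly as follows. This is only a sketch: the file path, the "centovacast check" job text, and port 2198 are taken from this article, while the function names, the 5-second timeout, and the assumption that the cron file literally contains the text "centovacast check" are ours for illustration. It relies on bash's built-in /dev/tcp redirection and the coreutils timeout command.

```shell
# Sketch only: helper names are hypothetical, not part of Centova Cast.
CRON_FILE="${CRON_FILE:-/etc/cron.d/centovacast}"
CONTROL_PORT="${CONTROL_PORT:-2198}"

# Succeeds if file $1 exists and mentions the "centovacast check" job.
# (Assumes the job line contains that exact text.)
cron_job_present() {
    [ -f "$1" ] && grep -q 'centovacast check' "$1"
}

# Succeeds if something accepts TCP connections on 127.0.0.1 port $1.
# bash opens the socket via /dev/tcp; timeout guards against a firewall
# that silently drops packets instead of refusing the connection.
port_accepts_connections() {
    timeout 5 bash -c 'exec 3<>"/dev/tcp/127.0.0.1/$1"' _ "$1" 2>/dev/null
}

if cron_job_present "$CRON_FILE"; then
    echo "cron job found in $CRON_FILE"
else
    echo "cron job NOT found in $CRON_FILE"
fi

if port_accepts_connections "$CONTROL_PORT"; then
    echo "localhost:$CONTROL_PORT is accepting connections"
else
    echo "localhost:$CONTROL_PORT is refused or blocked"
fi
```

If the cron job is present but the port stays closed for more than a minute, the cron logs mentioned above are the next place to look.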



Debugging a crash

If cc-control is actually crashing and you wish to determine why, you can diagnose it as follows.

Update Centova Cast to the latest build

If you believe cc-control is in fact crashing with a segfault, run the update command to ensure that you are running the very latest build.

The data collected by the procedure below relies on values built into the executable file which change every time we rebuild cc-control. Accordingly, the data is only useful if we test it against the exact same build of cc-control that you are using. If you are using an outdated build of cc-control -- even if it's just a couple of days old -- the data will be totally useless to us.

So even if you think that you are running the latest build, update anyway:

Code: [Select]
/usr/local/centovacast/sbin/update

Enable core dumps

Enable core dumps on your server by running the following command as root:

Code: [Select]
/usr/local/centovacast/sbin/enable_coredumps start

Wait for a core dump

Periodically check /var/spool/coredumps and look for files named core.cc-control_*. As soon as at least one such file exists, zip (or tar/gzip) it up and send it to us in a support ticket.
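If you would rather not check by hand, the step above can be sketched as a small script. The directory and the core.cc-control_* pattern come from this article; the function names and the archive location (/tmp/cc-control-cores.tar.gz) are assumptions chosen for illustration. Note that the output path must be absolute, because the script changes into the core dump directory before running tar.

```shell
# Sketch only: helper names and the archive path are hypothetical.
COREDUMP_DIR="${COREDUMP_DIR:-/var/spool/coredumps}"
ARCHIVE="${ARCHIVE:-/tmp/cc-control-cores.tar.gz}"

# Print any core.cc-control_* files directly inside directory $1.
# Prints nothing if the directory is missing or has no matching files.
list_cc_control_cores() {
    find "$1" -maxdepth 1 -type f -name 'core.cc-control_*' 2>/dev/null
}

# Bundle the matching core files from directory $1 into the gzip-compressed
# tarball $2 (absolute path). Returns non-zero if there is nothing to archive.
archive_cores() {
    [ -n "$(list_cc_control_cores "$1")" ] || return 1
    # Run tar from inside the directory so the archive holds bare file names.
    (cd "$1" && tar -czf "$2" core.cc-control_*)
}

if archive_cores "$COREDUMP_DIR" "$ARCHIVE"; then
    echo "core dump(s) archived to $ARCHIVE; attach it to a support ticket"
else
    echo "no core.cc-control_* files in $COREDUMP_DIR yet"
fi
```

Running this from time to time (or from your own cron entry) saves you from checking the directory manually; core files can be large, so make sure the archive location has enough free space.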

Disable core dumps

Once you've sent us a core dump, disable core dumps by running:

Code: [Select]
/usr/local/centovacast/sbin/enable_coredumps stop

Also note that recent builds of Centova Cast include crash detection and recovery code. In the event of a control daemon crash, Centova Cast will automatically restart cc-control within a minute or two, and you may not even notice it was down. Be sure to check the debug log file periodically to see if a crash has been recorded.

Happy debugging. :)
Last Edit: May 12, 2016, 11:48:37 am by AlexiuB