AMP Slow or Unresponsive
There are many possible causes for an AMP server becoming slow or unresponsive. This guide describes some possible reasons, and some commands and tools that can help diagnose the problem.
Possible reasons include:
- CPU is maxed out
- Memory usage is extremely high
- SSH’ing is very slow (e.g. due to lack of entropy)
- Out of disk space
See AMP Requirements for details of server requirements.
Machine Diagnostics
The following commands will collect OS-level diagnostics about the machine, and about the AMP process. The commands below assume use of CentOS 6.x. Minor adjustments may be required for other platforms.
OS and Machine Details
To display system information, run:
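```bash
uname -a
```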
To show details of the CPU and memory available to the machine, run:
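```bash
cat /proc/cpuinfo
cat /proc/meminfo
```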
User Limits
To display information about user limits, run the command below (while logged in as the same user who runs AMP):
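```bash
ulimit -a
```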
If AMP is run as a different user (e.g. with user name “adalovelace”), then instead run:
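```bash
sudo su - adalovelace -c 'ulimit -a'
```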
Of particular interest is the limit for “open files”.
See Increase System Resource Limits for more information.
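As an illustrative sketch (the user name and values are examples only), the “open files” limit is typically raised by adding entries such as the following to `/etc/security/limits.conf`:

```
adalovelace soft nofile 16384
adalovelace hard nofile 16384
```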
Disk Space
The command below will list the disk size for each partition, including the amount used and available. If the partition holding the AMP base directory, persistence directory or logging directory is close to 0% available, this can cause serious problems:
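```bash
df -h
```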
CPU and Memory Usage
To view the CPU and memory usage of all processes, and of the machine as a whole, one can use the `top` command. This runs interactively, updating every few seconds. To collect the output once (e.g. to share diagnostic information in a bug report), run:
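```bash
# batch mode, single iteration; redirect to a file to share the output
top -b -n 1 > top-output.txt
```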
File and Network Usage
To count the number of open files for the AMP process (which includes open socket connections):
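```bash
# substitute <pid> with the process id of the AMP Java process
lsof -p <pid> | wc -l
```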
To count (or view) the number of “established” internet connections, run:
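```bash
netstat -an | grep -i established | wc -l
```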
Linux Kernel Entropy
A lack of entropy can cause random number generation to be extremely slow. This can cause tasks like ssh to also be extremely slow. See Linux kernel entropy for details of how to work around this.
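As a quick check, the entropy currently available to the kernel can be read as shown below; a persistently low value (e.g. only a few hundred) suggests entropy starvation:

```bash
cat /proc/sys/kernel/random/entropy_avail
```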
SSHD Limits
Cloudsoft AMP will attempt to re-use the SSH connections to machines on a per-location basis by default, keeping sessions open for up to 5 minutes if the entity/location is managed. If the same target is used via multiple `SshMachineLocation` instances (such as through BYON or localhost), this may trigger SSHD throttling. This can be resolved by setting either `sshCacheExpiryDuration: 10s` or `brooklyn.ssh.config.close: true` on the location, as described here.
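For example, a minimal sketch of applying one of these settings to a BYON location (the host address is illustrative):

```yaml
location:
  byon:
    hosts:
      - 192.168.0.18
    sshCacheExpiryDuration: 10s
```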
It could also be resolved by increasing `MaxSessions` and `MaxStartups` in `sshd_config` on the target system.
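For example, in `sshd_config` on the target system (the values below are illustrative only; sshd must be restarted after editing):

```
MaxSessions 100
MaxStartups 100:30:200
```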
More information on SSHD limits is documented here.
Process Diagnostics
Thread and Memory Usage
To get memory and thread usage for the AMP (Java) process, two useful tools are `jstack` and `jmap`. These require the “development kit” to also be installed (e.g. `yum install java-1.8.0-openjdk-devel`). Some useful commands are shown below:
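```bash
# substitute <pid> with the process id of the AMP Java process
jstack <pid>                                       # dump all thread stacks
jmap -histo:live <pid>                             # histogram of live heap objects
jmap -dump:live,format=b,file=heap.hprof <pid>     # full heap dump for offline analysis
```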
Runnable Threads
The `jstack-active` script is a convenient light-weight way to quickly see which threads of a running AMP server are attempting to consume the CPU. It filters the output of `jstack`, to show only the “really-runnable” threads (as opposed to those that are blocked).
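Assuming the script is on the `PATH` and takes the process id as its argument, a typical invocation would be:

```bash
jstack-active <pid>
```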
OSGi Resolution
The Karaf OSGi subsystem can in some cases spend a lot of CPU time identifying which bundles should provide which packages.
Often this can be resolved by removing redundant bundles.
To observe this, add the following to `etc/org.ops4j.pax.logging.cfg` (no restart is normally needed):

```
log4j2.logger.felix.name = org.apache.felix
log4j2.logger.felix.level = DEBUG
```
To get a unique list of duplicate packages:

```bash
grep chains data/log/*.debug.log | sed 's/.*exposed to //' | sed 's/from .*//' | sort | uniq
```
Profiling
If an in-depth investigation of the CPU usage (and/or object creation) of an AMP server is required, there are many profiling tools designed specifically for this purpose. These generally require that the process be launched in such a way that a profiler can attach, which may not be appropriate for a production server.
Log Files
Cloudsoft AMP will by default create brooklyn.info.log and brooklyn.debug.log files. See the Logging docs for more information.
The following are useful log messages to search for (e.g. using `grep`). Note that the wording of these messages (or their very presence) may change in future versions of AMP.
Normal Logging
The lines below are commonly logged, and can be useful to search for when locating the start of a section of logging.
Memory Usage
The debug log includes (every minute) a log statement about the memory usage and task activity. For example:
These can be extremely useful if investigating a memory or thread leak, or to determine whether a surprisingly high number of tasks are being executed.
Subscriptions
One source of high CPU usage in AMP is when a subscription (e.g. for a policy or enricher) is being triggered many times (i.e. handling many events). A log message like that below will be written for every 1000 events handled by a given subscription.
If a subscription is handling a huge number of events, there are a couple of common reasons:
- first, it could be subscribing to too much activity, e.g. a wildcard subscription for all events from all entities.
- second, it could be stuck in an infinite loop (e.g. where an enricher responds to a sensor-changed event by setting that same sensor, thus triggering another sensor-changed event).
User Activity
All activity triggered by the REST API or web-console will be logged. Some examples are shown below:
Entity Activity
If investigating the behaviour of a particular entity (e.g. on failure), it can be very useful to `grep` the info and debug log for the entity’s id. For a software process, the debug log will include the stdout and stderr of all the commands executed by that entity.
It can also be very useful to search for all effector invocations, to see where the behaviour has been triggered:
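As a sketch (the entity id below is a made-up example; substitute the real id, and adjust the log file paths to match your logging configuration):

```bash
# search both logs for all mentions of a particular entity's id (example id)
grep "vqg50mjj" brooklyn.info.log brooklyn.debug.log

# list effector-related log lines, to see which effectors were invoked and when
grep -i "effector" brooklyn.info.log
```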