
AMP Slow or Unresponsive

There are many possible causes for an AMP server becoming slow or unresponsive. This guide describes some possible reasons, and some commands and tools that can help diagnose the problem.

Possible reasons include:

  • CPU is maxed out
  • Memory usage is extremely high
  • SSH’ing is very slow (e.g. due to lack of entropy)
  • Out of disk space

See AMP Requirements for details of server requirements.

Machine Diagnostics

The following commands will collect OS-level diagnostics about the machine, and about the AMP process. The commands below assume use of CentOS 6.x. Minor adjustments may be required for other platforms.

OS and Machine Details

To display system information, run:

uname -a

To show details of the CPU and memory available to the machine, run:

cat /proc/cpuinfo
cat /proc/meminfo

User Limits

To display information about user limits, run the command below (while logged in as the same user who runs AMP):

ulimit -a

If AMP is run as a different user (e.g. with user name “adalovelace”), then instead run:

sudo -u adalovelace bash -c "ulimit -a"

Of particular interest is the limit for “open files”.

See Increase System Resource Limits for more information.

Disk Space

The command below lists the disk size for each partition, including the amount used and available. If the partition holding the AMP base directory, persistence directory or logging directory is almost full, this can cause serious problems:

df -h
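
For example, assuming the same install location used in the examples below, the partition holding the AMP base directory can be checked directly (substitute your own persistence and log directories as appropriate):

BROOKLYN_HOME=/home/users/brooklyn/apache-brooklyn-0.9.0-bin
df -h $BROOKLYN_HOME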

CPU and Memory Usage

To view the CPU and memory usage of all processes, and of the machine as a whole, one can use the top command. This runs interactively, updating every few seconds. To collect the output once (e.g. to share diagnostic information in a bug report), run:

top -n 1 -b > top.txt
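
If the load is intermittent, capturing several samples over time can be more revealing; the sketch below takes 10 samples at 30-second intervals (adjust as needed):

top -b -n 10 -d 30 > top-samples.txt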

File and Network Usage

To count the number of open files for the AMP process (which includes open socket connections):

BROOKLYN_HOME=/home/users/brooklyn/apache-brooklyn-0.9.0-bin
BROOKLYN_PID=$(cat $BROOKLYN_HOME/pid_java)
lsof -p $BROOKLYN_PID | wc -l
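
To see only the network connections held open by the AMP process (rather than all open file descriptors), the same PID can be combined with lsof's network listing:

lsof -a -i -p $BROOKLYN_PID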

To count the number of “established” network connections (drop the final wc -l to list them individually), run:

netstat -an | grep ESTABLISHED | wc -l
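
If the count is surprisingly high, a sketch like the following (using standard tools) shows which remote addresses account for most of the established connections:

netstat -an | grep ESTABLISHED | awk '{print $5}' | sort | uniq -c | sort -rn | head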

Linux Kernel Entropy

A lack of entropy can cause random number generation to be extremely slow. This can cause tasks like ssh to also be extremely slow. See Linux kernel entropy for details of how to work around this.
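
To check the entropy currently available to the kernel, run the command below; a persistently low value (e.g. below a few hundred) suggests entropy starvation:

cat /proc/sys/kernel/random/entropy_avail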

SSHD Limits

By default, Cloudsoft AMP re-uses SSH connections to machines on a per-location basis, keeping sessions open for up to 5 minutes while the entity/location is managed. If the same target machine is used via multiple SshMachineLocation instances (such as through BYON or localhost), this can trigger SSHD throttling.

This can be resolved by setting either sshCacheExpiryDuration: 10s or brooklyn.ssh.config.close: true on the location, as described here.

It can also be resolved by increasing MaxSessions and MaxStartups in sshd_config on the target system. More information on SSHD limits is documented here.
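
To inspect the current values on the target system, the sshd_config file can be checked directly (directives not present in the file take their default values):

sudo grep -Ei 'maxsessions|maxstartups' /etc/ssh/sshd_config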

Process Diagnostics

Thread and Memory Usage

To get memory and thread usage for the AMP (Java) process, two useful tools are jstack and jmap. These require the “development kit” to also be installed (e.g. yum install java-1.8.0-openjdk-devel). Some useful commands are shown below:

BROOKLYN_HOME=/home/users/brooklyn/apache-brooklyn-0.9.0-bin
BROOKLYN_PID=$(cat $BROOKLYN_HOME/pid_java)

jstack $BROOKLYN_PID
jmap -histo:live $BROOKLYN_PID
jmap -heap $BROOKLYN_PID
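
When sharing diagnostics (e.g. in a support request), it is often convenient to redirect the output of these commands to files:

jstack $BROOKLYN_PID > jstack.txt
jmap -histo:live $BROOKLYN_PID > jmap-histo.txt
jmap -heap $BROOKLYN_PID > jmap-heap.txt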

Runnable Threads

The jstack-active script is a convenient, lightweight way to quickly see which threads of a running AMP server are attempting to consume the CPU. It filters the output of jstack to show only the “really-runnable” threads (as opposed to those that are blocked).

BROOKLYN_HOME=/home/users/brooklyn/apache-brooklyn-0.9.0-bin
BROOKLYN_PID=$(cat $BROOKLYN_HOME/pid_java)

curl -O https://raw.githubusercontent.com/apache/brooklyn-dist/master/scripts/jstack-active.sh
chmod +x jstack-active.sh

./jstack-active.sh $BROOKLYN_PID

OSGi Resolution

The Karaf OSGi subsystem can in some cases spend a lot of time working out which bundles should provide which packages. Often this can be fixed by removing redundant bundles. To observe this, add the following to etc/org.ops4j.pax.logging.cfg (no restart is normally needed):

log4j2.logger.felix.name = org.apache.felix
log4j2.logger.felix.level = DEBUG

To get a unique list of duplicate packages:

grep chains data/log/*.debug.log | sed 's/.*exposed to //' | sed 's/from .*//' | sort | uniq
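
A variation of the same pipeline that counts how often each package appears can help prioritise which bundles to examine first:

grep chains data/log/*.debug.log | sed 's/.*exposed to //' | sed 's/from .*//' | sort | uniq -c | sort -rn | head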

Profiling

If an in-depth investigation of the CPU usage (and/or object creation) of an AMP server is required, there are many profiling tools designed specifically for this purpose. These generally require that the process be launched in such a way that a profiler can attach, which may not be appropriate for a production server.

Log Files

Cloudsoft AMP will by default create brooklyn.info.log and brooklyn.debug.log files. See the Logging docs for more information.

The following are useful log messages to search for (e.g. using grep). Note that the wording of these messages (or their very presence) may change in future versions of AMP.

Normal Logging

The lines below are commonly logged, and can be useful markers when searching for the start of a particular section of logging.

2016-05-30 17:05:51,458 INFO  o.a.b.l.AMPWebServer [main]: Started AMP console at http://127.0.0.1:8081/, running classpath://brooklyn.war
2016-05-30 17:06:04,098 INFO  o.a.b.c.m.h.HighAvailabilityManagerImpl [main]: Management node tF3GPvQ5 running as HA MASTER autodetected
2016-05-30 17:06:08,982 INFO  o.a.b.c.m.r.InitialFullRebindIteration [brooklyn-execmanager-rvpnFTeL-0]: Rebinding from /home/compose/compose-amp-state/brooklyn-persisted-state/data for master rvpnFTeL...
2016-05-30 17:06:11,105 INFO  o.a.b.c.m.r.RebindIteration [brooklyn-execmanager-rvpnFTeL-0]: Rebind complete (MASTER) in 2s: 19 apps, 54 entities, 50 locations/reference, 46 policies, 704 enrichers, 0 feeds, 160 catalog items

Memory Usage

The debug log includes (every minute) a log statement about the memory usage and task activity. For example:

2016-05-27 12:20:19,395 DEBUG o.a.b.c.m.i.AMPGarbageCollector [brooklyn-gc]: AMP gc (before) - using 328 MB / 496 MB memory (5.58 kB soft); 69 threads; storage: {datagrid={size=7, createCount=7}, refsMapSize=0, listsMapSize=0}; tasks: 10 active, 33 unfinished; 78 remembered, 1696906 total submitted)
2016-05-27 12:20:19,395 DEBUG o.a.b.c.m.i.AMPGarbageCollector [brooklyn-gc]: AMP gc (after) - using 328 MB / 496 MB memory (5.58 kB soft); 69 threads; storage: {datagrid={size=7, createCount=7}, refsMapSize=0, listsMapSize=0}; tasks: 10 active, 33 unfinished; 78 remembered, 1696906 total submitted)

These can be extremely useful if investigating a memory or thread leak, or to determine whether a surprisingly high number of tasks are being executed.
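
To extract these lines over time (for example, to look for steadily growing memory, thread or task counts), grep the debug log for the brooklyn-gc logger (adjust the path to your log directory as needed):

grep "brooklyn-gc" brooklyn.debug.log | grep "(after)"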

Subscriptions

One source of high CPU usage in AMP is a subscription (e.g. for a policy or enricher) being triggered many times, i.e. handling many events. A log message like the one below is written for every 1000 events handled by a single subscription.

2016-05-30 17:29:09,125 DEBUG o.a.b.c.m.i.LocalSubscriptionManager [brooklyn-execmanager-rvpnFTeL-8]: 1000 events for subscriber Subscription[SCFnav9g;CanopyComposeApp{id=gIeTwhU2}@gIeTwhU2:webapp.url]
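
To find the busiest subscribers, these messages can be counted per subscription (a sketch, assuming the default debug log file name):

grep "events for subscriber" brooklyn.debug.log | sed 's/.*events for subscriber //' | sort | uniq -c | sort -rn | head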

If a subscription is handling a huge number of events, there are a couple of common reasons:

  • first, it could be subscribing to too much activity - e.g. a wildcard subscription for all events from all entities.
  • second, it could be an infinite loop (e.g. where an enricher responds to a sensor-changed event by setting that same sensor, thus triggering another sensor-changed event).

User Activity

All activity triggered by the REST API or web-console will be logged. Some examples are shown below:

2016-05-19 17:52:30,150 INFO  o.a.b.r.r.ApplicationResource [brooklyn-jetty-server-8081-qtp1058726153-17473]: Launched from YAML: name: My Example App
location: aws-ec2:us-east-1
services:
- type: org.apache.brooklyn.entity.webapp.tomcat.TomcatServer

2016-05-30 14:46:19,516 DEBUG brooklyn.REST [brooklyn-jetty-server-8081-qtp1104967201-20881]: Request Tisj14 starting: POST /v1/applications/NiBy0v8Q/entities/NiBy0v8Q/expunge from 77.70.102.66
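
To list all REST requests recorded in the debug log (for example, to see what a user or script was doing at a particular time), a simple grep is usually enough:

grep "brooklyn.REST" brooklyn.debug.log | grep "starting:"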

Entity Activity

If investigating the behaviour of a particular entity (e.g. on failure), it can be very useful to grep the info and debug log for the entity’s id. For a software process, the debug log will include the stdout and stderr of all the commands executed by that entity.
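
For example, using the entity id mPujYmPd that appears in the effector log line below (and assuming the default log file names):

grep "mPujYmPd" brooklyn.info.log brooklyn.debug.log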

It can also be very useful to search for all effector invocations, to see where the behaviour has been triggered:

2016-05-27 12:45:43,529 DEBUG o.a.b.c.m.i.EffectorUtils [brooklyn-execmanager-gvP7MuZF-14364]: Invoking effector stop on TomcatServerImpl{id=mPujYmPd}