Troubleshooting: Going Deep in Java and Logs
This guide takes a deep look at the Java and log messages for some failure scenarios, giving common steps used to identify the issues.
Script Failure
Many blueprints run bash scripts as part of the installation. TOSCA nodes are configured using interfaces. Executable artifacts can be attached to operations and run during the application lifecycle.
The tosca-tomcat9-node
type that is used in the examples in this section, is declared as shown below:
tosca-tomcat9-node:
derived_from: tosca.nodes.WebServer
# properties & attributes omitted for brevity
interfaces:
Standard:
create:
inputs:
TOMCAT_gz: { get_property: [ SELF, tomcat.url ] }
implementation:
primary: classpath://tomcat9-scripts/tomcat-create.sh
configure:
implementation:
primary: classpath://tomcat9-scripts/tomcat-configure.sh
dependencies:
- ROOT_war
start: classpath://tomcat9-scripts/tomcat-start.sh
stop: classpath://tomcat9-scripts/tomcat-stop.sh
artifacts:
ROOT_war:
type: tosca.artifacts.File
Each phase of the lifecycle is implemented by a script attached to the operation with the same name. To produce a failure, the tomcat-configure.sh
bash script is modified and a code snippet that creates four directories is added, but the command is incorrect.
#!/bin/bash
# the configuration code omitted for brevity
mkrid -p conf logs webapps temp # `mkdir` changed to `mkrid`
The bundle containing the tosca-tomcat9-node
type, must be rebuilt and re-added to the catalog. Now let’s try deploying an application this using the following YAML:
tosca_definitions_version: tosca_simple_yaml_1_3
metadata:
template_name: Tomcat failure test
topology_template:
node_templates:
f316eyHX:
type: tosca-tomcat9-node
artifacts:
ROOT_war:
file: classpath://io.cloudsoft.brooklyn.tosca13.tosca-3-tier:resources/tosca-hello-world-sql-webapp.war
Shortly after deployment, the entity fails with the following error:
SSH task ended with exit code 127 when 0 was required, in Task[run bash script: classpath://tomcat9-scripts/tomcat-configure.sh]@ulRPaBCT: run bash script: classpath://tomcat9-scripts/tomcat-configure.sh
By expanding the stderr
section, we can see the script failed with the following error:
This tells us what went wrong, but doesn’t tell us where. In order to find that, we’ll need to look at the stack trace that was logged when the exception was thrown.
It’s always worth looking at the Detailed Status section as sometimes this will give you the information you need. In this case, the stack trace is limited to the thread that was used to execute the task that ran the script:
In order to find the exception, we’ll need to look in AMP’s debug log file. By default, the debug log file is named amp.debug.log
. Usually the easiest way to navigate the log file is to use less
, e.g. less amp.debug.log
. We can quickly find the stack trace by first navigating to the end of the log file with Shift-G
, then performing a reverse-lookup by typing ?Tomcat
and pressing Enter
. If searching for the blueprint type (in this case Tomcat) simply matches tasks unrelated to the exception, you can also search for
the text of the error message, in this case return status 127
. You can make the search case-insensitivity by typing -i
before performing the search. To skip the current match and move to the next one (i.e. ‘up’ as we’re performing a reverse-lookup), simply press n
.
In this case, the ?Tomcat
search takes us directly to the full stack trace (Only the last part of the trace is shown here):
Caused by: java.util.concurrent.ExecutionException: org.apache.brooklyn.util.exceptions.PropagatedRuntimeException: Error invoking start at ToscaSoftwareComponentNodeEntityImpl{id=oq7xztoxnm}: Error invoking Standard.configure at ToscaSoftwareComponentNodeEntityImpl{id=oq7xztoxnm}: SSH task ended with exit code 127 when 0 was required, in Task[run bash script: classpath://tomcat9-scripts/tomcat-configure.sh]@ulRPaBCT: run bash script: classpath://tomcat9-scripts/tomcat-configure.sh
at java.util.concurrent.FutureTask.report(FutureTask.java:122) ~[?:1.8.0_282]
at java.util.concurrent.FutureTask.get(FutureTask.java:192) ~[?:1.8.0_282]
at com.google.common.util.concurrent.ForwardingFuture.get(ForwardingFuture.java:63) ~[!/:?]
at org.apache.brooklyn.util.core.task.BasicTask.get(BasicTask.java:390) ~[?:?]
...
Caused by: java.lang.IllegalStateException: SSH task ended with exit code 127 when 0 was required, in Task[run bash script: classpath://tomcat9-scripts/tomcat-
configure.sh]@ulRPaBCT: run bash script: classpath://tomcat9-scripts/tomcat-configure.sh
at org.apache.brooklyn.util.core.task.system.ProcessTaskWrapper$ProcessTaskInternalJob.logWithDetailsAndThrow(ProcessTaskWrapper.java:151) ~[?:?]
at org.apache.brooklyn.util.core.task.system.ProcessTaskWrapper$ProcessTaskInternalJob.call(ProcessTaskWrapper.java:125) ~[?:?]
... 5 more
Combining this with the error message of mkrid: command not found
we can see that indeed mkdir
has been misspelled mkrid
on line 21 of classpath://tomcat9-scripts/tomcat-configure.sh
executable artifact.
External Failure
Sometimes an entity will fail outside the direct commands issues by AMP. When installing and launching an entity, AMP will check the return code of scripts that were run to ensure that they completed successfully (i.e. the return code of the script is zero). It is possible, for example, that a launch script completes successfully, but the entity fails to start.
We can simulate this type of failure by launching Tomcat without an artifact to deploy. Since the WAR artifact deployed on the Tomcat instance is provided via URL, the easiest way is to modify the url to search for a version of that artifact that does not exist. So changing this URL
http://search.maven.org/remotecontent?filepath=org/apache/brooklyn/example/brooklyn-example-hello-world-sql-webapp/0.9.0/brooklyn-example-hello-world-sql-webapp-0.9.0.war
into
http://search.maven.org/remotecontent?filepath=org/apache/brooklyn/example/brooklyn-example-hello-world-sql-webapp/2.0.0/brooklyn-example-hello-world-sql-webapp-2.0.0.war
Should cause a failure. AMP has no way to know if the artifact is available or not until the tomcat-configure.sh
script tries to download it.
Let’s deploy Tomcat again, using the same YAML as before. This time, the deployment runs for a few minutes before failing
with Required entity not healthy
:
If we drill down into the tasks in the Activities
tab, we can see that all the tasks up to Standard.configure
completed successfully, and stdout of the tomcat-configure.sh
script is as follows:
The Detailed section of the Standard.configure
is as follows:
Failed after 1.82s: Error invoking Standard.configure at ToscaSoftwareComponentNodeEntityImpl{id=ifharqxgkb}: Error getting resource 'http://search.maven.org/remotecontent?filepath=org/apache/brooklyn/example/brooklyn-example-hello-world-sql-webapp/2.0.0/brooklyn-example-hello-world-sql-webapp-2.0.0.war' for ToscaSoftwareComponentNodeEntityImpl{id=ifharqxgkb}: Invalid response invoking http://search.maven.org/remotecontent?filepath=org/apache/brooklyn/example/brooklyn-example-hello-world-sql-webapp/2.0.0/brooklyn-example-hello-world-sql-webapp-2.0.0.war: response code 404
org.apache.brooklyn.core.mgmt.internal.EffectorUtils$EffectorCallPropagatedRuntimeException: Error invoking Standard.configure at ToscaSoftwareComponentNodeEntityImpl{id=ifharqxgkb}: Error getting resource 'http://search.maven.org/remotecontent?filepath=org/apache/brooklyn/example/brooklyn-example-hello-world-sql-webapp/2.0.0/brooklyn-example-hello-world-sql-webapp-2.0.0.war' for ToscaSoftwareComponentNodeEntityImpl{id=ifharqxgkb}: Invalid response invoking http://search.maven.org/remotecontent?filepath=org/apache/brooklyn/example/brooklyn-example-hello-world-sql-webapp/2.0.0/brooklyn-example-hello-world-sql-webapp-2.0.0.war: response code 404
at org.apache.brooklyn.core.mgmt.internal.EffectorUtils$EffectorCallPropagatedRuntimeException.propagate(EffectorUtils.java:316)
at org.apache.brooklyn.core.mgmt.internal.EffectorUtils$EffectorCallPropagatedRuntimeException.access$100(EffectorUtils.java:283)
at org.apache.brooklyn.core.mgmt.internal.EffectorUtils.handleEffectorException(EffectorUtils.java:324)
at org.apache.brooklyn.core.effector.EffectorTasks$EffectorBodyTaskFactory$2.handleException(EffectorTasks.java:90)
at org.apache.brooklyn.util.core.task.DynamicSequentialTask.handleException(DynamicSequentialTask.java:480)
at org.apache.brooklyn.util.core.task.DynamicSequentialTask$DstJob.call(DynamicSequentialTask.java:428)
at org.apache.brooklyn.util.core.task.BasicExecutionManager$SubmissionCallable.call(BasicExecutionManager.java:874)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.brooklyn.util.exceptions.PropagatedRuntimeException: Error getting resource 'http://search.maven.org/remotecontent?filepath=org/apache/brooklyn/example/brooklyn-example-hello-world-sql-webapp/2.0.0/brooklyn-example-hello-world-sql-webapp-2.0.0.war' for ToscaSoftwareComponentNodeEntityImpl{id=ifharqxgkb}: Invalid response invoking http://search.maven.org/remotecontent?filepath=org/apache/brooklyn/example/brooklyn-example-hello-world-sql-webapp/2.0.0/brooklyn-example-hello-world-sql-webapp-2.0.0.war: response code 404
at org.apache.brooklyn.util.exceptions.Exceptions.propagate(Exceptions.java:129)
at org.apache.brooklyn.util.core.task.BasicTask.getUnchecked(BasicTask.java:401)
at org.apache.brooklyn.core.effector.EffectorTasks$EffectorMarkingTaskFactory$1.call(EffectorTasks.java:128)
at org.apache.brooklyn.core.effector.EffectorTasks$EffectorBodyTaskFactory$1.call(EffectorTasks.java:82)
at org.apache.brooklyn.util.core.task.DynamicSequentialTask$DstJob.call(DynamicSequentialTask.java:370)
... 5 more
...
Caused by: java.lang.IllegalStateException: Invalid response invoking http://search.maven.org/remotecontent?filepath=org/apache/brooklyn/example/brooklyn-example-hello-world-sql-webapp/2.0.0/brooklyn-example-hello-world-sql-webapp-2.0.0.war: response code 404
at org.apache.brooklyn.util.core.ResourceUtils.getResourceViaHttp(ResourceUtils.java:476)
at org.apache.brooklyn.util.core.ResourceUtils.getResourceFromUrl(ResourceUtils.java:282)
... 23 more
The stacktrace is quite readable and points to the real problem.
Many blueprints run bash scripts as part of the installation. This section highlights how to identify a problem with a bash script.
First let’s take a look at the customize()
method of the Tomcat server blueprint:
Here we can see that it’s running a script to create four directories before continuing with the customization. Let’s
introduce an error by changing mkdir
to mkrid
:
Now let’s try deploying this using the following YAML:
Shortly after deployment, the entity fails with the following error:
Error in task: ssh: customizing TomcatServerImpl{id=u9eof0hszp}
Execution failed, invalid result 127 for customizing TomcatServerImpl{id=u9eof0hszp}
We can drill into the task that fail
ed, directly by clicking the “More information” button or by selecting the Activities
tab:
the list of tasks shown (where the effectors are shown as top-level tasks) are clickable links. Selecting that row will show the details of
that particular task, including its sub-tasks. We can eventually get to the specific sub-task that failed:
By expanding the stderr
section, we can see the script failed with the following error:
This tells us what went wrong, but doesn’t tell us where. In order to find that, we’ll need to look at the stack trace that was logged when the exception was thrown.
It’s always worth looking at the Detailed Status section as sometimes this will give you the information you need. In this case, the stack trace is limited to the thread that was used to execute the task that ran the script:
In order to find the exception, we’ll need to look in AMP’s debug log file. By default, the debug log file
is named brooklyn.debug.log
. Usually the easiest way to navigate the log file is to use less
, e.g.
less brooklyn.debug.log
. We can quickly find the stack trace by first navigating to the end of the log file
with Shift-G
, then performing a reverse-lookup by typing ?Tomcat
and pressing Enter
. If searching for the
blueprint type (in this case Tomcat) simply matches tasks unrelated to the exception, you can also search for
the text of the error message, in this case ? invalid result 127
. You can make the search case-insensitivity by
typing -i
before performing the search. To skip the current match and move to the next one (i.e. ‘up’ as we’re
performing a reverse-lookup), simply press n
In this case, the ?Tomcat
search takes us directly to the full stack trace (Only the last part of the trace
is shown here):
AMP’s use of tasks and helper classes can make the stack trace a little harder than usual to follow, but a good
place to start is to look through the stack trace for the node’s implementation or ssh driver classes (usually
named FooNodeImpl
or FooSshDriver
). In this case we can see the following:
Combining this with the error message of mkrid: command not found
we can see that indeed mkdir
has been
misspelled mkrid
on line 72 of TomcatSshDriver.java
.
Non-Script Failure
The section above gives an example of a failure that occurs when a script is run. In this section we will look at
a failure in a non-script related part of the code. We’ll use the customize()
method of the Tomcat server again,
but this time, we’ll correct the spelling of ‘mkdir’ and add a line that attempts to copy a nonexistent resource
to the remote server:
Let’s deploy this using the same YAML from above. Here’s the resulting error in the AMP debug console:
Again, this tells us what the error is, but we need to find where the code is that attempts to copy this file. In this case it’s shown in the Detailed Status section, and we don’t need to go to the log file:
Looking for Tomcat
in the stack trace, we can see in this case the problem lies at line 79 of TomcatSshDriver.java
External Failure
Sometimes an entity will fail outside the direct commands issues by AMP. When installing and launching an entity, AMP will check the return code of scripts that were run to ensure that they completed successfully (i.e. the return code of the script is zero). It is possible, for example, that a launch script completes successfully, but the entity fails to start.
We can simulate this type of failure by launching Tomcat with an invalid configuration file. As seen in the previous
examples, AMP copies two xml configuration files to the server: server.xml
and web.xml
The first few non-comment lines of server.xml
:
Let’s add an unmatched XML element, which will make this XML file invalid:
As AMP doesn’t know how these types of resources are used, they’re not validated as they’re copied to the remote machine. As far as AMP is concerned, the file will have copied successfully.
Let’s deploy Tomcat again, using the same YAML as before. This time, the deployment runs for a few minutes before failing
with Timeout waiting for SERVICE_UP
:
If we drill down into the tasks in the Activities
tab, we can see that all of the installation and launch tasks
completed successfully, and stdout of the launch
script is as follows:
The task that failed was the post-start
task, and the stack trace from the Detailed Status section is as follows:
This doesn’t really tell us what we need to know, and looking in the brooklyn.debug.log
file yields no further
clues. The key here is the error message Timeout waiting for SERVICE_UP
. After running the installation and
launch scripts, assuming all scripts completed successfully, AMP will periodically check the health of the node
and will set the node on fire if the health check does not pass within a pre-prescribed period (the default is
two minutes, and can be configured using the start.timeout
config key). The periodic health check also continues
after the successful launch in order to check continued operation of the node, but in this case it fails to pass
at all.
The first thing we need to do is to find out how AMP determines the health of the node. The health-check is
often implemented in the isRunning()
method in the entity’s ssh driver. Tomcat’s implementation of isRunning()
is as follows:
The newScript
method has conveniences for default scripts to check if a process is running based on its PID. In this
case, it will look for Tomcat’s PID in the pid.txt
file and check if the PID is the PID of a running process
It’s worth a quick sanity check at this point to check if the PID file exists, and if the process is running.
By default, the pid file is located in the run directory of the entity. You can find the location of the entity’s run
directory by looking at the run.dir
sensor. In this case it is /tmp/brooklyn-martin/apps/jIzIHXtP/entities/TomcatServer_BUHgQeOs
.
To find the pid, you simply cat the pid.txt file in this directory:
In this case, the PID in the file is 73714. You can then check if the process is running using ps
. You can also
pipe the output to fold
so the full launch command is visible:
This confirms that the process is running. The next thing we can look at is the service.notUp.indicators
sensor. This
reads as follows:
This confirms that the problem is indeed due to the service.process.isRunning
sensor. We assumed earlier that this was
set by the isRunning()
method in TomcatSshDriver.java
, but this isn’t always the case. The service.process.isRunning
sensor is wired up by the connectSensors()
method in the node’s implementation class, in this case
TomcatServerImpl.java
. Tomcat’s implementation of connectSensors()
is as follows:
We can see here that if jmx is not enabled, the method will call connectServiceUpIsRunning()
which will use the
default PID-based method of determining if a process is running. However, as JMX is running, the service.process.isRunning
sensor (denoted here by the SERVICE_PROCESS_IS_RUNNING
variable) is set to true if and only if the
stateName
JMX attribute equals STARTED
. We can see from the previous call to .pollAttribute
that this
attribute is also published to the CONNECTOR_STATUS
sensor. The CONNECTOR_STATUS
sensor is defined as follows:
Let’s go back to the AMP debug console and look for the webapp.tomcat.connectorStatus
:
The sensor value is null or not set. We know from previous steps that the installation and launch scripts completed, and we know the procecess is running, but we can see here that the server is not responding to JMX requests. A good thing to check here would be that the JMX port is not being blocked by iptables, firewalls or security groups (see the troubleshooting connectivity guide). Let’s assume that we’ve checked that and they’re all open. There is still one more thing that AMP can tell us.
Still on the Sensors
tab, let’s take a look at the log.location
sensor:
This is the location of Tomcat’s own log file. The location of the log file will differ from process to process
and when writing a custom entity you will need to check the software’s own documentation. If your blueprint’s
ssh driver extends JavaSoftwareProcessSshDriver
, the value returned by the getLogFileLocation()
method will
automatically be published to the log.location
sensor. Otherwise, you can publish the value yourself by calling
entity.setAttribute(Attributes.LOG_FILE_LOCATION, getLogFileLocation());
in your ssh driver
Note: The log file will be on the server to which you have deployed Tomcat, and not on the AMP server. Let’s take a look in the log file:
As expected, we can see here that the unmatched-element
element has not been terminated in the server.xml
file.