3-3 Policies
When something goes wrong, we need to fix the problem or raise an alert. Most organizations have many important tools in this space. AMP aims to standardize and simplify their use across an organization, ensuring they are baked into blueprints and pulling together the results so that all stakeholders get a holistic view.
A “policy” in Cloudsoft AMP consists of three aspects:
- Monitor: when should it run, usually either time-based or sensor-based
- Analyze/Plan: what checks it should do, ranging from a trivial condition (or even always-apply) to sophisticated computation running external tools
- Execute: what action should it apply based on the analysis, often invoking an effector or emitting a sensor
In common parlance, a policy is often just a statement of what is and isn’t allowed. AMP’s philosophy is that, to be runnable, a policy should include details of the monitoring, the analysis/planning, and the resulting execution. A policy statement that doesn’t specify when it needs to be checked (monitor) cannot be automated; similarly, a policy statement that doesn’t specify how it is evaluated (analyze/plan) cannot be automated; and a policy statement that doesn’t specify a consequence (execute) is useless!
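To make this concrete, here is a hypothetical minimal sketch, using the workflow style of triggers, condition, and steps that appears in the exercises below; the sensor name, threshold, and effector are invented for illustration. It shows how a plain-English rule such as “disk usage must stay below 80%” carries all three aspects:

brooklyn.policies:
  - type: workflow-policy        # hypothetical minimal example
    brooklyn.config:
      triggers:                  # monitor: run whenever this sensor updates
        - disk.percent.used      # hypothetical sensor name
      condition:                 # analyze/plan: act only when the check fails
        sensor: disk.percent.used
        greater-than: 80
      steps:                     # execute: invoke a (hypothetical) remediation effector
        - invoke-effector cleanup-temp-files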
In practice, these three aspects are simple to define and are often baked into the policy being used, so that a blueprint author has minimal work to do. We will explore this with several in-life management policies building on the previous exercise. This is a short exercise – about 30 minutes – adding the policies in quick succession, then redeploying and exploring them.
Built-in policies: Terraform drift
Some types in AMP have built-in policies. The Terraform type, for example, allows enabling drift detection: every minute (monitor), check that the Terraform template matches actual resources (analyze/plan), and emit a sensor indicating compliance or failure (execute).
To enable this, simply add the following to the two terraform entities:
brooklyn.initializers:
  - type: terraform-drift-compliance-check
    brooklyn.config:
      terraform.resources-drift.enabled: true
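For context, this is roughly how the initializer sits on one of those entities; the terraform type alias and entity name here are assumed from the previous exercise, so adapt this fragment to your blueprint:

  - type: terraform               # assumed type alias from the previous exercise
    name: Bastion Server
    brooklyn.initializers:
      - type: terraform-drift-compliance-check
        brooklyn.config:
          terraform.resources-drift.enabled: true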
Automation: scheduling actions
The next policy – the cron-scheduler installed as a policy to the AMP Catalog – allows us to specify that effectors should be invoked according to a schedule: the schedule is the monitor aspect, the analyze/plan aspect is trivial because it always applies (although it could be enhanced to support conditions), and the effector specified by the blueprint author is the execute aspect.
Here let’s attach a schedule to the EFS bastion server. It normally isn’t needed out of hours, and running it only during the day saves money and energy and is more secure. Because that Terraform template is a separate entity in AMP, we can control its lifecycle separately by attaching this policy to that entity, instructing it to start at 8.30am and stop at 6.00pm, Monday to Friday.
This can be done by adding the following block to the “Bastion Server” terraform entity, using a brooklyn.policies block rather than brooklyn.initializers because we are using a policy saved in the catalog:
brooklyn.policies:
  - type: cron-scheduler
    brooklyn.config:
      entries:
        - when: '0 30 8 * * mon-fri'
          effector: start
        - when: '0 0 18 * * mon-fri'
          effector: stop
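Note the six-field cron expressions used here (seconds, minutes, hours, day-of-month, month, day-of-week): '0 30 8 * * mon-fri' fires at 08:30:00 and '0 0 18 * * mon-fri' at 18:00:00, Monday to Friday.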
Compliance: monitoring elastic storage use
Previously, we created two efs-size-* sensors. One simple health and compliance check is to ensure that the size of data in EFS is reasonable. Let’s add a new policy that checks whether the size is too big and, if so, publishes a non-compliance dashboard sensor and opens an issue.
In this case we are computing a sensor (what to execute), with a declared set of trigger sensors (the monitor) and conditional logic both in AMP, about whether it should run, and in the script, about the sensor’s value (the analyze/plan). The following is added at the root of the blueprint:
brooklyn.initializers:
  - type: workflow-sensor
    brooklyn.config:
      sensor:
        name: dashboard.utilization.filesystem_size
        type: compliance-check
      triggers:
        - efs-size-from-aws
        - efs-size-from-server
      condition:
        any:
          - sensor: efs-size-from-aws
            greater-than: 30000000
          - sensor: efs-size-from-server
            greater-than: 30000000
          - sensor: dashboard.utilization.filesystem_size
            check:
              jsonpath: pass
              when: falsy
      steps:
        - step: container cloudsoft/terraform
          env:
            SIZE_PER_AWS: ${entity.sensor['efs-size-from-aws']}
            SIZE_PER_SERVER: ${entity.sensor['efs-size-from-server']}
          command: |
            FAILURES=""
            if [ "$SIZE_PER_AWS" -gt 30000000 ] ; then
              FAILURES=$(echo $FAILURES AWS)
            fi
            if [ "$SIZE_PER_SERVER" -gt 30000000 ] ; then
              FAILURES=$(echo $FAILURES on-box-check)
            fi
            cat <<EOF
            id: efs-size-check
            created: $(date +"%Y%m%d-%H%M")
            EOF
            if [ -z "$FAILURES" ] ; then cat <<EOF
            summary: EFS size within bounds
            pass: true
            EOF
            else cat <<EOF
            summary: EFS size limit exceeded - $FAILURES
            pass: false
            EOF
            fi
            cat <<EOF
            notes: |2
              AWS reports size as: $SIZE_PER_AWS
              On-box reports size as: $SIZE_PER_SERVER
            EOF
        - transform out = ${stdout} | yaml | type compliance-check
        - return ${out}
There are several new things going on here:
- It will be triggered whenever either of the efs-size-* sensors is updated
- It will only run if one of a few conditions is met, using the Predicate DSL to define the conditions as either size exceeding 30 MB or this compliance-check sensor having previously failed (there is no reason to execute if both sizes are below the max and the previous check passed, although it could)
- The jsonpath argument in the condition tells AMP to convert the sensor to JSON and retrieve a specific field within it, so it will look at the pass field from the previous output
- The when: falsy clause tells AMP to match not just an explicit false value but anything which commonly indicates an absence of explicit truth, specifically if it is missing (or null, 0, or "")
- If the conditions are met, two shell environment variables are initialized to the sensors’ values and passed to the script
- The bash script checks whether either size is exceeded and outputs YAML that corresponds to the compliance-check type, either passing or failing; the resulting object is emitted as the dashboard.utilization.filesystem_size sensor (an example of its output is shown after this list)
- Finally, sensors prefixed dashboard. and sensors of the compliance-check or dashboard-info types get special treatment: they are aggregated upwards in the management hierarchy and reported in the AMP dashboard, so any non-compliance anywhere in a deployment is rapidly flagged in the UI and can be used to trigger other events (such as a ServiceNow incident or an alert)
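For example, if only the AWS-reported size exceeded the limit, the script’s stdout – which the transform step parses as YAML and coerces to the compliance-check type – would look like this (sizes illustrative):

id: efs-size-check
created: 20250102-0931
summary: EFS size limit exceeded - AWS
pass: false
notes: |2
  AWS reports size as: 31457280
  On-box reports size as: 29999999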
(Aside: type-coercion, TOSCA, and the scalar-unit.size type.)
Security: scanning servers
The previous example was an “off-box” policy; we can also produce dashboard compliance-check sensors by running policies “on-box”. In this example, we will use the open-source security scan tool lynis to ensure the bastion server is compliant with security best practices, this time running periodically (the monitor) and using a pre-supplied script (the analyze/plan) to generate the value of the sensor that is published (the execute).
This should be added to the existing member.initializers block in the “EFS Bastion Server (grouped)” policy to apply it to discovered bastion servers:
- type: workflow-sensor
  brooklyn.config:
    sensor:
      name: dashboard.security.lynis
      type: compliance-check
    steps:
      - load script = classpath://io/cloudsoft/amp/compliance/lynis/lynis-result-sensor.sh
      - type: ssh
        command: ${script}
      - transform out = ${stdout} | yaml | type compliance-check
      - return ${out}
    period: 5m
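The pre-supplied script is loaded from the classpath, so there is nothing to write here; as a rough, hypothetical sketch of what such a script might do – assuming lynis is installed on the server and treating a hardening index of 70 as the pass threshold; the real lynis-result-sensor.sh may differ:

#!/bin/bash
# Hypothetical sketch only; the real script ships on the AMP classpath.
# Run the audit quietly, then read the hardening index from Lynis's report file.
sudo lynis audit system --quiet --no-colors > /dev/null 2>&1
HARDENING=$(grep '^hardening_index=' /var/log/lynis-report.dat | cut -d= -f2)
cat <<EOF
id: lynis-scan
created: $(date +"%Y%m%d-%H%M")
summary: Lynis hardening index ${HARDENING:-unknown}
pass: $( [ "${HARDENING:-0}" -ge 70 ] && echo true || echo false )
EOF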
Putting it all together
Deploy this as before by unmanaging the previous deployment, copying this blueprint to the composer, setting the bucket_name and demo_name, and optionally putting your public key in the authorized_keys parameter.
Once deployment is complete, let’s explore the various policies. This time, we’ll use the “Dashboard”. The application should show as compliant, with some details.
However, we can trigger violations for each:
- Open a port on the security group in AWS to trigger a drift violation
- Write another large file to EFS to trigger a size violation, such as:
  dd if=/dev/random of=/mnt/shared-file-system/big-file-2.bin bs=4k iflag=fullblock,count_bytes count=40M
- Do something naughty on the bastion server to trigger a Lynis violation, such as:
  sudo chmod 666 /etc/rc.d/rc.local
To see more detail, you can switch to the inspector and use the Management tab to see summary information for the policies and enrichers that are running, and drill down to see individual activity and log messages for everything AMP is doing.
To test the cron-scheduler policy, simply leave AMP running until after 6pm. (If it’s already after 6pm, manually stop the bastion server once you’ve done the above tests and wait until 8.30am to see it start.)
(Aside: faking time.)
Fixing problems, manually and automatically
In some cases, the solution to a detected problem is obvious:
- For the extra port on the security group, simply run the apply effector on that terraform entity in AMP. That violation should then be cleared.
- For the non-compliant server, since it’s effectively stateless, we can just restart that terraform entity, tearing it down and re-creating it. Or we can taint the aws_instance, then apply the terraform, to re-create just that resource (a CLI sketch of this follows below).
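For illustration, the taint-then-apply approach corresponds to the following Terraform CLI steps; the resource address aws_instance.bastion is hypothetical and depends on the template, and in AMP you would drive this through the terraform entity rather than a local shell:

# Mark the (hypothetical) bastion instance for re-creation, then apply
terraform taint aws_instance.bastion
terraform apply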
In other cases, such as when the EFS size is too large, manual remediation might be required.
In all cases, AMP aims to give users the right balance of automation and visibility to enable automatic or manual remediation, alerting, and observability. Key information is surfaced in ways that stakeholders can use without needing to be technical experts in AWS, Terraform, Kubernetes, or ServiceNow. Subject-matter experts who need deeper insight can use AMP to navigate to that information more quickly. And these policies – for compliance, utilization, reporting, anything – can be standardized and re-used.
Tidying up
Now you have learned a rich set of AMP basics. To tear down the deployments, click the stop effector on the root applications in AMP rather than Unmanage, apart from the “S3” application. The “S3” application cannot be destroyed by Terraform until the state files it created are removed. (Terraform leaves near-empty state files in S3 even after a terraform destroy.) You can simply delete the bucket in AWS (manually) or delete all the files in that bucket, and then you will be able to stop the application in AMP.
Or you can use what you’ve learned to add a workflow-effector that uses a container step to run aws s3 rb s3://${BUCKET_NAME} --force, picking a container image with the aws CLI installed and passing the bucket name and AWS credentials as env variables.
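A minimal sketch of such an effector follows; the config keys for the bucket name and credentials are hypothetical placeholders (adapt them to how your blueprint stores these values), and amazon/aws-cli is simply one public image with the aws CLI preinstalled:

brooklyn.initializers:
  - type: workflow-effector
    brooklyn.config:
      name: purge-state-bucket
      steps:
        - step: container amazon/aws-cli
          env:
            BUCKET_NAME: ${entity.config['bucket_name']}               # hypothetical config key
            AWS_ACCESS_KEY_ID: ${entity.config['aws.access-key']}      # hypothetical config key
            AWS_SECRET_ACCESS_KEY: ${entity.config['aws.secret-key']}  # hypothetical config key
          command: aws s3 rb s3://$BUCKET_NAME --force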