Best Practices for Monitoring AWS
Tuesday, September 20th 2016 by Kirill Bensonoff
If you’re an MSP, your clients depend on you to monitor their cloud services and let them know whether they are getting the performance they’re paying for. Maybe you work in IT, and your “client” is your company. Either way, you should be monitoring their Amazon Web Services (AWS) resources 24 x 7 x 365 to ensure they’re notified in case of a problem.

Before you begin monitoring, there are a few questions they need to answer:

  • Do they want to monitor multiple instances and resources?

  • Which resources are critical to their business?

  • Who should be informed when there is a problem?

Why You Should Monitor Cloud Services
Save your client/company money (aka rightsizing)

Cloud platforms are scalable. As your client’s needs go up and down over the course of a year, it’s easy for them to buy more resources during peak times which they may not need later. By rightsizing, you can reduce costs for your clients by removing services they don’t need when business is down.

Improve AWS performance

By monitoring AWS performance, you can make the most out of their instances. In many cases, you can alert the client of a threshold being met, allowing them to make performance improvements before it becomes an issue.

Improved service availability

When you can alert your clients to issues as they occur, it allows them to fix service issues faster. Many times, they can be fixed before the customer is even aware of a problem.

What Should I Monitor?

Monitoring is extremely important, so we put together a list of AWS metrics that you should be aware of when creating your monitoring plans. The right threshold for many of these metrics depends on the client’s workload; for some of them, however, we recommend a threshold based on our experience.

1. Virtual machines

Every virtual machine (or instance) should be constantly monitored to make sure performance is within the proper threshold.

What you need to monitor:

Instance CPU utilization

Percentage of processor capacity currently in use. We recommend that CPU utilization not stay above 90% for extended periods of time. If it’s consistently very low, we recommend resizing the instance; you may be overpaying.

Disk write bytes

Number of bytes written to the disk per second

Disk read bytes

Number of bytes read from the disk per second

Network in

Amount of data (counter) that flows into a network interface

Network out

Amount of data (counter) that flows out of a network interface

Disk read ops

Total number of completed read operations from all instance store volumes for the instance

Disk write ops

Total number of completed write operations to all instance store volumes for the instance.
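The CPU guidance above (avoid staying above 90% for extended periods, and consider resizing when utilization is consistently low) can be sketched as a small check over a series of samples. This is a minimal sketch rather than a definitive implementation: the function name `classify_cpu`, the 10% low-utilization cutoff, and the three-sample definition of "extended" are assumptions made for illustration; only the 90% ceiling comes from the recommendation above.

```python
from typing import Sequence

HIGH_CPU_PERCENT = 90.0   # ceiling recommended in the text above
LOW_CPU_PERCENT = 10.0    # assumption: cutoff for "too low", tune per workload
SUSTAINED_SAMPLES = 3     # assumption: consecutive samples that count as "extended"


def classify_cpu(samples: Sequence[float]) -> str:
    """Classify CPU utilization samples (percent, oldest first)."""
    if not samples:
        return "no-data"

    # Look for a run of consecutive samples at or above the ceiling.
    run = 0
    for value in samples:
        run = run + 1 if value >= HIGH_CPU_PERCENT else 0
        if run >= SUSTAINED_SAMPLES:
            return "alert-high-cpu"

    # An instance that never leaves the low band may be oversized.
    if max(samples) < LOW_CPU_PERCENT:
        return "rightsizing-candidate"

    return "ok"


print(classify_cpu([45.0, 92.5, 95.1, 97.3]))  # sustained high CPU
print(classify_cpu([3.2, 4.1, 2.8, 5.0]))      # consistently idle
```

A single spike above 90% does not trigger an alert here; only a sustained run does, which matches the "extended periods" wording.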

2. Relational Database Service (RDS)

While clients have the ability to modify database workloads, it is the MSP’s job to monitor the databases to make sure they are available and performing properly. If problems arise, the MSP will alert the client.

Understanding your client’s bottlenecks will allow you to right-size their database usage to meet their CPU, read, and write levels.

What you need to monitor:

RDS CPU utilization

Percentage of processor capacity currently in use on the database instance. We recommend this value stay under 90%.

Database connections

Total number of database connections currently being used

Disk queue depth

Number of read and write requests that are waiting for disk access

Freeable memory

Amount of available RAM

Free storage space

Total amount of available storage space, in bytes. We recommend keeping this over 100MB.

Replica Lag

The amount of time a Read Replica DB instance lags behind the source DB instance. We recommend keeping this under 5 seconds.

Swap usage

Used swap space in the database

Read IOPs

Average number of disk read operations per second

Write IOPs

Average number of disk write operations per second

Read Latency

Time required for each disk read (in seconds)

Write Latency

Time required for each disk write (in seconds)

Read Throughput

Average bytes per second read from disk

Write Throughput

Average bytes per second written to disk

Network receive throughput

Total amount of network throughput being received on the interface

Network transmit throughput

Total amount of network throughput being transmitted on the interface
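The three RDS thresholds recommended above (CPU under 90%, free storage over 100MB, replica lag under 5 seconds) could be captured as alarm settings along the following lines. This is a sketch under stated assumptions: the helper name `rds_alarm_definitions`, the 5-minute period, the three evaluation periods, and the reading of 100MB as 100 × 1024 × 1024 bytes are all illustrative choices, not values from the article.

```python
from typing import Any, Dict, List


def rds_alarm_definitions(db_instance_id: str, topic_arn: str) -> List[Dict[str, Any]]:
    """Build alarm settings for the RDS thresholds recommended above."""
    # (metric name, comparison that means "bad", threshold)
    rules = [
        ("CPUUtilization", "GreaterThanOrEqualToThreshold", 90.0),
        # FreeStorageSpace is reported in bytes; 100MB read as binary megabytes (assumption).
        ("FreeStorageSpace", "LessThanOrEqualToThreshold", 100 * 1024 * 1024),
        ("ReplicaLag", "GreaterThanOrEqualToThreshold", 5.0),
    ]
    return [
        {
            "AlarmName": f"{db_instance_id}-{metric}",
            "Namespace": "AWS/RDS",
            "MetricName": metric,
            "Dimensions": [{"Name": "DBInstanceIdentifier", "Value": db_instance_id}],
            "Statistic": "Average",
            "Period": 300,            # assumption: 5-minute periods
            "EvaluationPeriods": 3,   # assumption: three periods in a row
            "Threshold": float(threshold),
            "ComparisonOperator": comparison,
            "AlarmActions": [topic_arn],  # who gets notified, e.g. an SNS topic
        }
        for metric, comparison, threshold in rules
    ]


# Hypothetical identifiers, for illustration only.
for alarm in rds_alarm_definitions("client-db", "arn:aws:sns:us-east-1:123456789012:alerts"):
    print(alarm["AlarmName"], alarm["ComparisonOperator"], alarm["Threshold"])
```

The dictionary keys follow the parameter names of the CloudWatch `PutMetricAlarm` API, so if you use boto3, each dictionary should be usable as keyword arguments to the CloudWatch client's `put_metric_alarm` call. The block itself makes no AWS calls.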

3. Storage

Your client or company needs to know about disk storage issues before it’s too late. By monitoring these metrics, you can alert them before the problem occurs. If the client is paying for too much space, you have the opportunity to save them money.

What you need to monitor:

Volume read bytes

Total number of bytes read per second

Volume write bytes

Total number of bytes written per second

Volume read ops

Number of read operations for a specific period of time

Volume write ops

Number of write operations for a specific period of time

Volume queue length

Number of read and write operations that are waiting to be processed. We recommend setting the threshold on this metric at 3 or below.

Volume idle time

Number of seconds during which there were no read or write operations
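Because the volume operation counts and idle time are reported per period, it often helps to convert them into rates before comparing volumes. The sketch below shows one way to do that arithmetic; the function name `ebs_summary`, the 300-second default period, and the choice to flag a queue length strictly above 3 are assumptions for illustration.

```python
from typing import Dict, Union

QUEUE_LENGTH_THRESHOLD = 3.0  # threshold recommended in the text above


def ebs_summary(
    read_ops: float,
    write_ops: float,
    idle_seconds: float,
    queue_length: float,
    period_seconds: int = 300,  # assumption: metrics collected over 5 minutes
) -> Dict[str, Union[float, bool]]:
    """Turn per-period volume counters into rates and a queue alert flag."""
    if period_seconds <= 0:
        raise ValueError("period_seconds must be positive")

    return {
        # Operations counted over the period, divided by its length, gives IOPS.
        "iops": (read_ops + write_ops) / period_seconds,
        # Share of the period with no read or write activity.
        "idle_percent": 100.0 * idle_seconds / period_seconds,
        # Assumption: alert only when the queue is above the threshold.
        "queue_alert": queue_length > QUEUE_LENGTH_THRESHOLD,
    }


print(ebs_summary(read_ops=45000, write_ops=15000, idle_seconds=30, queue_length=4.2))
```

A volume that is idle for most of every period while its provisioned size or performance is high is a candidate for the cost savings mentioned above.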

4. Load balancers

Load balancer metrics allow you to analyze your traffic patterns and troubleshoot issues with back-end applications. Using load balancers, you can improve application fault tolerance as well as performance by sending traffic only to instances that are healthy and available.

What you need to monitor:

Healthy hosts

Number of hosts that are considered healthy

Unhealthy hosts

Number of hosts that are considered unhealthy. We recommend keeping the threshold for this at 0.


Latency

This non-zero value is used to determine which requests are taking longer than normal to process

Load balancer 4XX errors

4XX error codes from each load balancer that are created when requests are incomplete or malformed

Load balancer 5XX errors

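5XX error codes generated by the load balancer itself, typically when no healthy back-end instances are registered or when the request rate exceeds the capacity of the instances or the load balancer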

Request count

Total number of requests received

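Backend 2XX, 3XX, 4XX, 5XX responses

Total number of HTTP response codes returned by the back-end instances, counted per class. Only the 4XX and 5XX classes indicate errors.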

Backend connection errors

Number of connections between the load balancer and the back-end instances that were not successfully established
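As a rough illustration of how these load balancer metrics combine, the sketch below flags any unhealthy host (the threshold of 0 recommended above) and computes the share of requests that ended in a back-end 5XX response. The function name `load_balancer_issues` and the 1% error-rate cutoff are assumptions for illustration, not recommendations from this article.

```python
from typing import List


def load_balancer_issues(
    unhealthy_hosts: int,
    request_count: int,
    backend_5xx: int,
    max_error_rate: float = 0.01,  # assumption: alert above 1% back-end 5XX
) -> List[str]:
    """Return human-readable issues for one load balancer and one period."""
    issues: List[str] = []

    # The text recommends a threshold of 0 unhealthy hosts.
    if unhealthy_hosts > 0:
        issues.append(f"{unhealthy_hosts} unhealthy host(s)")

    # Skip the rate when there was no traffic, to avoid dividing by zero.
    if request_count > 0:
        rate = backend_5xx / request_count
        if rate > max_error_rate:
            issues.append(f"backend 5XX rate {rate:.1%}")

    return issues


print(load_balancer_issues(unhealthy_hosts=1, request_count=2000, backend_5xx=50))
print(load_balancer_issues(unhealthy_hosts=0, request_count=2000, backend_5xx=4))
```

Reporting a rate rather than a raw error count keeps the alert meaningful as traffic grows or shrinks over the year.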

Conclusion: How You Should Monitor These Metrics

Using a cloud management platform like Unigma, you can easily monitor AWS performance out of the box. Learn more about how you can use Unigma to monitor cloud services for your clients today.