Why You Should Monitor Azure (And Other IaaS Providers)
The IaaS market is growing and creating new opportunities for managed service providers (MSPs). “The worldwide public cloud services market is projected to grow 16.5 percent in 2016 to total $204 billion, up from $175 billion in 2015,” says Gartner.
“In growing numbers, businesses are handing their apps over to public cloud providers, such as Amazon Web Services and Microsoft Azure. And while IT pros often view security as a reason to keep apps in house, performance monitoring may be another reason, as organizations struggle to maintain visibility when transactions move off-site. Public cloud monitoring tools, however, can help overcome these challenges,” says Paul Korzeniowski with TechTarget.
Once you know what to monitor, you can alert clients to an issue before they get a phone call from a customer telling them there is a problem.
Some good reasons to monitor Azure services include:
– Monitoring day-to-day usage so you can find trends before they lead to problems
– Making sure that systems remain available and healthy
– Ensuring that service-level agreements are being met
– Monitoring volume of work to be sure performance does not decrease as work increases
Once the system alerts (within seconds) that there is an issue, action can be taken to rectify the problem.
You and your clients need to understand how IaaS providers are impacting the performance and availability of their cloud applications. Some factors that might impact performance are buggy applications, a spike in web traffic, internal database process changes, etc.
In addition to delivering optimal performance, monitoring these metrics will allow them to know when demand decreases so they can cut costs by reducing services they don’t need. Then, when things start to ramp up again, they have the ability to scale up with vendors to meet peak traffic periods.
What You Must Monitor (Or Risk Having Your Clients Exposed)
For the purposes of this guide, we will cover monitoring of four types of resources: VMs (instances), storage, web apps and databases.
Here is a breakdown of the items that you need to monitor continuously and alert on when thresholds are met:
– Monitoring central processing unit (CPU) metrics shows you not only how heavily your processors are utilized, but also how much of that load comes from user applications, measured as the time the processor spends in restricted user mode, which is where applications run.
– While a healthy system can frequently run with high CPU utilization, it is important to alert the client when the CPU is close to saturation.
What to monitor:
– CPU percentage: percentage of CPU utilization (best practice in the cloud is 70%-90%)
– CPU user time: percentage of time the CPU is in user mode
– CPU privilege time: percentage of time the CPU is in kernel mode
– CPU idle time: amount of time CPU has been idle
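As a minimal sketch of how the 70%-90% guideline above could be turned into an alert rule, the function below places a CPU percentage reading relative to that band. The function name and the three labels are hypothetical, chosen only for illustration; the band edges come from the list above and should be tuned per client.

```python
def classify_cpu(cpu_percent: float, low: float = 70.0, high: float = 90.0) -> str:
    """Place a CPU utilization reading relative to the 70%-90% band."""
    if not 0.0 <= cpu_percent <= 100.0:
        raise ValueError("cpu_percent must be between 0 and 100")
    if cpu_percent < low:
        # Below the band: capacity is being paid for but not used.
        return "underused"
    if cpu_percent <= high:
        # Inside the band: the guideline range for cloud workloads.
        return "in_band"
    # Above the band: close to saturation, worth alerting the client.
    return "near_saturation"


if __name__ == "__main__":
    for reading in (35.0, 82.5, 96.0):
        print(reading, classify_cpu(reading))
```

A reading that stays "underused" for long periods may be as useful to report as one that is "near_saturation", since it suggests the client could scale down.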
– Monitoring how memory is being used on VMs will help you find bottlenecks in performance and potential low memory problems before they occur.
What to monitor:
– Memory available: free memory available to the system
– Memory committed: amount of virtual memory the system has committed to running processes
– Memory percentage: percentage of memory being used (best practice under 95%)
– Memory percent available: percentage of memory available
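The memory metrics above are related by simple arithmetic, which the following sketch illustrates. It assumes you can read available and total memory in bytes from whatever agent you use; the function names are hypothetical, and the 95% limit is the guideline from the list above.

```python
def memory_percent_used(available_bytes: int, total_bytes: int) -> float:
    """Derive the memory percentage from available and total bytes."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    if not 0 <= available_bytes <= total_bytes:
        raise ValueError("available_bytes must be between 0 and total_bytes")
    return 100.0 * (total_bytes - available_bytes) / total_bytes


def memory_alert(available_bytes: int, total_bytes: int, limit: float = 95.0) -> bool:
    """Return True when memory use has reached the limit (95% by default)."""
    return memory_percent_used(available_bytes, total_bytes) >= limit


if __name__ == "__main__":
    gib = 1024 ** 3
    print(memory_percent_used(2 * gib, 8 * gib))  # 6 of 8 GiB in use
    print(memory_alert(gib // 4, 8 * gib))        # only 0.25 GiB left
```

Memory percent available is simply 100 minus this value, so one of the two percentages is usually enough to alert on.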
By default, Azure shows the network traffic in and out of a virtual machine. Depending on your operating system, you may have network metrics in TCP segments (Windows) or bytes (Linux) per second.
Monitoring sent bytes (Linux) or TCP segments (Windows) can alert you that a network is reaching saturation. This will allow your client to investigate a potential performance issue.
Monitoring received bytes (Linux) or TCP segments (Windows) is also useful: if they plummet, it can be a sign that an application is overloaded.
What to monitor:
– Network in: the bytes or TCP segments received per second
– Network out: the bytes or TCP segments sent per second
– TCP connections established: the current number of TCP connections to the VM
– TCP connections failed: the number of connections that are currently failing to connect to the VM
– Web current connections: the number of external connections to the web interface for the VM (best practice is under 10)
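To make the "plummet" check described above concrete, here is a rough sketch. It assumes your collector exposes cumulative byte or segment counters that you sample at a known interval, and it treats a drop below half of the recent average as a plummet. Both function names and the 50% fraction are assumptions for illustration, not Azure settings.

```python
from typing import Sequence


def per_second_rate(prev_count: int, curr_count: int, seconds: float) -> float:
    """Turn two cumulative counter readings into a per-second rate."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    if curr_count < prev_count:
        raise ValueError("counter went backwards (reset or rollover)")
    return (curr_count - prev_count) / seconds


def has_plummeted(current_rate: float, baseline_rates: Sequence[float],
                  drop_fraction: float = 0.5) -> bool:
    """True when the current rate is below drop_fraction of the baseline mean."""
    if not baseline_rates:
        return False  # no history yet, so nothing to compare against
    baseline = sum(baseline_rates) / len(baseline_rates)
    return current_rate < baseline * drop_fraction


if __name__ == "__main__":
    rate = per_second_rate(1_000, 4_000, 60)
    print(rate, has_plummeted(rate, [100.0, 120.0, 80.0]))
```

Comparing against a recent baseline rather than a fixed number keeps the rule usable for both quiet and busy VMs.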
Disk I/O
By monitoring disk I/O, you can get a better understanding of how your applications are affecting hardware.
When you monitor the disk read bytes, it helps you grasp the dependence of your application on the disk storage. If you find that the application has too many reads per disk, it can indicate that you need to add a caching layer.
Monitoring the disk write bytes can identify I/O bottlenecks. If you consistently see bottlenecks with a VM, it may be time to upgrade your VM so you can increase the maximum number of input/output operations per second (IOPS).
What to monitor:
– Disk read bytes: bytes read from disk per second
– Disk write bytes: bytes written to disk per second
– Swap percent available: percentage of space available for the disk swap file (best practice is over 5%)
– Disk queue length: requests that are waiting for processing by the disk (best practice is under 15)
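Because a single spike in disk queue length is rarely a problem, one option is to alert only when the limit is breached for several samples in a row. The sketch below shows that idea with the queue length guideline of 15 from the list above; the function name and the choice of three consecutive samples are assumptions.

```python
from typing import Sequence


def sustained_breach(samples: Sequence[float], limit: float = 15.0,
                     consecutive: int = 3) -> bool:
    """True when the most recent `consecutive` samples are all at or above limit."""
    if consecutive <= 0:
        raise ValueError("consecutive must be positive")
    if len(samples) < consecutive:
        return False  # not enough history to call it sustained
    return all(value >= limit for value in samples[-consecutive:])


if __name__ == "__main__":
    print(sustained_breach([2, 20, 3, 4]))    # one spike only
    print(sustained_breach([2, 16, 18, 21]))  # three high samples in a row
```

The same helper could be applied to disk write bytes to separate a consistent I/O bottleneck, which may justify a larger VM, from a momentary burst.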
Monitoring your web apps can allow you to alert clients if their website goes down or does not respond as intended. You can also use it to monitor the speed of the web app and how quickly it responds to a request.
With proper monitoring, your client will likely be the first one to know when the website is having a problem. It won’t be an angry customer that calls to say they’ve been trying for an hour (okay, maybe they’re exaggerating) and can’t get connected to the site. Instead, the client will know their site is having issues long before the users know, and hopefully they can fix it before getting any complaints.
By monitoring the number of HTTP requests received within a specified period, you can compare the figures over time to understand how your application behaves as the load goes up and down.
What to monitor:
– CPU time: amount of time the CPU spends processing requests (best practice under 90%)
– Data in: amount of data being read by the application
– Data out: amount of data being sent by the application
– HTTP server errors: number of web server errors (keep this under 10 within 5 minutes)
– Requests: the number of requests that have arrived for the web application (we like to monitor this to see web apps that get no requests, so that we know they may not be used)
Important note: If you are using Azure quotas for memory, CPU time, or data out, Azure will stop your web app when a quota is exceeded until the next quota interval begins.
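A rule such as "under 10 HTTP server errors within 5 minutes" is a sliding-window count. The class below is a minimal sketch of one way to evaluate it, assuming you feed it the timestamp (in seconds) of each server error as it is observed. The class name ErrorWindow and its method are hypothetical.

```python
from collections import deque


class ErrorWindow:
    """Count server errors inside a sliding time window."""

    def __init__(self, window_seconds: float = 300.0, limit: int = 10) -> None:
        self.window_seconds = window_seconds
        self.limit = limit
        self._timestamps = deque()

    def record(self, timestamp: float) -> bool:
        """Record one error; return True once the window holds `limit` or more."""
        self._timestamps.append(timestamp)
        cutoff = timestamp - self.window_seconds
        # Discard errors that have aged out of the window.
        while self._timestamps and self._timestamps[0] <= cutoff:
            self._timestamps.popleft()
        return len(self._timestamps) >= self.limit


if __name__ == "__main__":
    window = ErrorWindow()
    # Ten errors spaced ten seconds apart: only the tenth trips the alert.
    print([window.record(t) for t in range(0, 100, 10)])
```

The same structure works for the requests metric in reverse: a window that stays empty for days points to a web app that may no longer be in use.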
Will your client know before lack of storage becomes a problem? With potentially thousands of people using an IaaS application at once, it is easy for databases to quickly spiral out of control. Your client needs to know about the issue before it’s too late.
Monitoring storage makes it easier to track a client’s queues, tables, and blobs. This can be helpful in allowing them to find storage bottlenecks before they become a problem.
Applications use queues to communicate with each other. If the queue availability goes down, it can slow down the entire application.
What to monitor:
– Queue availability: the availability of the application queue
– Table availability: if a table becomes unavailable, it can potentially delay the entire application
– Blob availability: the availability of a blob (unstructured text or binary data)
– Queue success percentage: the percentage of successful requests to queue (this should be greater than 90%)
– Blob success percentage: the percentage of successful requests to blob (this should be greater than 90%)
– Table success percentage: the percentage of successful requests to table (this should be greater than 90%)
– Queue total requests: the total number of requests queued
– Blob total requests: the total number of requests for unstructured binary blob data
– Table total requests: the total number of table requests
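The success percentages above can be derived from successful and total request counts. The sketch below assumes you have those two counts per storage service for a reporting interval and flags any service that is not above the 90% target. The function names and the sample numbers are illustrative only.

```python
from typing import Dict, List, Tuple


def success_percentage(successful: int, total: int) -> float:
    """Percentage of requests that succeeded during an interval."""
    if successful < 0 or total < 0 or successful > total:
        raise ValueError("counts must satisfy 0 <= successful <= total")
    if total == 0:
        return 100.0  # no requests were made, so none failed
    return 100.0 * successful / total


def services_below_target(counts: Dict[str, Tuple[int, int]],
                          target: float = 90.0) -> List[str]:
    """Names of services whose success percentage is not above the target."""
    return sorted(
        name
        for name, (successful, total) in counts.items()
        if success_percentage(successful, total) <= target
    )


if __name__ == "__main__":
    sample = {"queue": (990, 1000), "blob": (850, 1000), "table": (0, 0)}
    print(services_below_target(sample))
```

Treating an interval with zero requests as 100% avoids false alarms on idle tables or queues; whether that is the right choice depends on how the client uses the service.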
Database bottlenecks can quickly become performance problems. Clients have the ability to calibrate performance levels and govern them so that needed resources are available to your database workload. However, it’s your job as an MSP to make sure they are properly monitored.
Important: If a threshold is met, the database will continue to handle the workload, but you are likely to see increased latencies in a variety of areas as a result. Sometimes, this can make the root of the problem difficult to troubleshoot. While these bottlenecks may not cause actual errors, they can slow down your workload and cause queries to time out.
Knowing your client’s database performance levels can also help you find out if you can fit them into a lower Azure performance level and still maintain CPU, read, and write levels.
What to monitor:
– Database CPU utilization: percentage of CPU utilization
– Overloaded database CPU can cause slower performance and delay queries.
– Successful connections: number of successful connections to the database
– If the number of successful connections dramatically falls, this can indicate people are not able to get to your application.
– If the number grows too quickly it can indicate a potential overload.
– Sessions percentage: the percentage of concurrent sessions for the database (this should be under 90%).
– Worker percentage: the percentage of concurrent requests for the database (this should be under 90%).
– DTU used percentage: database throughput unit (DTU) describes “the relative capacity of a performance level of basic, standard, and premium databases. DTUs are based on a blended measure of CPU, memory, reads, and writes.” (this should be kept under 90%)
– Deadlock: a situation in which two or more processes each hold a lock that the other needs, so none of them can proceed
– Failed connections: the number of connections that are failing to connect to a database (keep this under 2 in a 5-minute interval)
– Log I/O percentage: the I/O load that logging places on the system, as a percentage (keep under 90%)
– Database size percentage: “This metric measures the ratio between used database space and total available space, taking into account the database file’s current used space, current size, maximum size, and free space for each drive where database files are located. The actual available space per drive is calculated first, followed by the sum of the total used space from drives and the total available space from drives. These values are then used to calculate the global percentage,” says Blaž Dakskobler with SQL Monitor Metrics. (keep under 90%)
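The database thresholds above lend themselves to a small rule table. The sketch below checks one sample of metrics against the 90% guidelines and the failed-connection limit. The metric keys (such as "dtu_percent" and "failed_connections_5m") are hypothetical names, not Azure metric identifiers, and would need to be mapped to whatever your monitoring tool reports.

```python
from typing import Dict, List

# Percentage metrics that should stay under 90%, per the list above.
PERCENT_LIMITS: Dict[str, float] = {
    "sessions_percent": 90.0,
    "workers_percent": 90.0,
    "dtu_percent": 90.0,
    "log_io_percent": 90.0,
    "size_percent": 90.0,
}

# Failed connections should stay under 2 in a 5-minute interval.
FAILED_CONNECTION_LIMIT = 2


def database_breaches(sample: Dict[str, float]) -> List[str]:
    """Return the names of metrics in `sample` that have reached their limit."""
    breaches = [
        name
        for name, limit in PERCENT_LIMITS.items()
        if sample.get(name, 0.0) >= limit
    ]
    if sample.get("failed_connections_5m", 0) >= FAILED_CONNECTION_LIMIT:
        breaches.append("failed_connections_5m")
    return breaches


if __name__ == "__main__":
    reading = {
        "sessions_percent": 40.0,
        "workers_percent": 95.0,
        "dtu_percent": 90.0,
        "log_io_percent": 10.0,
        "size_percent": 50.0,
        "failed_connections_5m": 1,
    }
    print(database_breaches(reading))
```

Keeping the limits in one table makes it straightforward to apply different values per client, and a database that never comes close to any limit is a candidate for a lower performance level.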
How To Monitor These Metrics
Now that you know which metrics to monitor, the question becomes how you should monitor them. Sure, you can go to Microsoft Azure and choose to monitor a single instance. However, what happens when a customer has multiple instances, or when you have multiple customers, departments, or projects to monitor? It gets complicated.
Cloud management platforms like Unigma make it easy to monitor Azure performance, right out of the box. Check out this article if you want to learn about how to use Monitoring Templates to make monitoring easy.