App Servers’ SRE Golden Signals
Part of the How to Monitor the SRE Golden Signals Series
The Application Servers are usually where all the main work of an application is done, so getting good monitoring signals from them is darned important, though the methods diverge widely based on the language and server.
In a perfect world, you would also embed good observability metrics in your code on the App Servers, especially for Errors and Latency, so you can skip the hardest parts of the processes below. In fact, it seems this is the only way in Go, Python, and Node.js.
Let’s tour our App Servers
PHP
PHP may not be as fashionable as it once was, but it still powers the majority of Internet sites, especially in E-Commerce (Magento), WordPress, Drupal, and a lot of other large systems based on Symfony, Laravel, etc. Thus ignore it we cannot.
PHP runs either as mod_php inside Apache or, more commonly, as PHP-FPM, a standalone process pool behind Apache or Nginx. Both are popular, though mod_php scales poorly.
For Apache-based mod_php, things are simple, and bad. There are no good external signals, so all you have is the Apache Status Page and Logs as covered in the Web Server section. mod_php’s lack of metrics is yet another reason to use PHP-FPM.
For PHP-FPM we need to enable the status page (via pm.status_path in the pool configuration), much as we did for Apache & Nginx. You can get it in JSON, XML, & HTML formats. See an Nginx guide.
You should also enable the PHP-FPM access, error, and slow logs, which can be useful for troubleshooting, and allow the Status Page to show a Slow Log event counter. The error log is set in the main PHP-FPM config file.
The access log is rarely used, but does exist and we need it, so turn it on (access.log) AND set the format (access.format) to include %d, the time taken to serve each request (this is done in the POOL configuration, usually www.conf, NOT in the main php-fpm.conf). The slow log (slowlog and request_slowlog_timeout) is also set in this file. See this nice how-to page.
Finally, you should properly configure your PHP error logs. By default, they usually go into the web server’s error logs, where they are mixed in with 404s and other web errors, making them hard to analyze or aggregate. It’s better to add an error_log override setting (php_admin_value[error_log]) in your PHP-FPM pool file (usually www.conf).
For PHP-FPM, our Signals are:
- Rate — Sadly, there is no direct way to get this other than read the access logs and aggregate them into requests per second.
- Errors — This depends a lot on where your PHP is logging, if at all. The PHP-FPM error logs are not that useful as they are mostly about slow request tracing, and are very noisy. So you should enable and monitor the actual php error logs, aggregating them into errors per second.
- Latency — From the PHP-FPM access log we can get the response time in milliseconds, just like for Apache/Nginx. As with Apache/Nginx Latency, it’s usually easier to get from the Load Balancers if your monitoring system can do this (though this will include non-FPM requests, such as static JS/CSS files). If you can’t get the logs, you can also monitor “Slow Requests” on the status page to get a count of slow responses, which you can delta into Slow Requests/sec.
- Saturation — Monitor the “Listen Queue”, as this will be non-zero when there are no more FPM processes available and the FPM system is saturated. Note this can be due to CPU, RAM/Swapping, a slow DB, a slow external service, etc. You can also monitor “Max Children Reached”, which will tell you how many times you’ve hit saturation on FPM, though this can’t easily be converted to a rate (as you could be saturated all day); any change in this counter during a sampling period indicates saturation during that period.
- Utilization — We’d like to monitor the in-use FPM processes (“Active Processes”) vs. the configured maximum (pm.max_children), though the latter is hard to get (parse the config), and you may have to hard-code it. See the status-page sketch below.
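As a rough illustration, here is a minimal sketch (Python, using the requests package; the status URL, sampling interval, and hard-coded max_children value are assumptions you would adapt to your own pool config) that polls the JSON status page and derives the Saturation and Utilization signals above, plus the Slow Requests delta:

```python
import time
import requests  # assumes the requests package; any HTTP client works

STATUS_URL = "http://localhost/status?json"   # assumed URL; match your pm.status_path
MAX_CHILDREN = 50                             # hard-coded from your pool's pm.max_children

def sample():
    """Fetch one sample of the PHP-FPM status page as a dict."""
    return requests.get(STATUS_URL, timeout=5).json()

prev = sample()
while True:
    time.sleep(60)                            # sampling interval
    cur = sample()

    # Saturation: a non-zero listen queue means no FPM process was free.
    listen_queue = cur["listen queue"]

    # Saturation (secondary): any increase in "max children reached" during
    # the interval means we hit the pm.max_children limit at some point.
    hit_limit = cur["max children reached"] - prev["max children reached"]

    # Slow responses: delta of the slow-request counter (needs the slow log enabled).
    slow_per_interval = cur["slow requests"] - prev["slow requests"]

    # Utilization: busy workers vs. the configured maximum.
    utilization = cur["active processes"] / MAX_CHILDREN

    print(f"listen_queue={listen_queue} max_children_hits={hit_limit} "
          f"slow_requests={slow_per_interval} utilization={utilization:.0%}")
    prev = cur
```

Rate and Errors would still come from aggregating the access and PHP error logs as described above.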
Java
Java powers a lot of the heavier-duty websites, larger e-commerce, and complex-logic sites, often for enterprises. It runs in a variety of servers and environments, though mostly in Tomcat, so we’ll focus on that here. Other Java containers should have data similar to Tomcat, either via JMX or their logs.
Like all app servers, the Golden Signals may be better monitored upstream a bit in the web server, especially as there is often a dedicated Nginx server in front of each Tomcat instance on the same host, though this is becoming less common. Load Balancers directly in front of Tomcat (or its fronting Nginx) can also provide Rate & Latency metrics.
If we’re going to monitor Tomcat directly, we first need to set it up for monitoring, which means making sure we have good access/error logging and turning on JMX support.
JMX provides our data, so we have to enable it at the JVM level and restart Tomcat. When you turn on JMX, be sure it’s read-only and secured, limiting access to the local machine with a read-only user. See a good guide, the Docs, and a good monitoring PDF.
In recent versions, you can also get JMX data over HTTP via the Tomcat Manager, though be sure to enable proper security on the Manager & its interface.
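If you do expose JMX through the Manager, here is a rough sketch (Python; the URL, credentials, and line-matching are assumptions, and the exact text output varies by Tomcat version) of querying the Manager’s JMXProxy servlet for the GlobalRequestProcessor counters we use below:

```python
import requests  # any HTTP client works

# Placeholders: adjust host/port, and use a read-only user with the manager-jmx role.
URL = "http://localhost:8080/manager/jmxproxy/"
AUTH = ("monitor", "secret")

# Query all GlobalRequestProcessor MBeans (one per connector).
params = {"qry": "Catalina:type=GlobalRequestProcessor,name=*"}
resp = requests.get(URL, params=params, auth=AUTH, timeout=5)
resp.raise_for_status()

# The JMXProxy servlet returns plain text: a "Name:" line per MBean,
# followed by "attribute: value" lines; keep only the counters we care about.
for line in resp.text.splitlines():
    line = line.strip()
    if line.startswith(("Name:", "requestCount:", "errorCount:", "processingTime:")):
        print(line)
```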
Then our Tomcat Signals are:
- Rate — Via JMX, you can get GlobalRequestProcessor’s requestCount and do delta processing to get Requests/Second. This counts requests on all connectors, not just HTTP, but hopefully HTTP is the bulk of them, and it seems you can specify a Processor name as a filter, which should be your configured HTTP Processor name.
- Errors — Via JMX, you can get GlobalRequestProcessor’s errorCount and do delta processing to get Errors/Second. This includes non-HTTP requests unless you can filter it by processor.
- Latency — Via JMX, you can get GlobalRequestProcessor’s processingTime, but this is the total processing time since restart; divided by requestCount it only gives the long-term average response time, which is not very useful. Ideally your monitoring system or scripts can store both these values each time you sample, then take the differences and divide them (see the sketch after this list); it’s not ideal, but it’s all we have.
- Saturation — If ThreadPool’s currentThreadsBusy = maxThreads then you will queue and thus be saturated, so monitor these two.
- Utilization — Use JMX to get (currentThreadsBusy / maxThreads) to see what percentage of your threads are in use.
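To make the delta processing in the Rate, Errors, and Latency items concrete, here is a small sketch (Python; how you fetch the counters, e.g. via a JMX agent or the Manager sketch above, is deliberately left out) of turning two samples into per-interval signals:

```python
def derive_signals(prev: dict, cur: dict, interval_s: float) -> dict:
    """Turn two samples of GlobalRequestProcessor counters into per-interval signals.

    Each sample is a dict with requestCount, errorCount and processingTime,
    where processingTime is cumulative milliseconds since restart.
    """
    d_requests = cur["requestCount"] - prev["requestCount"]
    d_errors = cur["errorCount"] - prev["errorCount"]
    d_time_ms = cur["processingTime"] - prev["processingTime"]

    return {
        "rate_rps": d_requests / interval_s,        # Rate
        "errors_eps": d_errors / interval_s,        # Errors
        # Latency: average ms per request over this interval only,
        # avoiding the not-very-useful since-restart average.
        "avg_latency_ms": d_time_ms / d_requests if d_requests else 0.0,
    }

# Example with made-up numbers: 300 requests and 4500 ms of processing in 60 s.
prev = {"requestCount": 1000, "errorCount": 3, "processingTime": 20000}
cur = {"requestCount": 1300, "errorCount": 5, "processingTime": 24500}
print(derive_signals(prev, cur, 60))  # rate 5/s, errors ~0.03/s, latency 15 ms
```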
Python
Python is increasingly popular for web and API applications, and runs under a variety of App Servers, starting with the popular Django framework.
Now we also see Flask, Gunicorn, and others becoming popular, but unfortunately all of these are very hard to monitor — it seems most people monitor by embedding metrics in the code and/or using a special package / APM tool for this.
For example, Django and Gunicorn both have statsd modules/integrations that can emit useful metrics, which you can collect if you run statsd; several services such as DataDog can ingest these for you. But you still have to code the metrics yourself.
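For instance, a minimal sketch of coding such metrics with the statsd Python client (assuming the statsd PyPI package and a statsd daemon on localhost:8125; the metric names and the handler are placeholders, not a real framework hook) might look like:

```python
import time
import statsd  # the "statsd" PyPI package; DataDog users would use its own client instead

client = statsd.StatsClient("localhost", 8125, prefix="myapp")

def do_real_work(request):
    # Placeholder for your actual view/handler logic.
    return "OK"

def handle_request(request):
    """Wrap a request handler to emit Rate, Errors, and Latency metrics."""
    start = time.monotonic()
    try:
        return do_real_work(request)
    except Exception:
        client.incr("errors")                 # Errors: one count per failed request
        raise
    finally:
        client.incr("requests")               # Rate: statsd aggregates this into req/s
        elapsed_ms = (time.monotonic() - start) * 1000
        client.timing("latency", elapsed_ms)  # Latency: timer metric in milliseconds
```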
As with Node.js, there may be libraries or frameworks that can provide these metrics on an API, logging, or other basis. For example, Django has a Prometheus module.
If you don’t use one of these, you should find ways to embed the Golden Signals in your code (a minimal sketch follows this list), such as:
- Rate — This is hard to get in code, as most things run on a per-request basis. The easiest way is probably to keep a global request counter and emit that directly, though this means shared memory and global locks.
- Errors — Similar to getting the Rate, probably using a global counter.
- Latency — The easiest to code, as you can capture the start and end time of the request and emit a duration metric.
- Saturation — This is hard to get in code, unless there are data available from the dispatcher on available workers and queues.
- Utilization — Same as Saturation.
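Here is a rough sketch of the global-counter approach above as a WSGI middleware (framework-agnostic, so it could wrap a Django or Flask app; the reporting side is only hinted at in comments, and the class and names are illustrative):

```python
import threading
import time

class GoldenSignalsMiddleware:
    """WSGI middleware keeping global counters (the shared-memory/lock approach above)."""

    def __init__(self, app):
        self.app = app
        self.lock = threading.Lock()
        self.requests = 0        # Rate: delta this counter into requests/sec
        self.errors = 0          # Errors: delta into errors/sec
        self.latencies_ms = []   # Latency: drain and aggregate (e.g. p95) each interval

    def __call__(self, environ, start_response):
        start = time.monotonic()
        try:
            # Note: this times until the response iterable is produced, not fully
            # streamed, which is good enough for a sketch.
            return self.app(environ, start_response)
        except Exception:
            with self.lock:
                self.errors += 1
            raise
        finally:
            elapsed_ms = (time.monotonic() - start) * 1000
            with self.lock:
                self.requests += 1
                self.latencies_ms.append(elapsed_ms)

# Usage (hypothetical): wrap your existing WSGI app, then have a /metrics view or a
# background thread read and reset the counters on each sampling interval. Under a
# pre-fork server like Gunicorn each worker holds its own counters, so you would
# need true shared memory or per-worker aggregation, as noted above.
# application = GoldenSignalsMiddleware(get_wsgi_application())
```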
If you can’t do any of these, it seems best to get our signals from the upstream Web Server or Load Balancer.
Node.js
Like Python’s App Servers, everything in Node.js seems to be custom or code-embedded, so there is no standard way to monitor it to get our Signals. Many monitoring services like DataDog have custom integrations or support, and there are lots of add-on modules to export metrics in various ways.
There are some open source / commercially open tools such as KeyMetrics and AppMetrics that provide an API to get a lot of this data.
To make code changes or add your own Golden Signal metrics, see the summary under Python, above.
Otherwise, the best we can do without changing the code is get data from the upstream Web Server or Load Balancer.
Golang
Go runs its own HTTP server and is generally instrumented by emitting metrics for the things you want to see, such as Latency and Request Rate. Some people put Nginx (which you can monitor) in front of Go, while others use Caddy, which can provide Latency via the {latency_ms} log field, similar to Apache/Nginx, but has no great status or rate data (though there is a Prometheus plug-in that provides a few metrics).
To make code changes or add your own Golden Signal metrics, see the summary under Python, above.
Otherwise, the best we can do without changing the code is get data from the upstream Web Server or Load Balancer.
Ruby
Ruby remains quite popular for a wide variety of web applications, and over the last few years it has often replaced PHP in more complex systems. It’s similar to PHP in many ways, as it runs under an App Server such as Passenger that sits behind Nginx in reverse-proxy mode.
As with most of the app servers, it’s often better to get data from the web server or Load Balancers.
For Ruby running under Passenger, you can query the passenger-status tool to get key metrics (a sketch of parsing its output follows the list below).
For Passenger or Apache mod_rails, our Signals are:
- Rate — Get the “Processed” count per Application Group and do delta processing on this to get requests per second.
- Errors — There is no obvious way to get this, unless it is in the logs, but Passenger has no separate error log (only a very mixed log, often shared with the web server, though you can set the log level to errors only).
- Latency — You have to get this from the Apache/Nginx access logs. See the Web Server section.
- Saturation — The “Requests in Queue” per Application Group will tell you the server is saturated. Do not use “Requests in top-level queue” as this should always be zero, according to the docs.
- Utilization — There is no obvious way to get this. We can get the total number of processes, but it’s not obvious how to tell how many are busy. As with mod_PHP or PHP-FPM, it’s important to monitor memory use as it’s easy to run out.
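Here is a rough sketch (Python, shelling out to the passenger-status command; the exact wording of its text output varies by Passenger version, so the string matching is illustrative only) of deriving the Rate and Saturation items above:

```python
import re
import subprocess
import time

def passenger_snapshot():
    """Run passenger-status and pull out processed counts and per-group queue depths."""
    out = subprocess.run(["passenger-status"], capture_output=True,
                         text=True, check=True).stdout
    # The default text output includes "Processed: N" lines per process and
    # "Requests in queue: N" lines per application group (wording may vary by version).
    processed = sum(int(n) for n in re.findall(r"Processed:\s*(\d+)", out))
    queued = [int(n) for n in re.findall(r"Requests in queue:\s*(\d+)", out)]
    return processed, queued

prev, _ = passenger_snapshot()
time.sleep(60)                                       # sampling interval
cur, queues = passenger_snapshot()

print(f"rate_rps={(cur - prev) / 60:.2f}")           # Rate: delta of Processed counts
print(f"saturated={any(q > 0 for q in queues)}")     # Saturation: non-empty group queue
```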
COBOL & FORTRAN
Just kidding.
Next Service: MySQL Servers