We recently launched functional monitoring support in Nerrvana, and we think our readers will find it interesting to learn why Yandex (the world’s fourth-largest search engine, behind Google, China’s Baidu and Yahoo, according to comScore) uses functional monitoring.
This post is translated with Mikhail Levin’s permission; the original (in Russian) is on Habrahabr. Yandex also shared technical details of their system at Yet Another Conference 2012 – video and slides are available here (also in Russian).
Here it is:
“Do you monitor your services in production? Whose responsibility is it in your company?
When we think of monitoring, server-side developers, system administrators and DBAs often come to mind. They watch data-processing queues and free disk space, the availability of individual hosts and their load.
Such monitoring really gives a lot of information about the service, but does not always show how the service works for the end user. Therefore, in addition to system monitoring, we at Yandex have created a functional monitoring system that tracks the state of a service from its end-user interfaces – through the way the app looks and works in a browser, or how it behaves at the API level.
What is functional monitoring in our understanding? To answer that, let’s look at how things progressed for us.
It all began, of course, with regression auto-testing. We also launched these auto-tests after release, to check the service under real conditions. The fact that regression tests run in production sometimes found bugs puzzled us.
What is it and why do we need it?
Why do functional tests, written for regression testing and passing without failures in the test environment, often fail in production?
We have identified a few reasons:
• Differences between the configuration of the test and production environments.
• Problems with the internal or external data suppliers.
• Hardware problems affecting the functionality.
• Problems that appear over time and/or at a specific workload.
Since such tests can find problems, we decided to try running them regularly in production and monitor the services’ status.
Let’s take a closer look at the problems functional monitoring tests can help us with.
A good example of a page which depends on the data providers is Yandex’s home page.
Weather and news, events and TV shows, even the photo of the day and the search suggestions – all of this is data from internal and external providers.
For example, the Posters block for Arkhangelsk once looked like this:
While in Murmansk everything looked alright.
This happens when the data supplier has not sent data for Arkhangelsk (or the import has not updated on our side). Sometimes the problem is a one-off; in some cases, KPIs can be formulated based on the percentage of available data and its freshness.
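Such availability and freshness KPIs are straightforward to compute. Here is a minimal, hypothetical sketch in Python (the data structure, function name and thresholds are our illustration, not Yandex’s actual system): given the time of the last successful import per region, it returns the share of regions with any data and the share with fresh data.

```python
from datetime import datetime, timedelta

def data_kpis(blocks, now, max_age):
    """Availability/freshness KPIs for provider-fed blocks.

    `blocks` maps a region name to the timestamp of its last successful
    data import, or None if no data ever arrived (assumed structure).
    """
    total = len(blocks)
    available = [t for t in blocks.values() if t is not None]
    fresh = [t for t in available if now - t <= max_age]
    return {
        "availability_pct": 100.0 * len(available) / total,
        "freshness_pct": 100.0 * len(fresh) / total,
    }

# Example: Arkhangelsk has no data at all, Moscow's data is stale.
now = datetime(2024, 1, 1, 12, 0)
kpis = data_kpis(
    {
        "Arkhangelsk": None,
        "Murmansk": now - timedelta(minutes=10),
        "Moscow": now - timedelta(hours=3),
    },
    now,
    max_age=timedelta(hours=1),
)
```

A drop in either percentage is then just another signal the monitoring system can alert on.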
In our services, fault tolerance and performance play an important role. Therefore, the team created a service with a distributed architecture and load-balancing mechanisms. Failure of individual pieces of hardware, as a rule, does not affect the user, but major problems with the data centers or routing between them are sometimes visible in the end-user interfaces.
Tracing the connection between hardware problems and functionality is exactly what functional monitoring helps us with, complementing system monitoring.
For example, in Yandex.Direct we had a situation when a slowly “dying” server caused a gradual degradation of service, making it unavailable for some regions. Functional monitoring in this case served as a trigger for an emergency investigation and detection of the root of the problem.
Another interesting example is the drills held in our company. During these exercises, one of the data centers is intentionally disconnected to make sure this does not affect the health of the services we provide, and to train the staff to minimize the time needed to identify and fix possible problems. A data center failure does not harm the services thanks to their fault tolerance, but functional monitoring in this situation helps us watch the systems’ behavior.
Service degradation over time
Using an application in production sometimes creates unforeseen situations. The cause may be a combination of the volume, duration and type of load, or, for example, the accumulation of system errors not identified in the testing phase. Setup and configuration errors in the infrastructure can also cause problems, leading to degradation of the system or its failure.
If such problems cannot be identified in the testing phase, they must be identified quickly when they occur in production. Here system and functional monitoring can complement each other in finding problems and reporting them.
So functional monitoring is functional auto-tests, “sharpened” to search for specific problems and run continuously in production.
There is a second component of functional monitoring – the way you process results.
The large flow of results from constantly running production tests must be aggregated and filtered. The system must promptly report any problems while minimising false positives. There is also the problem of integrating information from functional monitoring into a single service-health evaluation system that combines all monitoring results.
To avoid false positives, our system, built on the Apache Camel framework, aggregates several sequential results from one test into a single event. For example, you can configure 3-out-of-5 filtering, which notifies about a problem only if the test produced an error three times in five consecutive runs (you can also specify, say, 2-out-of-2 filtering, or remove the filter entirely – 1 out of 1). The frequency of test runs is also important for the filters you set to make sense.
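The n-out-of-m logic itself is small. Below is a plain-Python sliding-window sketch of the idea (our illustration only – the actual system uses Apache Camel aggregation, and the class and method names here are assumptions):

```python
from collections import deque

class NofMFilter:
    """Alert only if at least `n` of the last `m` runs failed."""

    def __init__(self, n, m):
        self.n = n
        self.window = deque(maxlen=m)  # drops oldest result automatically

    def record(self, failed):
        """Record one test run; return True if an alert should fire."""
        self.window.append(bool(failed))
        return sum(self.window) >= self.n

# 3-out-of-5: a single failure stays silent; repeated failures alert.
f = NofMFilter(3, 5)
alerts = [f.record(x) for x in [True, True, False, True]]  # alert on 4th run
```

With 1-out-of-1 the filter degenerates to “alert on every failure”, which matches removing the filter as described above.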
Since we have many different services, our results are consumed by different departments: some failures are sent to testers, some reports contain data which managers are interested in, some results are sent and integrated into the overall monitoring system.
The idea of functional monitoring is very simple, and such monitoring can be very effective for your business.
To ‘cook’ functional monitoring:
1. Assess which of your services fail in production and why.
2. Write (or select from existing) auto-tests for this bit of functionality.
3. Run these tests in production as often as you need and your monitoring system can cope with.
4. Process the results and send notifications; compare them with other sources of information watching the same system.
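The steps above can be tied together in a single loop. A minimal sketch (the `check` and `notify` callables are hypothetical placeholders for your own production test and alerting code; this is not a production-ready scheduler):

```python
import time

def monitor(check, notify, n=3, m=5, interval=60, max_runs=None):
    """Minimal functional-monitoring loop (sketch).

    Runs `check` every `interval` seconds; `check` returns True on
    failure. A problem is reported via `notify` only after n-of-m
    filtering confirms it, to suppress one-off failures.
    """
    window = []
    runs = 0
    while max_runs is None or runs < max_runs:
        window = (window + [bool(check())])[-m:]  # keep last m results
        if sum(window) >= n:
            notify(f"{sum(window)} of last {len(window)} runs failed")
        runs += 1
        time.sleep(interval)
```

In practice each piece – scheduling, filtering, notification – would live in its own component (as in the Camel-based system described above), but the control flow is essentially this.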
PS: For quite a long time we have wanted to find out how widespread the idea of functional monitoring is and how it is used in other companies. Some people confirm they use it, some want to implement it once they hear the idea, and some think such monitoring is unnecessary, given that the system is monitored in a ‘classical’ way.
How do you monitor the status of your production services, what tools do you use, and how are they assembled?”