CloudTest Health Settings

Document created by DPM Admin Employee on Jul 20, 2017. Last modified by B-F-F08DRX on Nov 14, 2017.
Version 2

Testing at scale is the best way to understand where a system is going to fail and what is going to fail first. A robust testing environment is needed to generate the load and display results in real time. CloudTest is known for its ability to generate load tests at very large scales. However, even with the resources of the cloud, it is necessary to scale testing infrastructure to properly support the level of load generation and result aggregation in load tests.

Accurate measurements of response times are important in load tests in order to have confidence in results and to compare results between runs. Overloaded test infrastructure can lead to inaccurate response time measurement, or eventual failure of test infrastructure components. This puts test results, test effort, and any resulting actions from these tests at risk. For accurate and reliable results, test infrastructure should maintain resource utilization in a healthy operating range - like any other system under load.

For these reasons, over the last few releases, SOASTA has added various limits and settings to detect resource issues and to gracefully stop a test and capture the results in cases where issues occur. These thresholds are configurable, and designed to preserve your test results and their accuracy.

Running without these thresholds in place is not recommended. If you are encountering them, your test infrastructure is most likely overloaded, and you should investigate to make sure your test results are accurate and reliable. If a threshold does not fit your environment, adjusting it is a safer option than removing it.

Load Generation

Monitor Memory: you can set CloudTest to take action if a load generator (Maestro) exceeds a threshold for memory consumption. This is set in the Maestro Service of your main or any load generator using the setting Monitor.FreeMem.Threshold.

You can choose the server you want to modify by going to the server list, choosing the server, and then going to the Settings tab.

You can set it after you launch a grid, or, if you plan to use the same settings on a consistent basis, you can set it in the appropriate Server Classes. You can also duplicate one or more of the out-of-the-box Server Classes under Server Classes to create a new Server Class with one or more persistent settings.

The value is the amount of available free memory, as a percentage of total memory, below which the Maestro is considered "Unhealthy". The default is 3, which means that if memory usage goes above 97%, the action chosen in the Monitor.UnhealthyAction setting is invoked. Memory is checked after each garbage collection.

For Monitor.UnhealthyAction there are two choices: DRAIN and STOP. If you choose STOP, the composition in which that load generator is running will stop, the same as if you hit the Stop button in CloudTest.

DRAIN stops only the server(s) found to be unhealthy, in this case those that exceed the memory threshold. DRAIN lets the currently playing virtual users finish their work, but no new virtual users are created. This applies to all compositions running on the load generator, since the monitor does not know which composition is causing the issue. If the setting is DRAIN, you will be notified in the spinner at the top right and in the event log that unhealthy servers are being stopped.

If you set the value to -1 the check will be disabled. You may do that if you expect to experience periodic spikes that you know won’t bring down the load generators, but you’ll probably want to keep an eye on them during the test.
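The memory monitor's decision logic can be sketched as follows. This is an illustrative model only, not CloudTest's actual implementation; the function name and structure are assumptions, while the threshold semantics (default 3, -1 disables, DRAIN/STOP actions) come from the settings described above.

```python
# Hypothetical sketch of the Monitor.FreeMem.Threshold check, run after
# each garbage collection. Not CloudTest's actual code.

def check_memory_health(free_mem_percent, threshold=3, unhealthy_action="DRAIN"):
    """Return the action to take for a load generator.

    threshold: percent of free memory below which the Maestro is
    considered "Unhealthy"; -1 disables the check entirely.
    """
    if threshold == -1:           # check disabled
        return "NONE"
    if free_mem_percent < threshold:
        return unhealthy_action   # "DRAIN" this generator or "STOP" the test
    return "NONE"

print(check_memory_health(5))                            # NONE: 5% free is healthy
print(check_memory_health(2))                            # DRAIN: below 3% free
print(check_memory_health(2, unhealthy_action="STOP"))   # STOP
print(check_memory_health(1, threshold=-1))              # NONE: check disabled
```

Note that with the default of 3, a generator running at 97% memory used (3% free) is right at the edge of the unhealthy range.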

Monitor Load Average: you can set CloudTest to take action on a load generator (Maestro) by assessing the total number of threads waiting for a defined period of time: 1, 2, 5 or 15 minutes. For single-CPU systems that are CPU bound, one can think of load average as a percentage of system utilization during the respective time period. For systems with multiple CPUs, one must divide the number by the number of processors in order to get a comparable percentage.

This is set in the Maestro Service of your main or any load generator using the Monitor.LoadAvg.xMinThreshold settings. The setting is specific to the class of machine and the number of CPUs, and has a higher tolerance when measured over shorter periods. The default values, per CPU, are:

  • 1 Min - 50
  • 2 Min - 25
  • 5 Min - 15
  • 15 Min - 5

The above is multiplied by the number of CPUs. For example, an AWS Large has 2 virtual CPUs, so its settings will be 100, 50, 30, and 10.

As with the memory monitor, if the load average for a given period exceeds its threshold for the duration of that period, the action chosen in the Monitor.UnhealthyAction setting is invoked: either DRAIN the affected server(s) or STOP the test.

As with the memory settings, -1 disables the load average check. Before disabling this check entirely, consider instead increasing the 5- and 15-minute thresholds to 20 and 12, respectively.
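The per-CPU scaling described above can be sketched as a small calculation. The default values and the AWS Large example come from this document; the function names are illustrative assumptions.

```python
# Illustrative sketch of how the Monitor.LoadAvg.xMinThreshold defaults
# scale with CPU count. Hypothetical code, not CloudTest's implementation.

PER_CPU_DEFAULTS = {1: 50, 2: 25, 5: 15, 15: 5}   # minutes -> threshold per CPU

def effective_thresholds(num_cpus):
    """Scale the per-CPU defaults by the number of CPUs on the generator."""
    return {minutes: per_cpu * num_cpus
            for minutes, per_cpu in PER_CPU_DEFAULTS.items()}

def load_average_unhealthy(load_averages, num_cpus):
    """True if any measured load average exceeds its effective threshold."""
    limits = effective_thresholds(num_cpus)
    return any(load > limits[minutes]
               for minutes, load in load_averages.items())

# An AWS Large with 2 virtual CPUs gets thresholds 100, 50, 30, and 10:
print(effective_thresholds(2))                        # {1: 100, 2: 50, 5: 30, 15: 10}
print(load_average_unhealthy({15: 12}, num_cpus=2))   # True: 12 exceeds 10
print(load_average_unhealthy({1: 90}, num_cpus=2))    # False: 90 is under 100
```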

Results Service

There are a number of ways tests might be modified to avoid putting too much stress on the results service. For example, you don’t want to generate a large number of unique errors, such as having a session ID in an error text. The system does not allow more than 32,000 unique errors, and the practical limit for most implementations is going to be less than that.
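One way to stay well under the unique-error limit is to normalize error text before it carries per-session tokens into the results. The sketch below is hypothetical: the `sessionid=<hex>` format and the helper name are assumptions for illustration, not anything CloudTest prescribes.

```python
import re

# Hypothetical helper: collapse per-session tokens in error text so that
# thousands of otherwise-identical errors count as one unique error
# instead of one per session. Assumes session IDs look like "sessionid=<hex>".

SESSION_ID = re.compile(r"sessionid=[0-9a-f]+", re.IGNORECASE)

def normalize_error(message):
    return SESSION_ID.sub("sessionid=<redacted>", message)

print(normalize_error("Timeout for sessionid=9f3ab2"))
# Timeout for sessionid=<redacted>
```

After normalization, errors that differ only by session ID collapse into a single unique error, keeping the count far below the 32,000 limit.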

Another place to look is Container elements, such as Clips, Transactions and Groups, which cause the system to calculate statistics about each of these containers while the test is running. Heavy nesting of very fast running containers can create a lot of overhead. For example, a Clip which contains a Transaction which contains a Group which contains another Transaction which contains a single, fast running HTTP message, will generate a lot of metrics, which may, depending on the size and duration of the test, put a lot of stress on the system.

In addition, there are features and settings to help put the brakes on before a server is overwhelmed.

Max URLs: A large number of unique URLs can also stress the Results Service. The setting Writer.MaxURLDimensions governs the maximum number of unique URLs tracked for analytics. If this number is exceeded, all other URLs are set to "Other". Reducing this number helps reduce memory usage. The default is 30,000.

Alternatively, the composition has a property which governs how URLs are tracked for analytic purposes. Setting this to "Protocol and Host Names Only" can greatly reduce the memory overhead of computing statistics for URLs.
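The effect of "Protocol and Host Names Only" tracking can be sketched as follows: statistics are keyed on scheme and host rather than the full URL, so thousands of distinct paths collapse into one analytics dimension. The function below is an illustrative assumption, not CloudTest's implementation.

```python
from urllib.parse import urlsplit

# Hypothetical sketch of what "Protocol and Host Names Only" implies for
# URL tracking: the analytics key drops the path, query, and fragment.

def track_key(url, protocol_and_host_only=True):
    if not protocol_and_host_only:
        return url
    parts = urlsplit(url)
    return f"{parts.scheme}://{parts.netloc}"

# Every cart page on this host maps to a single dimension:
print(track_key("https://shop.example.com/cart/123?item=9"))
# https://shop.example.com
```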

Max Property Values: A large number of unique property values can also use up resources. Creating custom property values that have the "save value for analytics" checkbox checked causes the system to use memory for each property value. Checking this box when the property values contain something unique, such as session IDs, can cause excessive memory to be consumed. The Writer.MaxPropertyValueDimensions setting governs the number of unique dimensions. If the number of unique values exceeds this setting, which defaults to 30,000, the remaining values are set to "Other".
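The capping behavior described for both Writer.MaxURLDimensions and Writer.MaxPropertyValueDimensions can be sketched as follows. This is a hypothetical model of the documented behavior (new values beyond the cap fall into "Other"), not CloudTest's actual code.

```python
# Hypothetical sketch of dimension capping: once the number of unique
# values reaches the cap, any new value is lumped into "Other", while
# already-tracked values keep their own bucket.

class DimensionTracker:
    def __init__(self, max_dimensions=30000):   # 30,000 is the documented default
        self.max_dimensions = max_dimensions
        self.known = set()

    def dimension_for(self, value):
        if value in self.known:
            return value
        if len(self.known) < self.max_dimensions:
            self.known.add(value)
            return value
        return "Other"

t = DimensionTracker(max_dimensions=2)
print(t.dimension_for("a"))   # a
print(t.dimension_for("b"))   # b
print(t.dimension_for("c"))   # Other: cap of 2 already reached
print(t.dimension_for("a"))   # a: previously tracked values are unaffected
```

This is why unique values such as session IDs are so costly: each one consumes a dimension until the cap is hit, after which the analytics for new values are no longer distinguishable.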

Results Throttle: The Results Service considers itself under stress primarily when one of two limits is exceeded. Similar to the load generator's Monitor.UnhealthyAction, which kicks in under certain monitored limits, the writing of results to the database is throttled under specific conditions. If the database starts to become overwhelmed, it forces the Consolidators to slow down, having them write to the local Results Consolidator disk if necessary. The two settings that trigger the throttle are:

  • Database: The HealthCheck.Database.Max.Problem.Percent setting has a default of 30%. If the percentage of database update transactions that exceed a normal time limit over a 5-minute period exceeds this limit, the system considers itself under stress.
  • Memory: The HealthCheck.FreeMem.Threshold setting has a default of 10%. If the system has less than this percentage of free memory, it considers itself under stress. You can also set the frequency at which this threshold is checked.
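The two trigger conditions above can be sketched as a single check. The setting names and defaults are from this document; the function itself is an illustrative assumption.

```python
# Illustrative sketch of the two Results Service throttle triggers.
# Hypothetical code; only the setting names and defaults are documented.

def results_service_under_stress(slow_db_update_percent, free_mem_percent,
                                 max_problem_percent=30,   # HealthCheck.Database.Max.Problem.Percent
                                 freemem_threshold=10):    # HealthCheck.FreeMem.Threshold
    """True if either stress condition holds, triggering the throttle."""
    db_stressed = slow_db_update_percent > max_problem_percent
    mem_stressed = free_mem_percent < freemem_threshold
    return db_stressed or mem_stressed

print(results_service_under_stress(10, 50))   # False: both healthy
print(results_service_under_stress(35, 50))   # True: too many slow DB updates
print(results_service_under_stress(10, 8))    # True: free memory below 10%
```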

While this throttling minimizes the chance that the RSDB will experience a failure, it also means the dashboards will slow down, and writing the results will likely take quite a while to complete. Also, a very fast spike can still overwhelm the results server. If this occurs, in addition to examining the composition of your test, you may want to deploy a larger RSDB.


Encountering Resource Thresholds

If you have encountered a resource threshold, you might see a status message like one of these:

Stopped due to high system load average: system {3}-minute load average at {0}, over threshold of {1}, on {2}.


Stopped due to low free memory available: system free memory at {0}%, below threshold of {1}%, on {2}.