As software testing becomes a pillar of DevOps adoption, modern test automation platforms are expanding their range of functionality so that quality engineering teams can do more within shorter development cycles. These new capabilities are becoming the fundamental distinction between script-based testing frameworks and automated testing platforms: the former simply help QE teams execute more tests, while the latter transform testing results into data that enables QE to take on a leadership role in DevOps and continuously improve the customer experience.
mabl is designed to help all quality professionals, regardless of coding experience, maximize the potential of software testing. Though we’re always focused on being the easiest solution for test automation, we’re also driven by the tester’s need for better insights and better reporting. Today, we’re diving into a recent mabl-on-mabl scenario that demonstrates how test data from mabl can be used to troubleshoot test environments.
Failed Tests Create Questions
Recently, the mabl engineering team noticed increased failure rates in our unit and end-to-end tests. We relied extensively on mabl and other data sources to diagnose the issue and verify the fix. The process we used to investigate and resolve the issue can be replicated by other mabl users, and we hope that sharing our experience helps other teams identify and resolve environment issues of their own.
We first identified the issue in our GitHub Actions CI pipeline, where some build jobs would fail and then pass upon retry. Most of the failed jobs involved test failures and timeouts, yet our build logs showed no pattern: the failures occurred on different tests and different steps each run.
To confirm that this was indeed a systemic issue, we checked the testing dashboard for our QA environment, which highlighted the increased failure rate.
Failure rate metric from Google Data Studio via mabl BigQuery Export
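If you send your test results to BigQuery with mabl’s export integration, you can reproduce a failure-rate metric like this with a short query. Here’s a minimal sketch in Python; the project, dataset, table, and column names (`your-project.mabl_export.test_runs`, `status`, `environment`, `start_time`) are hypothetical placeholders, so substitute the schema of your own export.

```python
# Minimal sketch: daily failure rate from a mabl BigQuery export.
# Table and column names below are hypothetical placeholders;
# adjust them to match your own export's schema.
from google.cloud import bigquery

client = bigquery.Client()

query = """
SELECT
  DATE(start_time) AS run_date,
  COUNTIF(status = 'failed') / COUNT(*) AS failure_rate
FROM `your-project.mabl_export.test_runs`
WHERE environment = 'qa'
  AND start_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 60 DAY)
GROUP BY run_date
ORDER BY run_date
"""

for row in client.query(query).result():
    print(f"{row.run_date}: {row.failure_rate:.1%}")
```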
At this point, we knew that we had a problem worth investigating. We didn’t see a similar trend in our production environment and tests were passing consistently against our local builds, so our hypothesis was that the issue was related to our QA environment.
Narrowing Down Potential Causes
To test that hypothesis, we logged into mabl and inspected the test results for our primary smoke testing plan that runs against our QA environment. The first failure was a login issue, which was strange because the test had passed in both prior and subsequent runs.
Screenshot captured by mabl during test run
The next failure looked like a timeout: the page took over a minute to load, causing the test assertion to fail.
Screenshot captured by mabl during test run
The apparent timeout was a red herring: we investigated potential latency issues and found no evidence of general slowness. Our average page load time had been flat, or even decreasing, across recent test runs.
App speed index captured for each test step and aggregated by mabl for each test
Likewise, the overall execution time for tests that passed appeared to be getting faster, not slower.
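This trend is also easy to verify against the BigQuery export. A minimal sketch of a duration query follows, again with hypothetical table and column names (`end_time` in particular is a placeholder):

```python
# Minimal sketch: average duration of passing test runs over time,
# from a mabl BigQuery export. Table and column names are hypothetical.
from google.cloud import bigquery

client = bigquery.Client()

query = """
SELECT
  DATE(start_time) AS run_date,
  AVG(TIMESTAMP_DIFF(end_time, start_time, SECOND)) AS avg_seconds
FROM `your-project.mabl_export.test_runs`
WHERE environment = 'qa'
  AND status = 'passed'
GROUP BY run_date
ORDER BY run_date
"""

for row in client.query(query).result():
    print(f"{row.run_date}: {row.avg_seconds:.0f}s")
```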
Test run time from Google Data Studio via mabl BigQuery Export
Connecting the Dots
Digging into the timeout failure above, we noticed that the previous “Save” step included a warning. Reviewing that step, we saw that mabl had detected a 502 “Bad Gateway” response to an API call from the browser. Given the intermittent nature of the failures, we suspected that something was causing periodic connectivity issues, rather than a permanent network misconfiguration.
Network response logs captured by mabl during test run
We also checked request logs from the QA environment, where we could clearly see a significant increase in server errors (HTTP 5xx) over the past two months. We didn’t see a corresponding increase in production.
Aggregated request log info from Google BigQuery
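If your application request logs also land in BigQuery (for example, via a logging sink), a comparison like this is one query away. A minimal sketch, assuming a hypothetical `request_logs` table with `environment`, `status_code`, and `timestamp` columns:

```python
# Minimal sketch: weekly server-error counts per environment from
# request logs stored in BigQuery. Table and column names are hypothetical.
from google.cloud import bigquery

client = bigquery.Client()

query = """
SELECT
  environment,
  DATE_TRUNC(DATE(timestamp), WEEK) AS week,
  COUNTIF(status_code BETWEEN 500 AND 599) AS server_errors
FROM `your-project.logs.request_logs`
WHERE timestamp >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 90 DAY)
GROUP BY environment, week
ORDER BY environment, week
"""

for row in client.query(query).result():
    print(f"{row.environment} {row.week}: {row.server_errors} errors")
```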
Reviewing our test results, we noticed that the failure rate appeared to be higher for runs triggered by a deployment (via API) than for tests run on a schedule. Given that we had recently increased the number of tests configured to run on deployment, we hypothesized that the QA environment was struggling to handle the load generated by hundreds of parallel test runs.
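You can test for the same pattern in your own results by splitting the failure rate by how each run was triggered. A minimal sketch, assuming a hypothetical `trigger_type` column that distinguishes deployment-triggered runs from scheduled ones:

```python
# Minimal sketch: failure rate split by run trigger (deployment vs.
# scheduled). The trigger_type column is a hypothetical placeholder.
from google.cloud import bigquery

client = bigquery.Client()

query = """
SELECT
  trigger_type,
  COUNTIF(status = 'failed') / COUNT(*) AS failure_rate,
  COUNT(*) AS total_runs
FROM `your-project.mabl_export.test_runs`
WHERE environment = 'qa'
  AND start_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
GROUP BY trigger_type
"""

for row in client.query(query).result():
    print(f"{row.trigger_type}: {row.failure_rate:.1%} of {row.total_runs} runs")
```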
We then reviewed the configuration of our QA environment in our cloud console. We noticed that the instances powering our QA API were smaller than those powering our production API, and that they were scaling up rapidly with each deployment. The image below shows that the QA API (in green, at bottom) was typically powered by three instances and scaled up to six (the configured maximum) during each deployment, but only to four or five during periodic (non-deployment) plan runs.
Instance count over time for QA and Dev APIs from cloud provider console
Implementing a Fix
We changed the QA instance types to match production, increased the maximum instance count, and triggered additional deployment runs. Those runs succeeded, confirming our hypothesis that the additional testing load was exhausting the resources allocated to the QA environment. Since the change, the failure rate has stayed consistently under 2 percent.
Failure rate metric from Google Data Studio via mabl BigQuery Export
We hope this post sheds light on how test data in mabl can be used to investigate and resolve issues in your testing environment. Much of this data is available at your fingertips in the mabl user interface, and it can also be accessed programmatically via our Google BigQuery integration.
Thinking Outside the Pass/Fail Box
Test data is a valuable resource for identifying issues in both the product itself and the software development pipeline. Unfortunately, few test automation solutions are designed to help quality teams harness that data for troubleshooting and broader improvements. When a testing solution ignores the potential of quality engineering and software testers, quality teams are limited in what they can contribute to customer satisfaction, a better software product, and overall DevOps adoption. As this mabl-on-mabl case demonstrates, test data can help QA teams troubleshoot environments faster, resolve issues sooner, and get back to developing new testing strategies, collaborating with the rest of the development team, and leading the customer experience. Test automation platforms need to recognize this opportunity and help their end users maximize the valuable insights in their test data.
If your team is interested in the potential of test data for your organization, register for mabl's 14-day free trial. You’ll have full access to the mabl test automation platform, our award-winning support team, and our extensive library of support documentation.