This is a post about a new automated testing technique we have recently adopted to help us during our daily development work on Prosody. It’s probably most interesting to developers, but anyone technically-inclined should be able to follow along!
If you’re unfamiliar with our project, it’s an open-source real-time messaging server, built around the XMPP protocol. It’s used by many organizations and self-hosting hobbyists, and also powers applications such as Snikket, JMP.chat and Jitsi Meet.
Like most software projects, we routinely use automated testing tools to ensure Prosody is behaving correctly, even as we continue to work daily on fixes and improvements throughout the project.
We use unit tests, which test the individual modules that Prosody is built from, via the busted testing tool for Lua. We also developed scansion, an automated XMPP client, for our integration tests that ensure Prosody as a whole is functioning as expected at the XMPP level.
Recently we’ve been experimenting with a new testing technique.
Introducing ‘mutation testing’
Mutation testing is a way to test the tests. It is an automated process that introduces intentional errors (known as “mutations”) into the source code, and then runs the tests after each possible mutation, to make sure they identify the error and fail.
Example mutations are things like changing
If the program was originally correct, then these changes should make it
incorrect and the tests should fail. However, if the tests were not extensive
enough, they might not notice the change and continue to report that the code
is working correctly. That’s when there is work to do!
Mutation testing is similar and related to other testing methods such as fault injection, which intentionally introduce errors into an application at runtime to ensure it handles them correctly. Mutation testing is specifically about errors introduced by modifying the application source code in certain ways. For this reason it is applicable to any code written in a given language, and does not need to be aware of any application-specific APIs or the runtime environment.
One end result of a full mutation testing analysis is a “mutation score”, which is simply the percentage of mutated versions of the program (“mutants”) that the test suite failed to identify. Along with coverage (which counts the percentage of lines successfully executed during a test run), the mutation score provides a way to measure the quality of a test suite.
Code coverage is not enough
Measuring coverage alone does not suffice to assess the quality of a test suite. Take this example function:
function max(a, b, c) if a > b or a > c then return a elseif b > a or b > c then return b elseif c > a or c > b then return c end end
This (not necessarily correct) function returns the largest of three input values. The lazy (fictional!) developer who wrote it was asked to ensure 100% test coverage for this function, here is the set of tests they produced:
assert(max(10, 0, 0) == 10) -- test case 1, a is greater assert(max(0, 10, 0) == 10) -- test case 2, b is greater assert(max(0, 0, 10) == 10) -- test case 3, c is greater
Like most tests, it executes the function with various input values and ensures it returns the expected result. In this case, the developer moves the maximum value ‘10’ between the three input parameters and successfully exercises every line of the function, achieving 100% code coverage. Mission accomplished!
But wait… is this really a comprehensive test suite? How can we judge how extensively the behaviour of this function is actually being tested?
Running this function through a mutation testing tool will highlight behaviour that the developer forgot to test. So that’s exactly what I did.
The tool generated 5 mutants, and the tests failed to catch 4 of them. This means the test suite only has a mutation score of 20%. This is a very low score, and despite the 100% line and branch coverage of the tests, we now have a strong indication that they are inadequate.
To fix this, we next have to analyze the mutants that our tests considered acceptable. Here is mutant number one:
function max(a, b, c) if false and a > b or a > c then return a elseif b > a or b > c then return b elseif c > a or c > b then return c end end
See what it did? It changed the first
if a > b to
if false and a > b,
effectively ensuring the condition
a > b will never be checked. A condition
was entirely disabled, yet the tests continued to pass?! There are two
possible reasons for this: either this condition is not really needed for the
program to work correctly, or we just don’t have any tests verifying that
this condition is doing its job.
Which test case should have tested this path? Obviously ‘test case 1’:
assert(max(10, 0, 0) == 10)
a is the greatest input here, and indeed the test confirms that the function
returns it correctly. But according to our mutation testing, this is happening
even without the
a > b check, and that seems wrong - we would only want to
a if it is also greater than
b. So let’s add a test for the case
a is greater than
c but not greater than
assert(max(10, 15, 0) == 15)
What a surprise, our new test fails:
Failure → spec/max_spec.lua @ 4 max produces the expected results spec/max_spec.lua:1: Expected objects to be equal. Passed in: (number) 10 Expected: (number) 15
With this new test case added, the mutant we looked at will no longer be passed, and we’ve successfully improved our mutation score.
Mutation testing helped us discover that our tests were not complete, despite having 100% coverage, and helped us identify which test cases we had forgotten to write. We can now go and fix our code to make the new test case pass, resulting in better tests and more confidence in the correctness of our code.
Mutation testing limitations
As a new tool in our toolbox, mutation testing has already helped us improve lots of our unit tests in ways we didn’t previously know they were lacking, and we’re focusing especially on improving our tests that currently have a low mutation score. But before you get too excited, you should be aware that although it is an amazing tool to have, it is not entirely perfect.
Probably the biggest problem with mutation testing, as anyone who tries it will soon discover, is what are called ‘equivalent mutants’. These are mutated versions of the source code that still behave correctly. Unfortunately, identifying whether mutants are equivalent to the original code often requires manual inspection by a developer.
Equivalent mutants are common where there are performance optimizations in the
code but the code still works correctly without them. There are other cases
too, such as when code only deals with whether a number is positive or
negative (the mutation tool might change
-2 and expect the tests to
fail). There are also APIs where modifying parameters will not change the
result. A common example of this in Prosody’s code is Lua’s
where indices outside the boundaries of the input string do not affect the
string.sub("test", 1, 4) and
string.sub("test", 1, 5) are
equivalent because the string is only 4 characters long).
Although mutation testing is something I first read about many years ago and it immediately interested me, there were no mutation testing tools available for Lua source code at the time. As this is the language I spend most of my time in while working on Prosody, I’ve never been able to properly use the technique.
However, for our new authorization API in Prosody, I’m currently adding more new code and tests than usual and the new code is security-related. I want to be sure that everything I add is covered well by the accompanying tests, and that sparked again my interest in mutation testing to support this effort.
Still no tool was available for Lua, so I set aside a couple of hours to determine whether producing such a thing would be feasible. Luckily I didn’t need to start from scratch - there is already a mature project for parsing and modifying Lua source code called ltokenp written by Luiz Henrique de Figueiredo. On top of this I needed to write a small filter script to actually define the mutations, and a helper script for the testing tool we use (busted) to actually inject the mutated source code during test runs.
Combining this all together, I wrote a simple shell script to wrap the process of generating the mutants, running the tests, and keeping score. The result is a single-file script that I’ve committed to the Prosody repository, and we will probably link it up to our CI in the future.
It’s still very young, and there are many improvements that could be made, but it is already proving very useful to us. If there is sufficient interest, maybe it will graduate into its own project some day!
If you’re interested in learning more about mutation testing, check out these resources:
Prosody is a lightweight and flexible XMPP server designed with ease-of-use and extensibility in mind.