Tweagers have an engineering mantra — Functional. Typed. Immutable. — that begets composable software which can be reasoned about and avails itself to static analysis. These are all “good things” for building robust software, which inevitably lead us to using languages such as Haskell, OCaml and Rust. However, it would be remiss of us to snub languages that don’t enforce the same disciplines, but are nonetheless popular choices in industry. Ivory towers are lonely places, after all.
Last time I wrote about how we can use Python’s1 abstract base classes to express useful concepts and primitives that are common in functional programming languages. In this final episode, I’ll cover testing strategies that can be learnt from functional programming and applied to your Python code.
It’s hardly a revelation that testing the code we write — and automating that process — is an essential element of software engineering practice. Besides the headline case of catching bugs and regressions, a well-written test suite can exemplify how a codebase is intended to be used and can even guide how code is structured for the better.2
“Well-written”, however, is the operative word. One can easily fall into the folly of optimising for coverage, say, rather than focusing on what matters, such as the expected behaviour of the system under test or how its parts interface with each other. Dogmatic adherents to various methodologies can waste time ticking boxes, while bugs slip through the cracks. In a dynamically typed language, like Python, it can be tempting to just throw tests together and call it a day!
How, besides discipline, can we do better? Code often contains a tantalising amount of metadata, such as type signatures. Can we leverage that to facilitate testing; to reduce the cognitive burden on the engineer? Is it possible for the machine to find edge cases — to improve coverage and resiliency — rather than relying on our inherently fallible selves?
As you may suspect, all these questions have been answered with a resounding “Yes!” in functional programming ecosystems. So let’s look at what we can do in Python.
It’s common to see tests that use hard-coded examples, asserting against an expected output. These examples are often cherry-picked to elicit the (presumed) behaviour of the system under test. Better still is if the examples are engineered to trigger edge cases and failure modes.
The problem here is that the input space could be vast and the combinatorial effect will quickly overcome a human’s ability to reliably identify exemplars. Moreover, the relationship between the input and the expected output is, at best, only implicitly expressed; subtleties are easily lost.
Take, for example, sorting: a selection of inputs and circumstantially sorted outputs only gives anecdotal evidence of what’s being tested. The tests may be backed up by comments or the name of the test itself, but is this really sufficient? Say we’re testing a stable sort; while the before-and-after states imply its correctness under scrutiny, this is not going to be immediately obvious to a casual reader.
Property-based testing (PBT) turns this around by having you define that relationship (i.e., the “properties” of the system) and generating the examples for you. That generation is particularly clever, where most PBT libraries will steer the examples towards common minimal failure states; a process known as “shrinking”.3 Your test suite is thus transformed into a specification of the system under test, which is a far stronger guarantee that it’s doing what it’s supposed to, while simultaneously documenting those expectations.
Back to our stable sort example, the properties might be that, given an input list:
- Each element in the input is accounted for in the output; no more, no fewer.
- Each element (after the first) in the output should be greater than or equal to the element that precedes it, however element comparison is defined.
- The relative order of equal elements in the input must be preserved in the output.
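These three properties can be sketched as plain Python assertions; the helper below (`check_stable_sort` is an illustrative name, not code from this series) checks them against the built-in `sorted`, which is guaranteed to be stable:

```python
from collections import Counter

def check_stable_sort(xs: list[tuple[int, str]]) -> None:
    # Sort key-value pairs by key only, so equal keys can carry distinct payloads
    ys = sorted(xs, key=lambda pair: pair[0])

    # Property 1: every input element is accounted for; no more, no fewer
    assert Counter(ys) == Counter(xs)

    # Property 2: each element's key is >= that of the element preceding it
    assert all(ys[i - 1][0] <= ys[i][0] for i in range(1, len(ys)))

    # Property 3: stability -- equal keys preserve their relative input order
    for key in {k for k, _ in xs}:
        assert [v for k, v in xs if k == key] == [v for k, v in ys if k == key]

check_stable_sort([(1, "a"), (0, "b"), (1, "c"), (0, "d")])
```

Note that the relationship between input and output is now explicit: nothing here depends on any particular hard-coded expected output.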
If you’ve ever learnt Haskell, you may have been introduced to QuickCheck — which popularised PBT — often to entice you with what can be achieved. In the 20+ intervening years, PBT libraries now exist for many programming languages; Python is no different, with Hypothesis as its de facto PBT offering. So let’s stop navel gazing and get our hands dirty!
In our previous discussions, we defined a `List` type and its monoid. We demonstrated, in the Python REPL, that, say, concatenation appears to behave as we expect but, as stated above, this is not a rigorous test. Instead, let’s look at what properties we’d expect from list concatenation and test those instead. The obvious ones might be:
- We should start by reassuring ourselves that our “so-called monoid” actually obeys the monoid rules: associativity and the existence of an identity element.
- As a gate-keeping sanity check, the length of the concatenated list (`a * b`) must be equal to the sum of the lengths of its inputs (`a` and `b`, respectively): `len(a * b) == len(a) + len(b)`.
- The (0-indexed) `i`th element of `a * b` should equal:
  - the `i`th element of `a`, when `i < len(a)`;
  - the `(i - len(a))`th element of `b`, when `i >= len(a)`.
We can quite easily convert these properties into code:4
```python
from typing import TypeVar

T = TypeVar("T")

def test_monoid(a: List[T], b: List[T], c: List[T]):
    # Associativity: (a * b) * c == a * (b * c)
    assert (a.mappend(b)).mappend(c) == a.mappend(b.mappend(c))

    # Identity: e * a == a == a * e
    identity = List.mempty()
    assert identity.mappend(a) == a == a.mappend(identity)

def test_concatenation(a: List[T], b: List[T]):
    concatenated = a.mappend(b)
    assert len(concatenated) == len(a) + len(b)

    for i in range(len(concatenated)):
        if i < len(a):
            assert concatenated[i] == a[i]
        else:
            assert concatenated[i] == b[i - len(a)]
```
I hope you’d agree that these tests are clearly defining the expected behaviour. While the entire input space won’t be used for our monoid rule test — it’s a test, not a proof — having many generated examples is surely better than a handful. The question now becomes, “Where do the generated inputs come from?”
Enter Hypothesis, to help us out.
Like many other PBT libraries, Hypothesis provides an extensive suite — which it calls “strategies” — of primitive value generators, including from type introspection, as well as combinators which allow you to build more complex value generators. These can then be utilised by Hypothesis in a test harness, which generates the input values under test.
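For a flavour, here are a few of Hypothesis’ built-in strategies and combinators (the names `digits`, `pairs`, and so on are illustrative, not from this series):

```python
from hypothesis import strategies as st

# Primitive strategies generate simple values...
digits = st.integers(min_value=0, max_value=9)

# ...while combinators compose them into richer generators
pairs = st.tuples(digits, st.text(max_size=3))
maybe_digit = st.one_of(st.none(), digits)
evens = digits.map(lambda n: n * 2)
```

Strategies can also be filtered, flat-mapped and built recursively, which is what makes generating complex domain values tractable.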
These candidate values are pseudorandom and the quantity configurable. The more examples that are generated, the greater confidence you have that your code correctly satisfies the properties you’ve defined; albeit at the expense of test runtime.5 That said, Hypothesis keeps state, so falsifying examples will be retried when they’re found, and it will focus its search around common failure states.
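The example count is raised with Hypothesis’ `settings` decorator; the property below is just an illustrative stand-in:

```python
from hypothesis import given, settings
from hypothesis import strategies as st

# Raise the example count from the default of 100, trading runtime for confidence
@settings(max_examples=500)
@given(st.lists(st.integers()))
def test_reverse_is_an_involution(xs: list[int]):
    assert list(reversed(list(reversed(xs)))) == xs
```

Invoking the decorated function (e.g., via pytest) runs the property against 500 generated lists.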
Let’s demonstrate with a simple example. Consider the property of real numbers being positive when squared. We can formulate this as a Hypothesis test6 as:
```python
from hypothesis import given
from hypothesis.strategies import floats

# NOTE IEEE-754 floats encode a NaN value, which we want to avoid
@given(floats(allow_nan=False))
def test_square(v: float):
    assert v * v > 0
```
Oops! Running this will fail, because we got our property wrong. The squares of real numbers aren’t strictly positive; they are non-negative. Hypothesis rightfully complains and shows us where things went wrong:
```
AssertionError
Falsifying example: test_square(
    v=0.0,
)
```
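The fix is to weaken the property to non-negativity, after which the test passes:

```python
from hypothesis import given
from hypothesis.strategies import floats

@given(floats(allow_nan=False))
def test_square(v: float):
    # Non-negative, not strictly positive: zero is a real number too!
    assert v * v >= 0
```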
So our task, to test the properties of our list monoid, is to write a strategy, using Hypothesis’ primitives, to generate `List`s.7 To keep things simple, we’ll generate `List`s of integers — the type of the elements shouldn’t affect concatenation, after all — and we won’t bother parametrising size limits, like the Hypothesis `lists` strategy does.
So, as a first approximation:
```python
from hypothesis import strategies as st

@st.composite
def lists(draw: st.DrawFn) -> List[int]:
    # Draw an integer element, or None to signal the list terminator
    element = draw(st.one_of(st.none(), st.integers()))

    match element:
        case None:
            return Nil()

        case _:
            # Recursively draw from this strategy to extend the list
            return Cons(element, draw(lists()))
```
The magic is in the `composite` decorator and the `draw` function it provides. It’s a very clean API that makes writing complex strategies straightforward.
We can now use this strategy with the property test functions we wrote earlier:
```python
@given(lists(), lists(), lists())
def test_monoid(a: List[int], b: List[int], c: List[int]):
    ...

@given(lists(), lists())
def test_concatenation(a: List[int], b: List[int]):
    ...
```
Lo and behold, it works!
To reassure ourselves, let’s try to break it by purposely introducing a bug in the `List.mappend` implementation, keeping our properties as previously defined. For example, we can easily swap the input lists around in the concatenation:
```python
def mappend(self, rhs: List[Monoid[T]]) -> List[Monoid[T]]:
    # Should be: foldr(Cons, rhs, self)
    return foldr(Cons, self, rhs)
```
Immediately, Hypothesis complains. The monoid rules still hold, but the concatenation is obviously incorrect:
```
AssertionError: Cons(0, Nil()) * Cons(1, Nil()) != Cons(1, Cons(0, Nil()))
Falsifying example: test_concatenation(
    a=Cons(0, Nil()),
    b=Cons(1, Nil()),
)
```
At this point we should feel pretty smug with ourselves.
You’d be forgiven for thinking of PBT as a panacea — and let’s be honest: in many ways it is — however, it is not without its practical shortcomings. In particular, despite being more resilient than its human counterparts, a machine is still limited by the combinatorial explosion problem I mentioned earlier.
It is so easy to write strategies, that one can get carried away. If your domain values are particularly complicated or deeply nested, Hypothesis (or any PBT library) will take an increasing amount of time to compute examples; to the point that testing becomes intractable. Decomposing the problem — another instance of a testing strategy influencing how code is structured — can help alleviate this, but it’s not always possible.
However, of course, using PBT in your project is not some kind of Faustian bargain. It is a tool — a particularly powerful one — that ought to be wielded liberally, but wisely. It wouldn’t make sense, for example, to agonise over a suite of complex strategies to simulate coherent state for an integration test, when rigging up that integration test is ultimately going to be cheaper.
We should also take care not to decry example testing outright. While PBT is the more robust technique, throwing in a handful of examples can provide stronger assurances against known failure states, which we cannot guarantee will be generated by the PBT library. Examples also have illustrative power, which greatly benefits the reader when it comes to maintenance. Implementing these as unit tests, rather than in documentation, also ensures their assertions won’t drift from the truth.
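Hypothesis, in fact, supports exactly this hybrid approach: known edge cases can be pinned with its `example` decorator, so they are exercised on every run alongside the generated inputs. A sketch, reusing our squaring property:

```python
from hypothesis import example, given
from hypothesis import strategies as st

@given(st.floats(allow_nan=False))
@example(0.0)   # pin the known edge case: it is now tested on every run
@example(-0.0)
def test_square_is_non_negative(v: float):
    assert v * v >= 0
```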
Complementary to PBT is the concept of “design by contract”, which was originally developed for Eiffel, an object-orientated programming language. Contracts, much like their real-world analogue, can be used to enforce the interface of, say, a function in terms of its expectations (pre-conditions), guarantees (post-conditions) and the state which the function maintains (invariants). Ordinarily, these are checked at runtime — during development — where any violations will cause the program to crash.
There are a few ways to specify contracts in Python:
- Informally, with plain `assert` statements;
- Using a library like Deal or icontract;
- There’s even a long-dormant proposal to standardise them.
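To give a flavour of the informal approach, pre- and post-conditions can be expressed with plain `assert`s; this `integer_sqrt` is an illustrative example, not code from the series:

```python
def integer_sqrt(n: int) -> int:
    # Pre-condition: the input must be non-negative
    assert n >= 0, "pre-condition violated: n must be non-negative"

    root = 0
    while (root + 1) ** 2 <= n:
        root += 1

    # Post-condition: root is the largest integer whose square is at most n
    assert root ** 2 <= n < (root + 1) ** 2, "post-condition violated"
    return root
```

Library-based approaches (e.g., Deal or icontract) wrap the same idea in decorators, which keeps the contract separate from the implementation and machine-readable.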
The fine-print of contracts is by the bye and runtime checking is definitely not what this series is about. So why do I mention it? It turns out there are tools that can perform static analysis against contracts. One such tool, in the Python ecosystem, is CrossHair.8
Whereas Hypothesis will generate pseudorandom inputs and shrink them to common failure states, CrossHair works by analysing the execution paths of any code under contract with an SMT solver to find falsifying examples. It understands various kinds of contracts, including those outlined above, as well as pre-conditions specified by Hypothesis strategies.
CrossHair is still alpha software. As of writing, it appears to handle our `List` monoid, with the Hypothesis pre-conditions that we defined above. (Although it starts to misbehave when we introduce a bug.) That said, it’s a promising development in this space that’s worth keeping an eye on.
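For a taste, CrossHair understands PEP 316-style docstring contracts like the following sketch; violations are found by running `crosshair check` over the module, rather than at runtime:

```python
def absolute(x: int) -> int:
    """Return the magnitude of x.

    post: __return__ >= 0
    post: __return__ == x or __return__ == -x
    """
    return -x if x < 0 else x
```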
Testing can be a chore. A necessary evil. Writing good tests, which faithfully and clearly represent the logic of your codebase, is harder still. PBT is like a super-power that inverts the testing burden from hard-coding examples — which, although beginner-friendly, is highly limited — into thinking more deeply about what your code should be doing, leaving the machine to do the leg work. It’s almost addictive!
Throughout this series, I’ve had one goal in mind: to show that techniques from functional programming can be productively applied to “the second best language for everything”. PBT is yet another example of this and Hypothesis is a mature Python library to enable this super-power; of which, I’ve barely scratched the surface. Nonetheless, despite my introductory tour, I hope this and a glimpse of things to come has convinced you to give it — and the other techniques explored in this series — a try.
Thanks to Julien Debon, Guillaume Desforges, Johan Herland, Mark Karpov and Vince Reuter for their reviews of this article.
- We are not limited to Python; these techniques can be applied in any language with suitable support, libraries and tooling.↩
- The value of dependency inversion, for example, is not completely obvious until you start writing tests. At which point, you’ll wish you’d done it sooner!↩
- My colleague, Julien Debon, wrote a great article on shrinking strategies in OCaml’s port of QuickCheck. (Julien also takes the credit for introducing me to PBT proper.)↩
- For the sake of the examples, we assume that our `List` type is equatable, indexable and sized (i.e., implements `__eq__`, `__getitem__` and `__len__`, respectively) and its element type is equatable. This is left as an exercise for the reader.↩
- For local development, it can be useful to set the number of generated examples fairly low, to keep the development loop tight. However, it makes sense to increase that significantly in the CI/CD environment to increase assurance.↩
- “Hypothesis” is a slightly annoying name for a testing library, as searches for “Hypothesis test” inevitably return results about statistical hypothesis testing! This is one of life’s many trials.↩
- Because we use generics, Hypothesis is not able to derive the generator for our `List` from type annotations alone. If we lose the class hierarchy that simulates ADTs and stick with concrete element types, then `from_type` can work against `Cons`. That’s not a good trade-off, when writing a generator by hand is so straightforward.↩
- My colleague, Conner Baker, first brought CrossHair to my attention. This article was just going to be about PBT, but even at this early stage, this tool is too cool not to briefly mention!↩