BDD / TDD criticized


While flying to the moon a rocket throws off parts that are no longer needed, but they were necessary to get to the moon in the first place - and so was the dumping. :wink:

And with the Space Shuttle, the SRBs were not simply discarded. While they never made it into orbit themselves, they were always an integral part of deploying the shuttle into orbit.

A nice extra is that e2e tests can be used with legacy code that isn’t neatly structured.

I was once pulled into a project where the maintenance team of a legacy application had hit a wall. Real-world usage kept uncovering edge cases that the e2e testing didn’t account for, but even after adding additional tests they still had a devil of a time tracking down the root cause of these defects. Typically a defect was detected so far downstream that they often got “lost” when they tried to backtrack to the root cause with the debugger.

Unsurprisingly the application was a Big Ball of Mud so I attacked the only boundary that I could identify - between the application and the database.

It’s easy enough to trace that type of data at the driver level, but at that level you typically can’t instrument the data to identify its full context. So I pulled all the DB code into a separate layer, in such a way that a shim layer could be inserted in the development environment. That shim layer could then be augmented to log and instrument all the data moving back and forth between the two sides.
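To give a flavour of the idea, here is a minimal sketch of such a shim in TypeScript. The `OrderStore` interface and all names are hypothetical - the real project’s boundary was of course much larger:

```typescript
// Hypothetical boundary interface that the extracted DB layer exposes.
interface Order {
  id: string;
  total: number;
}

interface OrderStore {
  findOrder(id: string): Promise<Order | undefined>;
  saveOrder(order: Order): Promise<void>;
}

// The shim wraps the real store and logs all traffic crossing the boundary.
class LoggingOrderStore implements OrderStore {
  constructor(private readonly inner: OrderStore) {}

  async findOrder(id: string): Promise<Order | undefined> {
    console.log(`[app->db] findOrder(${id})`);
    const order = await this.inner.findOrder(id);
    console.log(`[db->app] ${JSON.stringify(order)}`);
    return order;
  }

  async saveOrder(order: Order): Promise<void> {
    console.log(`[app->db] saveOrder(${JSON.stringify(order)})`);
    await this.inner.saveOrder(order);
  }
}
```

In development the application is wired against `new LoggingOrderStore(realStore)`; in production it uses the real store directly.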

Once that was in place it was much easier to reason about what the application was actually doing. The team could run a passing test and see what “healthy data traffic” looked like. Based on that, they could formulate a working hypothesis of what “healthy data traffic” should look like for a new, failing test. When they encountered a deviation, they investigated it, typically with one of two outcomes:

  • Their hypothesis was flawed. They adjusted their hypothesis and carried on examining the data traffic.
  • Their hypothesis was correct and led them straight to the root cause of the defect.

In the end this tactic allowed them to move forward when (e2e + debugger) failed them.

I came to the conclusion that often (e2e + debugger) simply doesn’t provide enough “developmental observability”.


Michael Feathers responded to James Coplien’s Why Most Unit Testing is Waste:

https://twitter.com/mfeathers/status/441598005515669504?lang=en

The Flawed Theory Behind Unit Testing (2008)

Unsurprisingly none of this convinced Coplien but the article makes some interesting points. First:

  • One very common theory about unit testing is that quality comes from removing the errors that your tests catch. … It’s a nice theory, but it’s wrong.

So not only is there rampant confusion about what exactly a “unit” is - as far as Feathers is concerned, “testing” isn’t even the primary objective of “unit testing”. He practices it to improve code quality - demonstrating correctness is a mere side effect.

Code quality is a somewhat nebulous and subjective concept. There are software metrics that measure code characteristics (e.g. cyclomatic complexity), and some people set limits on those metrics in order to pin down what “code quality” actually is, but the whole business has an air of educated arbitrariness about it.

But there is a hint in the preceding paragraph about the code quality that Feathers is after:

  • The problem that I saw with the mock object approach was that it only tested individual classes, not their interactions. Sure, the tests I wrote were nominally unit tests, but I liked the fact that they occasionally tested real interactions between a class and its immediate collaborators. Yes, I liked isolation but I felt that this little tiptoe into integration level testing gave my tests a bit more power and a bit more strength.

For me this is my second encounter with a “high profile” Detroit School (Classical) TDD practitioner expressing some skepticism about the London School (Mockist) use of mock objects. The first was Martin Fowler.

In the end, though, they both play nice and don’t press the point. I think that was a mistake.

For example, a good talk about keeping the volume of unit tests under control is Sandi Metz’s The Magic Tricks of Testing. But from Feathers’s perspective the problem is already apparent in the title - it talks about unit tests and testing, not code quality.

Recently you posted about Mark Seemann’s work (thank you for that - I think anybody who has finished watching Scott Wlaschin’s talks needs to continue with Mark Seemann’s talks).

In Functional architecture - The pits of success he credits Jessica Kerr with the concept of isolation.

Inspired by Haskell, Mark Seemann pushes isolation all the way to purity when he talks about Dependency Rejection. I’ve seen the idea of the impure-pure-impure sandwich somewhere before, though I can’t recall where. It’s an interesting approach, though I’m not sure how it would fare in large-scale software development - there isn’t a lot of commercial Haskell development to judge by, and even in Haskell imperative-style programming is still possible thanks to the monadic do syntax.
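As a rough sketch of the impure-pure-impure sandwich - in TypeScript rather than Haskell, with a made-up reservation domain and invented names - the shape looks something like this:

```typescript
interface Reservation {
  date: string;
  quantity: number;
}

// Pure core: a decision function with no I/O, trivially testable
// without any test doubles. The capacity of 10 is an arbitrary example.
function canAccept(capacity: number, existing: Reservation[], request: Reservation): boolean {
  const booked = existing.reduce((sum, r) => sum + r.quantity, 0);
  return booked + request.quantity <= capacity;
}

// Impure shell: the impure read happens before the pure decision,
// the impure write after it - the "sandwich".
async function handleReservation(
  request: Reservation,
  readReservations: (date: string) => Promise<Reservation[]>, // impure
  writeReservation: (r: Reservation) => Promise<void>         // impure
): Promise<boolean> {
  const existing = await readReservations(request.date); // impure slice
  const accepted = canAccept(10, existing, request);      // pure filling
  if (accepted) {
    await writeReservation(request);                      // impure slice
  }
  return accepted;
}
```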

However, the extreme of purity is still instructive in demonstrating what isolation is after. Mutability by default, which is typical of OO, undermines purity because anything can retain state. But isolation severely constrains what can influence a “unit’s” behaviour.

I suspect that the code quality Feathers is after is isolation, and that the units he works with are units of isolation. A unit of isolation could be a single class, but is often a collection of collaborating classes. So there is no directive to write unit tests for every single class.

If so, unit tests demarcate the boundary of the unit of isolation and the assertions monitor the health of that unit while development progresses.
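To make that concrete, here is a hedged illustration - every class and value is hypothetical - of a test written against the boundary of a small unit of isolation rather than against each class individually:

```typescript
import assert from "node:assert";

class TaxTable {
  rateFor(region: string): number {
    return region === "EU" ? 0.25 : 0.5;
  }
}

class DiscountPolicy {
  discountFor(quantity: number): number {
    return quantity >= 10 ? 0.25 : 0;
  }
}

// The unit of isolation: PriceCalculator plus its two collaborators.
class PriceCalculator {
  constructor(
    private readonly taxes = new TaxTable(),
    private readonly discounts = new DiscountPolicy()
  ) {}

  total(unitPrice: number, quantity: number, region: string): number {
    const gross = unitPrice * quantity * (1 - this.discounts.discountFor(quantity));
    return gross * (1 + this.taxes.rateFor(region));
  }
}

// One assertion at the boundary exercises all three classes - no test
// doubles, and the inner collaborators stay free to be refactored.
assert.strictEqual(new PriceCalculator().total(100, 10, "EU"), 937.5);
```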

With classical unit testing there is significant design pressure to optimize the boundaries of the unit of isolation - otherwise the developer has to write too much test-double code.

Mockist unit testing significantly reduced the penalty for needing test doubles and therefore removed the pain of assuming that the class boundary is the boundary of the unit. It became much easier to stop thinking (i.e. to stop working out where the boundary should actually be) and thereby lose the primary benefit of unit testing: code quality.

That doesn’t mean mocking libraries/frameworks are bad. Mocking is essential for dealing with boundaries that you have no control over, as sketched below. But there should always be a very good justification for mocking your own code - and there is lots of material about mocking as a design smell (When is it safe to introduce test doubles?).
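For instance - with an invented exchange-rate service standing in for the uncontrolled boundary - hiding the third party behind an interface you own makes the test double trivial:

```typescript
// The boundary you don't control, hidden behind an interface you do.
interface ExchangeRates {
  rateFor(currency: string): Promise<number>;
}

// Production implementation; the URL is made up for illustration.
class HttpExchangeRates implements ExchangeRates {
  async rateFor(currency: string): Promise<number> {
    const res = await fetch(`https://example.com/rates/${currency}`);
    const body = (await res.json()) as { rate: number };
    return body.rate;
  }
}

// The test double lives entirely in code you control.
class FixedExchangeRates implements ExchangeRates {
  async rateFor(_currency: string): Promise<number> {
    return 1.25;
  }
}

async function priceIn(amount: number, currency: string, rates: ExchangeRates): Promise<number> {
  return amount * (await rates.rateFor(currency));
}

// In a test: no network and no mocking framework needed at this seam.
priceIn(100, "EUR", new FixedExchangeRates()).then(p => console.log(p)); // 125
```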

Classical TDD attempts to discover the appropriate units of isolation which in turn should lead to better code (as in easier to change later).

These days when unit testing is taught, mocking libraries like sinon.js are introduced before one gets to the point of thinking:

How can I organize this code so that I don’t have to create so many test doubles?
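To make the question concrete, here is a hedged before/after sketch - the late-fee domain and all names are invented:

```typescript
import assert from "node:assert";

// Before: the calculation reaches out to a clock and a repository,
// so a test needs a double for each collaborator.
interface Clock { now(): Date; }
interface InvoiceRepo { daysOverdue(id: string, asOf: Date): number; }

class LateFeeService {
  constructor(private readonly clock: Clock, private readonly repo: InvoiceRepo) {}
  feeFor(id: string): number {
    const days = this.repo.daysOverdue(id, this.clock.now());
    return days > 30 ? days * 0.5 : 0;
  }
}

// After: the decision is a pure function of plain values and the I/O
// moves to the caller - the test needs no doubles at all.
function lateFee(daysOverdue: number): number {
  return daysOverdue > 30 ? daysOverdue * 0.5 : 0;
}

assert.strictEqual(lateFee(40), 20);
assert.strictEqual(lateFee(10), 0);
```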

So if one accepts a predominant practice of degenerate unit testing - one that uses mocks heavily and focuses on “testing” rather than on code quality as manifested by appropriately isolated functional capabilities - then these reports of the failure of unit testing aren’t surprising.
