Wednesday, January 16, 2013

The Unit Test Maturity Model

In software development (and other disciplines), there is the notion of a "maturity model", where level zero reflects complete ignorance of the practice being measured, followed by multiple levels of increasing acknowledgement and mastery of that practice.

The first of these was clearly the Capability Maturity Model, which was applied to software and general business processes way back in the previous century. An instance more relevant to this article is the Testing Maturity Model, which applies to software testing in general.

It occurs to me that no one has thought of defining a maturity model for unit testing, separate from software testing in general. This article is an attempt to define this scale. I write this from the context of Java development, but these levels will easily apply to other programming languages.

Note that you could also consider other aspects of testing, like acceptance testing, which can be done in an automated fashion. This is very important, but it can and should be addressed separately from unit testing.

Level 0 - Ignorance

The zeroth level of this model is represented by organizations that are either completely ignorant of the need for unit testing in their software or that believe that "it won't work here".

I've also seen a pattern at this level where managers use "unit test" as a verb, not a noun. By this they mean that developers build the application and run it on the target platform to make sure it behaves as intended. The idea that tests could be written in an isolated fashion and run automatically has never occurred to them.

In my current organization, I'd say we were in this state barely a year ago.

Level 1 - A Few Simple Tests

At this level, some people on the team understand the basic idea of a unit test and acknowledge that it's probably a good thing. A few simple unit tests are written and stored in source control, and that's all.

These tests are typically only used by the person who wrote the code being tested. They are not consistently maintained, and they are not run or even compiled as part of the build, so it's impossible to tell what state they will be in at any given time.
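
As a rough illustration, a level-1 test is often nothing more than a single JUnit class like the sketch below. The Calculator class is a hypothetical stand-in for whatever code the developer happened to be working on, shown inline so the example is self-contained:

    import org.junit.Test;
    import static org.junit.Assert.assertEquals;

    // A hypothetical "level 1" test: it exercises one small class directly,
    // with no mocks, no fixtures, and no connection to the build process.
    public class CalculatorTest {

        // The class under test, shown inline so the sketch is self-contained.
        static class Calculator {
            int add(int a, int b) {
                return a + b;
            }
        }

        @Test
        public void addsTwoNumbers() {
            assertEquals(5, new Calculator().add(2, 3));
        }
    }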

Level 2 - Mocks and Stubs

After level 1, there are several possible epiphanies a team may have, so the order of the following levels may vary. One possible discovery at level 2 is using mocks and stubs to isolate physical dependencies, making it possible to write a true unit test.

If the software being tested is structured somewhat reasonably (see the other level descriptions), then the layers that access physical dependencies are isolated into separate dependent classes. This makes it easy to use mocking frameworks like Mockito or EasyMock to write tests. Sometimes PowerMock is needed when the code isn't quite as well structured. JavaScript has frameworks like MockJax that perform a similar function.
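
To illustrate the idea, here is a hedged sketch of a Mockito-based test. OrderService and PaymentGateway are hypothetical stand-ins for a class under test and the physical dependency it talks to, defined inline so the sketch compiles on its own:

    import org.junit.Test;
    import static org.junit.Assert.assertTrue;
    import static org.mockito.Mockito.*;

    public class OrderServiceTest {

        // Hypothetical physical dependency: in real life this would talk to a
        // remote payment system, so we never want to call it in a unit test.
        interface PaymentGateway {
            boolean charge(String customerId, int amountInCents);
        }

        // Hypothetical class under test; the dependency is passed in, which
        // is what makes mocking it straightforward.
        static class OrderService {
            private final PaymentGateway gateway;

            OrderService(PaymentGateway gateway) {
                this.gateway = gateway;
            }

            boolean placeOrder(String customerId, int amountInCents) {
                return gateway.charge(customerId, amountInCents);
            }
        }

        @Test
        public void chargesThePaymentGatewayWhenAnOrderIsPlaced() {
            // Replace the physical dependency with a Mockito mock so the test
            // runs entirely in memory.
            PaymentGateway gateway = mock(PaymentGateway.class);
            when(gateway.charge("customer-42", 500)).thenReturn(true);

            OrderService service = new OrderService(gateway);

            assertTrue(service.placeOrder("customer-42", 500));
            verify(gateway).charge("customer-42", 500);
        }
    }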

Level 3 - Design for Testability

This level simply refers to writing code that is easier to test, even if no tests are ever written for it. If this happens naturally before an organization even discovers unit tests, then they probably won't ever need PowerMock (although Mockito is still a no-brainer).

Teams at this level write code that has clear delineations of responsibility (See the Single Responsibility Principle), so it's easy to separate code to be tested from other code that is used to satisfy dependencies.

This is often associated with the use of dependency injection and inversion of control. You can still get reasonable "design for testability" without either of these, but you'll often end up with lots of static utility classes and complicated constructor calls, which tend to require PowerMock to untangle.
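
As a sketch of what this looks like in practice (with hypothetical class names), the collaborator is handed in through the constructor rather than reached through a static utility, so a test can substitute a mock or stub without any PowerMock gymnastics:

    // Hypothetical example of constructor injection: the collaborator is
    // passed in rather than reached through a static utility class, so a
    // test can supply a mock or stub implementation.
    public class InvoiceService {

        private final TaxCalculator taxCalculator;

        // An IoC container (or a unit test) decides which implementation
        // to supply.
        public InvoiceService(TaxCalculator taxCalculator) {
            this.taxCalculator = taxCalculator;
        }

        public double totalFor(double amount) {
            return amount + taxCalculator.taxOn(amount);
        }
    }

    interface TaxCalculator {
        double taxOn(double amount);
    }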

Level 4 - Test Driven Development

It's hard to say whether this level belongs at this point in the scale, or higher up, or if it's even necessary at all.

The basic idea of test-driven development is that instead of writing unit tests after you write the code, you start by writing the unit test, and evolve both the code being tested and the test at the same time. This sounds like a simple idea, but it results in a very different way of thinking about writing code. The key point to understand is that when you write code in a TDD fashion, you end up with code that is structured differently from what you would get without TDD. Using TDD subsumes the idea of "design for testability": you inherently get testable code when you use TDD.
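
A minimal sketch of the rhythm, using a hypothetical StringReverser class: the test is written first, it fails (or doesn't even compile), and then just enough production code is written to make it pass:

    import org.junit.Test;
    import static org.junit.Assert.assertEquals;

    // The test comes first; at this point StringReverser doesn't exist yet,
    // so this file won't even compile.
    public class StringReverserTest {

        @Test
        public void reversesASimpleWord() {
            assertEquals("cba", new StringReverser().reverse("abc"));
        }

        @Test
        public void leavesAnEmptyStringAlone() {
            assertEquals("", new StringReverser().reverse(""));
        }
    }

    // Then just enough production code is written to make the tests pass.
    class StringReverser {
        String reverse(String input) {
            return new StringBuilder(input).reverse().toString();
        }
    }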

Level 5 - Code Coverage

There comes a time in a team's history when a robust set of unit tests is written and maintained, and then one day a big error is discovered in production. A root cause analysis is done, and it's found that the error would have been caught by a simple unit test, but no one realized that the block of code that failed was never executed in any test.

The problem is that no one realized that the existing unit tests weren't testing all of their business logic, or at least not the parts that mattered.  This is what code coverage is for.
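
As a hypothetical example of the kind of gap a coverage report exposes, consider a class where the ordinary path is tested but one branch never executes in any test:

    // Hypothetical class: an existing test exercises the ordinary path, but
    // the bulk-discount branch is never executed by any test, so a bug in it
    // would only surface in production. A coverage tool highlights that
    // branch as untested.
    public class PriceCalculator {

        public double priceFor(int quantity, double unitPrice) {
            double total = quantity * unitPrice;
            if (quantity > 100) {
                // Never reached by the existing tests -> reported as uncovered.
                total = total * 0.9;
            }
            return total;
        }
    }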

When unit tests are run with a code coverage framework in place, statistics are automatically generated that measure exactly which lines of code were executed by the tests. These lines can be counted and compared to the total number of lines in the project, or interactive views can be generated that display the project's code in a way that shows which lines have been exercised by unit tests and which have not.

For Java, there are a few possible choices for frameworks to support code coverage:
  • Cobertura
  • JaCoCo
  • Clover
The last one is a commercial product from Atlassian, and the others are free and open-source. There is an older framework called Emma, which was the core of the EclEmma Eclipse plugin. Ironically, Emma is now considered obsolete, and even the EclEmma plugin now uses JaCoCo under the covers; JaCoCo is in fact maintained by the same team that maintains EclEmma.

With one or more of these frameworks in place, there is acknowledgement and understanding of the need to pay attention to what lines of code are being tested by unit tests.

Level 6 - Unit Tests in the Build

This is another step that may or may not already be in place in organizations that have achieved the previous levels.

If you're writing unit tests, and you're even measuring the coverage of those tests, can all of the developers and managers be certain that all of those tests are passing? You can't be certain of this until the unit tests are run every time the source code is built. This requires adding a step to the required build process that compiles and runs the unit tests and, most importantly, requires all the unit tests to pass for the build to be considered successful. If you don't have this, then management is not fully committed to unit tests.

This level is easy to achieve with Maven. If you have any unit tests at all in a project, Maven notices them, and if any of them fail, the build fails.  This is the default behavior, without adding any configuration at all.  If you're using Ant (what I like to call "build assembly language"), then you have to take manual steps to make this happen (just like anything else you have to do in Ant).

Level 7 - Code Coverage Feedback Loop

At this point, we're likely in a very good state. There's only one missing element related to code coverage, and along with it some general code quality concerns.

We're generating code coverage statistics, but only the developers writing unit tests are looking at the data, and likely only for the  unit tests that they have individually written. In addition, some projects are doing better than others at extending coverage, and there isn't enough awareness of the projects with deficient coverage.

The problem is that these statistics aren't being published anywhere that managers or other developers can easily see them. This is what you get with Sonar. With Sonar in place, not only are code coverage statistics published for easy viewing, it also presents the history of increases or decreases in code coverage for individual or aggregated projects, down to a potentially obsessive level of detail.

Sonar also reports other code quality metrics, using tools like FindBugs, PMD, and Checkstyle, just to name the more prominent ones.

Level 8 - Automated Builds and Tasks

Now that all of these features are in place, it would be a complete mess if they all had to be run manually.

What you want to see happen is that when a developer clicks a button to check in a set of files, an automated process notices the new checkin and starts a set of processes that does something like the following:
  • Checks out a fresh copy of the source tree
  • Compiles the source code
  • Runs the unit tests, generating code coverage statistics
  • Runs a Sonar scan
  • Assembles deployable artifacts
  • Deploys artifacts to acceptance test servers
  • Runs automated acceptance tests
  • Publishes results from all automated deployment and test steps
This is what a continuous integration server is for.  There are several good choices for this, including Jenkins, Hudson, and Bamboo. With any of these in place,  setting up and maintaining all of these automated tasks is very straightforward.