Terrible Testing Case Studies
There’s nothing worse than failing testing efforts. You may have followed all the common testing “best practices,” and your efforts still did little to find critical issues with your application once it went live. Many testers follow a flawed model that leads to terrible testing.
Discover a better way with Todd Gardner as he shares some of his experiences and what he found worked best for him to create testing awesomeness. Listen as Todd tears down testing pyramids and rebuilds with a better-scaled approach.
Listen to the Audio
In this episode, you'll discover:
- Why the testing pyramid is misleading. Avoid the curse of the testing mummy.
- Tips to find out if you're testing the right things
- What RISK has to do with testing
- Why monitoring in production is a must for catching modern application issues
Any sufficiently advanced, #automated #testing system is indistinguishable from monitoring ~@toddhgardner
Join the Conversation
My favorite part of doing these podcasts is participating in the conversations they provoke. Each week, I pull out one question that I like to get your thoughts on.
This week, it is this:
Question: How much does your team factor in Risk when planning your testing efforts? Share your answer in the comments below.
Want to Test Talk?
If you have a question, comment, thought or concern, you can share it by clicking here. I'd love to hear from you.
How to Get Promoted on the Show and Increase your Karma
Subscribe to the show in iTunes and give us a rating and review. Make sure you put your real name and website in the text of the review itself. We will definitely mention you on this show.
We are also on Stitcher.com so if you prefer Stitcher, please subscribe there.
Read the Full Transcript
Joe: Hey, Todd. Welcome to Test Talks.
Todd: Hey! Thanks for having me.
Joe: Awesome. Todd, before we get started, could you just tell us a little bit more about yourself?
Joe: Awesome. I actually did see you at the Oredev Conference in Sweden this year, and I really loved your presentation on terrible testing.
Todd: That's awesome. I love doing that “Talk.” It's basically an excuse for me to get up on stage and rant about all the things I think are wrong.
Joe: It was awesome, even your slides are really cool. I'm going to have a link to that within the show notes, but I really enjoyed it. That's why I want to just talk about some of the things I had thought about as I was watching that presentation.
I loved how you started your presentation by asking the question. You said you always ask the question, “Am I testing the right thing?” It seems like a simple question, but it seems like a question that a lot of people don't ask themselves, so, why do you think that's important to ask before you start just creating tests for your application?
Todd: Are you from Boston?
Joe: Yes. Can you tell? I'm actually from Rhode Island, so, it's a little worse than Boston.
Todd: It was just how you said, “started.” That's awesome.
The question of, “Are you testing the right thing?” I don't think is asked often enough, because it's really hard. It's probably one of the hardest problems in software development to answer, because the fundamental question inside of it is, “What is the risk of this thing that we're doing?” and that's a really contextual question about what your project is. There's all different kinds of risk that come in with software development, because this is an incredibly creative and complex activity that we're doing, but we're also working in an incredibly complex market, or niche, so there's all different things that can go wrong at so many levels. You have to think really critically about that.
For example, starting from the inside: a lot of times, when people talk about testing, developers immediately jump to the idea of a unit test. Unit tests are really good for addressing certain kinds of risk. If you have complex logic inside of your application, or just logic that is going to be changing frequently, unit tests are a good way to mitigate that risk.
As we get out, maybe it's how your system all comes together, that is where the risk is. Maybe you're relying on lots of different microservices or you're reaching out to third parties. Well, writing integration tests to exercise those connections can be a good way to mitigate that risk.
If we go out even further, ultimately, there will be consumers of your app, which will either be other systems or people, so writing system tests to simulate being those customers or those systems, and run full transactions, that's a good way to mitigate that risk, but then there's a whole other level of risk that is almost never talked about in development circles, and that's a market risk, or your customer risk.
This thing that you're doing, the overall project, does anybody care? Is this going to work? One of the examples that I give in my “Talk,” is about this giant project that I was a part of where we spent eighteen months at tremendous burn rates, tremendous numbers of developers working on this thing, for a project that was an ultimate failure. It was a failure, because nobody purchased the product. It ended up being shelfware.
How do you address those sort of risks by maybe building smaller, more focused pieces of software that, hey, maybe it's not as architecturally complete or it's not as “good” as your final version would be, but can you consider the software itself a test for whether or not this project is going to succeed? I think that's a really hard question to ask when you start at the beginning of the project. You have to sit back and everybody's super-excited to move forward, but you have to think about, “How is this thing going to blow up in our face? How can we try and reduce the risk of that happening?” Rather than just jumping into testing and development of the parts that we know, how do you start by testing the things that you don't know?
Joe: Awesome. The main takeaway I took from your Talk was “Risk.” I don't think a lot of people think of risk. They just do things just because it's supposedly a good practice, or a best practice. What I found interesting was, that example you talked about, I think you said they were burning $2,000,000 a month for eighteen months, or something, and they never asked their users, “Is this something you want?” It's kind of scary how, I think, a lot of companies might actually operate that way.
Todd: Giant companies that you would expect to be more cautious than that, but you get into a situation, especially with large companies, where they feel that they understand, like, of course this is going to work. Some person inside the company has bet their reputation on it, and they're well-respected, and they're like, “Yes! We're going to go do this thing! It's going to be awesome! The market's going to make us millions and millions of dollars, and we're going to win!”
Just because your market forecasts tell you that this is going to happen, and even if a customer comes to you and says, “Yes, I think that's a great idea, I'm going to buy that,” until they actually give you those dollars, you don't know that that succeeded until the users start using it. There's lots of different kinds of software. It's not always publicly commercial. Maybe you're building an internal app? In that case, until the user's actually using the product to create the value that you're trying to do, you don't know if this is going to work.
How do you build a piece of software that's as small as possible to generate some value and use that as a test for the larger project, before you invest all of the time and effort and heartache in agonizing about an architecture and a testing structure internally, maybe you could just write something quick and dirty? Maybe you could just pound out a quick Rails app, or an ASP.NET app, or whatever? Just get something small out for one customer and say, “Hey! We launched! Sign up!” If they don't sign up, well, what does that tell you?
Joe: It's almost like the whole lean startup concept, where you do a quick iteration, get it out there, get feedback, and then build on that, based on feedback?
Todd: Exactly. I was totally going with the whole lean startup MVP. I was trying to avoid the term “MVP” a little bit, specifically in the “Talk,” I don't even say it at all, because I feel like that term is very overused, and I don't think a lot of people really understand what it means. Minimum, in a lot of contexts, might be a lot. Minimum doesn't necessarily mean half-assed, and it doesn't mean always-small. It's thinking critically about your product and saying, “What is the smallest thing we could do to deliver value?” I think either people will deliver too little, or they'll dismiss it and say, “Of course we have to deliver the whole project in order to deliver value. That's what we've built the financial case for this thing on!” I think you have to spend a lot of time critically thinking about what subset you can deliver to some subset of users to see if you're right.
Joe: I definitely agree with you. I work for a large, enterprise company that writes medical applications. We have test sites at hospitals, where you have actual users that try the application as we're developing it. Like you said, it's not a “half-assed” solution. We're trying to do a real solution. It has to be, because it's FDA-regulated, but also we're trying to get that quicker feedback, so, that's a great point you bring up.
Todd: Yeah, so, for that kind of thing, where you do have a lot of regulations, you can't cut corners with health care and that sort of thing, but maybe you could build a solution that works for one kind of procedure for one hospital, and say, “Okay, we built it for that hospital. Does it work? Will they buy it? Will they give us some cash in exchange for this to prove that this is a thing?” Then, we'll figure out how to scale it up to 10,000 hospitals, because you don't need to worry about the architecture to scale up to 10,000 until you know you actually have a market for 10,000.
Joe: Yep, so true. I think what you also talked about is the Testing Pyramid, and how you think it's not actually a great example of how to think about testing, per se, because, the main point that you talked about is, it's missing user tests. Can you just explain a little what you mean by, “Testing scales?” I really like that concept.
Todd: When I was first starting, and what a lot of people first read about when we start talking about software testing, is there's this concept called, “The Software Pyramid.” You have a lot of unit tests at the bottom of a Pyramid, so your system should have a lot of unit tests, because unit tests are cheap; a few integration tests, because they're a little bit more expensive, but they test a different part of your system, and a few system tests, and they design this thing as this funny, little diagram that looks like a pyramid, that I imagine you could link up in your show notes, so that people can see that.
I think that that is a flawed model for two reasons. The first is that it misses this whole thing that we've been talking about: market risk. About whether or not the project itself is a good idea, and how do you test that? Which is really operating at a level above system tests.
The second problem that I have with the model is that it's implying volume. It's saying you should have more unit tests than integration tests, because unit tests are cheap to write, and I think that that's missing the point. It's not how cheap or expensive it is to write and maintain. It's what kind of risks are being addressed by those sort of tests.
A lot of web applications that I tend to write today are really trivial. I'm just trying to demonstrate a concept, or I'm just trying to build a CRUD App for some purpose, and I'm basically converting SQL into HTML and back. In those sort of apps, there's very little functional complexity happening. Usually, I have a framework I've shoved in place that's handling, abstracting, all that stuff away. I don't really see a lot of risk in my functional logic at that level, so I tend to not write a lot of unit tests in that situation, because I don't see what it's getting me. If it works, it works, and if it doesn't work, it's not going to work. At that level, I don't have a whole lot of complexity there.
Where I do have some complexity, is typically in an integration level. “Can my app talk to my database?” and, “Am I writing queries correctly?” I'll write some integration tests around that. “Am I presenting a user interface that makes sense?” I'll write some system tests around that. Depending on the app, I might have completely inverted the Pyramid. I might have ten system tests, five integration tests, and one unit test for the one actual, interesting piece of logic I had had in an app, and that totally messes with that metaphor of a “pyramid.” That's why I think it's a “scale,” instead, and one of the things I talk about a lot in the “Talk” is, rather than prescribing how many of one kind of test you should have over another, you should look at, say, “What kind of risks exist in my system? How can my system break?” and build up the testing coverage that you need in each of those areas to address the risks of the system.
Joe: That was the beauty of your presentation. You actually worked on projects where they had, technically, 100% code coverage, and they still failed! How can that be true? How can you have 100% code coverage and the tests still fail?
Todd: I have worked on a couple of different projects where a certain level of code coverage was enforced. When you put a metric in place, saying, “We will have 80% code coverage,” or, “We will have 100% code coverage,” you're not actually saying anything about how much quality you want in your code base. Simply, how many lines of code must be executed as a result of running tests? That comes down to, it puts a force on the culture of your team. If you have a team that doesn't really believe in testing the same things that the management who put this requirement on you says, then you're just going to get a bunch of shortcuts.
In one particular example, I was part of a project that would do what I'd call, “Assertion-free testing,” where we would loop over the objects, and just call every function and bury the exceptions that come out of them, and we'd hit 100% code coverage. It was super-easy to hit 100% code coverage. It's just, it didn't test anything. The app was a total disaster.
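To make the “assertion-free testing” pattern concrete, here is a minimal Python sketch. It is not the code from the project Todd describes; the `Invoice` class, its deliberate bug, and both test functions are hypothetical, chosen only to show how every line can be executed while nothing is verified.

```python
import inspect


class Invoice:
    """Hypothetical class with a deliberate bug in total()."""

    def __init__(self, subtotal=100.0, tax_rate=0.2):
        self.subtotal = subtotal
        self.tax_rate = tax_rate

    def total(self):
        # Bug: tax is subtracted instead of added.
        return self.subtotal - self.subtotal * self.tax_rate

    def describe(self):
        return f"Invoice for {self.total():.2f}"


def assertion_free_test(obj):
    """Call every public method and bury any exception.

    Each method body runs, so a coverage tool counts it as covered,
    but no result is ever checked.
    """
    called = []
    for name, method in inspect.getmembers(obj, predicate=inspect.ismethod):
        if name.startswith("_"):
            continue
        try:
            method()
        except Exception:
            pass  # swallow the failure, so this "test" can never fail
        called.append(name)
    return called


def real_test(obj):
    """A test with an assertion: 100 plus 20% tax should be 120."""
    assert obj.total() == 120.0


print(assertion_free_test(Invoice()))  # runs every method, checks nothing
try:
    real_test(Invoice())
except AssertionError:
    print("real_test caught the bug")
```

Both functions execute `total()`, so both contribute the same coverage. Only the second one can fail when the arithmetic is wrong.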
On a more recent project, we actually had 100% code coverage mandate. It's really hard to get those last few percentage points. You end up testing a bunch of really stupid code that's like, “Of course this is going to work. I'm testing a property getter! Why would this ever fail? This is stupid.” But you'd have to do it in order to hit that arbitrary metric.
If you think about all of the different numbers, if you come onto a project and you ask, “What's the code coverage number that we're working for?” There's only two numbers that they can respond with that tell you anything at all about the organization. If they tell you, “Zero. We have 0% code coverage,” that tells you one thing. That tells you that they're not testing. Doesn't necessarily tell you anything about their quality. It just tells you that there's some risk here, maybe it's high, maybe it's low, but they haven't addressed it, so we should talk about that more.
On the opposite end of the spectrum, if they tell you, “We have 100% code coverage!” All that really tells you is that they have an organization that has incentivized that metric. It doesn't actually, again, tell you anything about the quality of the code base, because they could be hacking in tests. They could be developing giant, mock objects that just totally subvert any real logic in the application. They could be doing all sorts of bad practices of their tests that could not tell you anything about the software itself.
Any other number in between tells you nothing at all for all of these same reasons. The code coverage is just a reflection of how big the tests are, not how good the tests are or what risks are being addressed, and I don't feel like there is any quantifiable metric to do that. Testing ultimately comes down to the commitment from the team, whether they think it's valuable or not, and what they're trying to address with those tests.
Joe: It's a great point. I don't know what I feel worse about, 0% code coverage or 100% code coverage, because I've worked with teams that show me a dashboard and go, “Look! We have 98% code coverage,” and I know those tests are useless, most of them. It's almost like they're giving themselves false confidence without actually thinking, “Are we actually testing the right things? The risky things?” like you said, and that's probably something I think is a mind-shift that a lot of people need to go through.
Todd: Right, because those tests aren't free. When you say, “I want to add a test to the system for X,” what you're trying to do is address the risk of something breaking, but you've also introduced cost and complexity. Anytime I want to change that part of the system, I have the tests to be confident, but I also have to keep this test up to date. What I found is that, when you get into teams that have arbitrary high levels of code coverage, you've just created a ton of busywork. You've created a ton of resistance to making further productivity, because the tests, once you get up that high, they probably aren't giving you that much more material confidence in your system, but they are creating a lot of resistance to making change. You make what you feel is a trivial change to a code base, and six tests break because you violated their mock expectations now. That's just so frustrating, to work in those sort of environments.
Joe: Another thing you brought up, and I didn't look it up yet, was, “Test-induced design damage.” In a nutshell, what is test design damage? Is this due to over-testing, or using too many mocks for something that's really not testing the real application?
Todd: This was a conversation that came from DHH. For those not familiar, “DHH” is one of the co-founders of a company called “37signals,” and the creator of “Rails,” the popular web application framework. In 2014, I believe, at RailsConf, he gave a keynote, and a blog post that came out at the same time, called, “Test-Induced Design Damage.” It was a little provocative, it got a bunch of people kind of worked up. It didn't fully address the topic, but the point he was trying to make is largely that he doesn't feel that it is necessary to change his code to accommodate the tests. When he can't test something because of how it's written, he's not willing to change it just to write a test for it, because it muddles the design of the code that he thinks it should be put in. The point he was trying to make is that, people who advocate for TDD and heavy unit-testing structures are damaging their internal code to make it fit within that mindset.
There was tons of back-and-forth and a large, online debate that happened during this time, but what came out of it was this video series that ThoughtWorks put together. It's all out on YouTube. I think there's three or four half-hour parts, and it's DHH, Martin Fowler, who's from ThoughtWorks and who's written tons of high-level books, and Kent Beck, who's the originator of XP and test-driven development, a proponent of all those ideas. The three of them got on a video call together and debated this concept, and I think those debates are fantastic! They illustrate Kent Beck's motivations behind it, how he's shooting for confidence, and he's trying to address the fear of changing his systems. Martin Fowler brings up the idea of classical testing, or “Classical” TDD versus “Mockist” TDD, and the pros and cons of using mock objects, and I think DHH has a really powerful concept that it doesn't matter what your code looks like and what your testing structure looks like. These are just interim things. What matters is the software we ship, the value that's created at the end of the day.
The perspective of his that I really liked was that, ultimately, he shipped Rails, and he shipped Basecamp, and he shipped a ton of really valuable products that you can't argue that these are bad pieces of software because they don't follow TDD and they don't have these testable architectural patterns in them. They're valuable because, literally, millions and millions of people use these pieces of software every day, so I think there's some power to his argument there, that maybe these things don't matter quite so much as we think they do.
Joe: Maybe I'm wrong, I didn't get away thinking you're saying testing is stupid, don't test. It's just, test the right thing. You want to test smart, not just test for the sake of testing, I would think.
Todd: Right, I'm definitely not saying that, and I don't think DHH was saying that, either, because there's definitely cases where you have some scary piece of logic somewhere in your code that tends to break and tends to be a source of bugs. You should absolutely have tests around that to make sure that you understand how it works and you are confident to make changes in how it works.
The problem that I think DHH was bringing up, and I'm certainly bringing up, is that applying the ideas blanketly across your app, that we'd have to TDD everything, and we'd have to have unit tests on everything, is a waste of time, because there's certain parts of your app that are trivial to understand and are not going to break. There are parts of your system that are well understood and that you heavily rely on, like the underlying frameworks or languages or runtimes that you're operating in, where I just don't need the test coverage, but I think that that's up to each team and each application to figure out and make the decisions on those bits. What are the parts of your code that are risky, that concern you, that tend to break? Have good tests around that, that actually address those risks and give you the confidence. If the part doesn't scare you, if anybody on your team can look at a piece of code and be like, “Duh, of course that's what it does,” and it doesn't break, and this just works, why are you wasting your time writing tests for that? Move on to the next thing.
Joe: I think a lot of people have heard about shift left, but I've been hearing more about shifting right, and I think this kind of goes into one of your points, maybe I'm wrong, but with a lot of architectures like microservices, you don't necessarily know how your user is going to interact with your services. How they're going to use them, per se. The only way really to get that is to actually almost test in production, or monitor in production, and I've been hearing more and more about predictive analysis, where you actually monitor in production, and you're able to roll back based on predictive analysis. You think these users are going to start having issues, but before they know there's an issue, you've been monitoring it, so you roll it back.
It's almost like performance testing. You never can really performance test correctly unless you're in the production environment. Some of these things, you can't really test anyway, so maybe we need to invest a little bit more in our after-production. After we release the code, it's not over. We need to then also start including some sort of monitoring solutions for these things. Do you agree with that? Is that something you've been seeing more and more of?
Todd: I absolutely agree with that. I can't remember where I heard this quote, and I'd love to attribute it to somebody, but I'm just going to repeat it anyway and say that it wasn't me but it's really clever. “Any sufficiently advanced, automated testing system is indistinguishable from monitoring.” If you get to a really advanced set of tests, where you are exercising real code paths on real systems and you can predict when things are going to fail … well, that's kind of the same thing as testing it all the time in production, which is monitoring. I think, in a lot of ways, it's better than testing. You can write tests for the parts of the system that you think are going to fail, that you can conceive, “Hey, I think this can break. I'm going to write some tests around it,” and the time to write the tests and maintain the tests and run the tests fits within our idea of the budgets and the time-constraints of the system.
You can never test everything. It doesn't make financial sense to test everything. There's always some parts of your system that you can't adequately test for, nor should you, because it's just not financially worth it. That's where monitoring comes in.
When you put your system out into production, you should put proper monitors around what's the user doing, what's the system doing, how is it acting? So that when a user runs into trouble, you can react to it very quickly. When you put your system into production, and the user starts interacting with it, it can fail in all kinds of ways that you can't predict, so monitoring your application at all the different levels lets you get a better understanding of how your system really breaks. If you built your system in such a way that you can react to those, if you emphasize monitoring as an important part of your system up front, you can react to those monitoring things and either roll back your system, if it's a result of a current change, or quickly patch a bug and put it out.
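One way to read “a sufficiently advanced test is indistinguishable from monitoring” is that the same check can run once before release or repeatedly against the live system. The Python sketch below illustrates that idea under stated assumptions: `checkout_total`, `check_checkout`, and `run_monitor` are hypothetical names, and the alert callback stands in for whatever paging or rollback mechanism a team actually uses.

```python
import time


def checkout_total(prices, tax_rate):
    """Hypothetical piece of the system being watched."""
    return round(sum(prices) * (1 + tax_rate), 2)


def check_checkout():
    """One check, usable as a pre-release test or a production probe."""
    return checkout_total([10.0, 5.0], 0.2) == 18.0


def run_monitor(check, alert, runs=3, interval_seconds=0.0):
    """Run the check repeatedly and alert on every failure.

    Returns the number of failed runs.
    """
    failures = 0
    for _ in range(runs):
        try:
            ok = check()
        except Exception as exc:
            ok = False
            alert(f"{check.__name__} raised {exc!r}")
        else:
            if not ok:
                alert(f"{check.__name__} returned a wrong result")
        if not ok:
            failures += 1
        time.sleep(interval_seconds)
    return failures


# As a test: run it once before shipping.
assert check_checkout()

# As a monitor: run it on a schedule and route failures to an alert.
alerts = []
print(run_monitor(check_checkout, alerts.append, runs=3), alerts)
```

The check itself does not change between the two uses. What changes is when it runs and what happens on failure.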
As a real life use case of that, with my own system that I build alongside with my team at TrackJS, we don't test everything. We test the parts that we think can break, but there's always parts that we missed, but we heavily monitor our own system. When a customer runs into a problem, we treat that as such a high priority that when an exception happens in our system, we don't just dump that into a log somewhere. We dump it into our active chat channel that the entire team sits in. It's like, “A customer just ran into this problem. Type this exception into the system or into our main log.” Which can sometimes be noisy; however, we try to address and smash down those issues as fast as we can.
We find out a customer had a problem, we dig into the issue, if it's a false-positive, we add the appropriate rules to not throw those kinds of exceptions and push out that change. If it is a real issue, we reach out to that customer, typically within an hour, apologize that this issue happened, explain what's going to go on, and we can typically get out a patch for that problem within six hours. Built, tested, and released to production.
Because we've emphasized those parts of our system, we're able to do those fast turnarounds, and because our target audience is developers, they tend to respect that fast turnaround, and we get tremendous customer loyalty from the people who have experienced bugs in our system because we gave them that high level of service. We responded right away, we apologized right away, and we fixed it and we showed how fast and responsive we are as a company. They really appreciated that. In a way, building a system that is fast to repair is almost better than having a system that doesn't break.
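The workflow described above (exceptions go to the team's chat channel, confirmed false positives get a rule so they stop showing up) could be sketched roughly as follows. This is an illustration only, not how TrackJS does it internally: `ErrorRouter`, its methods, and the list standing in for a chat channel are all assumptions.

```python
import re


class ErrorRouter:
    """Send unexpected exceptions to a team channel, skipping known noise."""

    def __init__(self, post_to_chat):
        self.post_to_chat = post_to_chat  # any callable that accepts a string
        self.ignore_patterns = []

    def ignore(self, pattern):
        """Add a rule for an error confirmed to be a false positive."""
        self.ignore_patterns.append(re.compile(pattern))

    def report(self, exc, customer):
        """Return True if the error was posted, False if a rule suppressed it."""
        text = f"{type(exc).__name__}: {exc}"
        if any(p.search(text) for p in self.ignore_patterns):
            return False
        self.post_to_chat(f"Customer {customer} hit {text}")
        return True


channel = []  # stand-in for a real chat integration
router = ErrorRouter(channel.append)

router.report(ValueError("bad date"), "acme")             # real issue: posted
router.ignore(r"^TimeoutError: heartbeat")                # rule added after triage
router.report(TimeoutError("heartbeat skipped"), "acme")  # now suppressed
print(channel)
```

Keeping the ignore rules next to the reporting path means each false positive is dealt with once, which keeps a noisy channel readable.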
Joe: That's a good point, and you actually practice what you preach. You have a real application that you're actually doing this on. I think you have a really good experience with it and it sounds like it's working for you. I don't know if it'll work for everyone, but it sounds like, like you said, it depends on risk and how much risk you're willing to take with your applications, depending on what they're doing.
Todd: Absolutely. For your case, you're working in healthcare. You have a whole different risk profile than what I'm doing. You can't necessarily have the same level of open-risk when you go to production that I feel comfortable with. We have different parts of our system, and each part of our system has different levels of risk, but for our UI, I'm comfortable with a little bit more risk in that, because the worst case scenario is somebody might have a clunky experience in the UI. We record the error and we can patch it really fast, but no data is lost, nobody is harmed, and so we're comfortable with a higher level of risk there.
We have a different part of our app that our customers actually include in their applications, and I'm not as comfortable with a high level of risk there, and we have a far more robust testing structure in that part of our application. For a healthcare app, if you are, say, building a healthcare device, I would want you to have a very low tolerance for risk. I don't want you shipping a version of a pacemaker that has a high level of risk in it. I want you to be really sure that that works.
The overarching point of my Talk and my message is that it depends. I'm not saying, “Hey, Todd said you don't need to test! Go ahead and push it into production – it's fine!” That's totally not what I'm saying. What I'm saying is, that you need to think critically about your application and your tolerance for risk, and build a testing structure and a monitoring structure that accommodates what you're willing to do as cheaply as possible.
Joe: I know we mentioned TrackJS a few times. What is TrackJS?
Todd: The big thing that we think is different about our service is the telemetry timeline. We track a bunch of things that are going on in the application before the error happens, so we know the messages that are being logged in the console. How your application state changes. I know what Ajax events are happening. How is the client communicating with the server? I know what the end-user is doing. I know what they are clicking on, or what input fields they are interacting with, so that when an exception happens that we need to report, I give you a bunch of information about that error, but I also tell you how the user got to this error. They got to this error because an IE9 user clicked on this page, entered an email address in this input, and clicked on this button. We feel like our error reports give a tremendous amount of information back to the web developers or operations teams to know how the web application is failing in the hands of real users, so you can respond to them and fix those problems fast.
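The general idea behind a telemetry timeline, keeping a short rolling history of events and attaching it to an error report, can be sketched with a bounded queue. This is a Python illustration of the concept only; it is not TrackJS's API or implementation, and the class name, event categories, and sample events are all hypothetical.

```python
from collections import deque


class TelemetryTimeline:
    """Keep the most recent events so an error report shows what led up to it."""

    def __init__(self, max_events=10):
        # A bounded deque drops the oldest event when it is full.
        self.events = deque(maxlen=max_events)

    def record(self, category, detail):
        self.events.append({"category": category, "detail": detail})

    def build_report(self, error_message):
        """Bundle the error with a snapshot of the recent timeline."""
        return {"error": error_message, "timeline": list(self.events)}


timeline = TelemetryTimeline(max_events=3)
timeline.record("console", "app started")
timeline.record("user", "clicked #signup")
timeline.record("user", "typed in input[name=email]")
timeline.record("network", "POST /signup returned 500")

report = timeline.build_report("TypeError in submit handler")
for event in report["timeline"]:
    print(event["category"], event["detail"])
```

With a limit of three, the oldest event ("app started") has already dropped off by the time the report is built, so memory stays bounded while the steps closest to the failure are preserved.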
Joe: Awesome, sounds really cool. I haven't seen it yet, but I'm definitely going to check it out. Sounds like a really great piece of technology to especially help you, like you said, to catch those things that you probably couldn't even test if you tried. It's something that's actually, you're augmenting, almost, your coverage, using this type of tool.
Todd: You should check out TrackJS. We offer a free thirty-day trial, so you can put it in your app and see if it works for you. You can find that at TrackJS.com. You can contact me. I'm Todd@TrackJS.com. Feel free to email me with questions or calls of heresy, or telling me I'm terrible and I'm wrong and I should be TDDing everything, or you can argue with me on Twitter. I'm @toddhgardner.