Episode Transcript
Richard Lawrence: Story points. They're used by agile teams everywhere, and they're frustrating for a lot of team members. There's one important move that makes story points an easy and accurate way to estimate, but most teams don't do it that way, and points just end up becoming this weird agile unit for time.
Peter Green: If you caught our last episode on Reference Class Forecasting, and you've been around agile software development for any length of time, you probably noticed the connection between story points or planning poker and Reference Class Forecasting.
Richard: In this episode, we're gonna look at how story points can be a case of Reference Class Forecasting, or RCF, and how you can experience the benefits of RCF at the user story level and avoid some of those common psychological issues around estimating. We'll also show how the way people often use story points misses the, well, point.
Peter: Mmmhh.
Richard: Yeah, you see what I did there?
Peter: Nice.
Richard: And it, uh, fails to get those benefits from Reference Class Forecasting, and we'll share some ways to fix the common issues.
Peter: So we just did a big episode on this, but I'll give you a quick overview of what Reference Class Forecasting is and how it overcomes some of the challenges, like optimism bias, in project estimates.
The big idea is that most of us, when we estimate a project, we look at all the components, we break 'em down, we look at each piece, we estimate it, then we add it all up. The problem with that is that we assume that everything will go as planned, that we won't have these compounding delays. The reality is that never happens.
So what Reference Class Forecasting does is ask us to, instead of looking at the inner pieces of it, the inner view, to say, well, let's just look at projects that are kind of like this. How long did those take, historically? Let's get a bunch of those, a class of initiatives, projects. Let's put it in a database and let's just say, when we're doing this thing, let's find something that was like it, and use that for the forecast for that project.
Richard: But once you're into a project, it's useful to know how much will fit in a Sprint, or how far we're going to get down our backlog this quarter. Those are real questions that teams and their stakeholders need answers to. So we need some way to estimate without having to ask, "what are all the little tasks we're going to do? How long are they gonna take?", add them back up, and immediately fall back into all of the problems that we were trying to avoid with Reference Class Forecasting at a higher level.
This is where story points come in, when you do relative estimating at the user story level. It looks a lot like RCF if you have good past examples: these things took about the same amount of time, these things took a bit longer, these things took quite a bit longer. You could even put numbers on those if you have a ratio between them. And you can say, this one is more like these; this one's more like these other ones.
Now we're not doing the bottom up inside view kind of estimating. We're doing a little bit more like the outside view, even though it's more granular.
Peter: So the key move here that most teams skip or just don't do well is to get a good set of past reference stories of different sizes, organized into groups of like items. We used to call these baseline stories, and we had 'em written up on the whiteboard. These are what ones look like, these are what twos look like, these are what threes look like. Then, when you're pulling up the next one to estimate, you just ask: what class is it in? Oh, this looks a lot like those other twos.
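That baseline-story move is simple enough to sketch in a few lines of code. This is purely illustrative: the story names, groupings, and point values below are made up, and the point of the sketch is the lookup-by-likeness, not any particular tool.

```python
# Hypothetical baseline stories: past work grouped by how long it actually took.
# The point values are relative labels for the groups, not hours.
baseline_stories = {
    1: ["Add a field to the profile form", "Fix a copy typo on checkout"],
    2: ["New sort option on the order list", "Basic CSV export"],
    3: ["Password reset email flow"],
    5: ["Third-party payment provider integration"],
}

def estimate(new_story: str, looks_like: str) -> int:
    """Estimate by reference: return the group the similar past story sits in."""
    for points, examples in baseline_stories.items():
        if looks_like in examples:
            return points
    raise ValueError(f"{looks_like!r} is not a baseline story")

# "This looks a lot like that CSV export we did" -> same group, same points.
print(estimate("Export orders as Excel", looks_like="Basic CSV export"))  # → 2
```

Notice there's no task breakdown and no hours anywhere in the lookup; the only question the function answers is "what is it like?"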
The smaller the piece of work you're estimating, the easier it is to start thinking in terms of tasks and time. If I have something that's pretty small, I can say, ah, that looks like just a day or two of work. That's the risk we fall into. So we have to be careful to just keep it to that, "what is it like, what is it like, what is it like?"
Richard: Yeah, that is the key move. There's nothing magical about the points. It drives me crazy when people use the points, but then they turn it into bottom up estimating with this extra layer of abstraction. Like, what are all the tasks we need to do? Okay, that's gonna take about this long. When something takes about that long, how many points do we give it?
It completely misses the point and just makes everything harder, and I think that's why people are so frustrated with this on so many teams. It's just extra work.
Peter: The human tendency to want to do that inside view, that bottom-up estimation approach, is so strong that it's really hard to overcome. So it's really important to ensure that if you're gonna use story points, you follow the principles of Reference Class Forecasting. Don't bother to do it if you're not. Otherwise, you're getting kind of the worst of both. You get this cynical approach to story pointing that says, well, we all know on our team that a five point story takes 7.2 person hours.
That's not what it's about.
Richard: And I used to think that doing any mapping between points and time was a problem. But the more I've learned about the research behind Reference Class Forecasting, the more I've realized it's not the fact that you're mapping, that's always happening with estimating, even if you're doing reference class. It comes down to the direction of the mapping.
So if you're saying, I think this will take X hours or X days, and therefore it's Y points, you're just doing bottom up estimating, and story points are a wasteful abstraction. On the other hand, if you're saying, I think this story is like these other ones, therefore it's X points, and since it's X points, it's likely to take the same amount of time as those other ones.
Now you're getting the benefits of Reference Class Forecasting at the story level. The easiest way to avoid getting it wrong is to generally avoid doing that mapping at the individual story level, and just do it for larger collections of things: a sprint, a quarter. You probably don't wanna go further than that at the story level. And the further out you go, the more you may want to think about bringing in something like the risk adjusted forecasting that Jim Shore describes on his website and in his book.
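One way to picture that "aggregate only" direction of the mapping, with made-up numbers: points get connected to time only across a whole sprint, through historical velocity, and no individual story ever gets a time estimate.

```python
# Hypothetical historical data: points completed in each recent sprint.
recent_velocities = [21, 18, 24, 19, 20]

# The points-to-time mapping happens only here, in aggregate: the average
# velocity says how many points tend to fit in one sprint.
avg_velocity = sum(recent_velocities) / len(recent_velocities)  # 20.4

# Relative point estimates for the next stories in the backlog, in priority order.
upcoming = [5, 3, 8, 2, 5, 3]

# Walk the backlog until the sprint is "full" -- no per-story hours anywhere.
total, fits = 0, []
for points in upcoming:
    if total + points > avg_velocity:
        break
    total += points
    fits.append(points)

print(fits)  # → [5, 3, 8, 2] (18 points fits under the 20.4 average)
```

The direction matters: similarity produced the points, and only the sprint-level sum gets translated into a forecast.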
Peter: Richard, we've been doing this for a long time. We've seen teams do it really well. We've seen lots of teams do it poorly, and we've seen teams that are frustrated and want to fix it. What are some of the most common questions you get and how do you answer those questions?
Richard: Uh, I'll start with a couple common questions that I get. One that I see a lot: usually people are trying to do this relative story estimating as a whole team. This is where practices like planning poker come in. Sometimes it's even just a conversation. What do we think this is?
And the question I get is, what do we do when we're always debating between two adjacent sizes? Like we can never agree if this is a five or an eight, what should we do?
And my advice in that situation is simple: when you notice that you're getting stuck and nobody's persuading anybody else, just go up. Can't agree that it's a five or an eight. It's probably between five and eight, so go up to eight.
Like we talked about in the last episode, errors tend to accumulate on the "things are longer than we expect" end. There's not a normal distribution of overruns and underruns. It tends to run over, and check out the last episode for more about that.
So if you can't agree, go up. That's probably closer to right most of the time.
Speaking of numbers, I used five and eight. It's common for people to use this modified Fibonacci sequence: 1, 2, 3, 5, 8, 13, 20, 40. People will ask, "do we have to use that Fibonacci sequence?" And my answer is no. You can use any numbers you want, but it is nice to have a scale that gives you a handful of options within a single order of magnitude.
Because humans are pretty good at relative comparisons between like one and 10, and not so good when it goes beyond that. So powers of two gives us four numbers. 1, 2, 4, 8. Fibonacci gives us five in that range, so we get a little more precision, but it gets less precise as we get bigger, which I like.
The difference between two and three is 50%; that matters. The difference between eight and nine is only 12.5%.
Peter: Good "mathing".
Richard: Yeah. Thank you. I'm glad I got that right on the spot. And that's within the margin of error for something that large. So I like that the spacing gets larger as the numbers go up. But no, you don't have to use it.
I do recommend you use numbers, though. Sometimes people use t-shirt sizes, and there's no predictable relationship between them. Like, a medium shirt isn't two small shirts. And that makes it hard to do the aggregation that keeps you out of the bottom-up view.
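Richard's scale math is easy to check for yourself. A quick sketch comparing the relative step sizes of the two scales he mentions: modified Fibonacci keeps every jump to the next size at roughly 50% or more, so adjacent sizes always differ by more than typical estimation error, even as the absolute gaps widen.

```python
# The two candidate point scales from the discussion.
fibonacci = [1, 2, 3, 5, 8, 13, 20, 40]  # modified Fibonacci, common in planning poker
powers_of_two = [1, 2, 4, 8]

def step_sizes(scale):
    """Percent increase from each value on the scale to the next."""
    return [round(100 * (b - a) / a) for a, b in zip(scale, scale[1:])]

print(step_sizes(fibonacci))      # steps range from ~50% up to 100%
print(step_sizes(powers_of_two))  # → [100, 100, 100]
```

Compare that with consecutive integers at the high end: an eight-versus-nine debate is arguing over a 12.5% difference, which is noise at that size. Fibonacci simply removes that argument by not offering a nine.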
Peter: Right. Um, another question I hear a lot is related to who's gonna do the work. If so and so does it, it'll go really fast, but if this other person does it, it's gonna go slower. And you're heading back to that inside view. Like, you're starting on the inside and breaking it down and saying, how long will this part of it take, and that part of it take?
Instead just go back to the reference class and find the like item. What is this thing like? And look at the historical data. If you're looking at the aggregate of data over multiple sprints, you're gonna see that average out and that's gonna be the best way to estimate it.
Richard: Yeah. And, and that one is really good at avoiding a common issue, which is ignoring things like testing. I think it's gonna be really fast. And it turns out, yeah, it was fast to program, um, maybe even faster now with your AI agents doing it for you. But getting all those tests right and making sure it works for all these different cases may actually be most of the work and your historical examples are going to contain that variation much better.
Speaking of people doing the work, uh, another question I get about that is can we just have one person estimate on behalf of the team? Can we just have the dev lead do this for us?
There's the math side of that, and then there's like the human collaboration side of it. The more I've seen the research around Reference Class Forecasting, the more the math answer is, yeah, it's probably fine to have one person do it. It's probably gonna be good enough if your reference class database is good enough.
However, I think there are good human reasons not to do that. I think it's good to hear from all the people who are going to have to live with the estimate. I think it's good to get different perspectives into it. And if you're approaching backlog refinement like we recommend, where you have smaller groups doing a lot of the things like story splitting and getting details on things, then estimating becomes the place where everybody gets realigned after that work. The estimate conversation is actually a check for alignment, getting information back out to the whole team.
So I would suspect that if you have one person doing a lot of the estimating, you're not going to discover misalignment until you actually get into the work, or even review the work. And that's probably later than you want.
Peter: Another challenge I see, Richard, is that sometimes we have teams that are not structured the way that we would recommend. They, they're not really doing user stories, they're doing some functional piece of the work, a programming team or a testing team or a design team, and, you know, that's not what we'd recommend. It's kind of a separate problem, but they're not really working on user stories.
And so they say, well, can we use story points? 'cause we don't have actual user stories. And you know, we'd advocate that you locate stories within a single team and build a cross-functional team around that. But you can still estimate if you have reference classes of previous work.
Reference classes are just past examples of the same kind of thing, right? So making that reference class a story encapsulates more complexity. But you can do the same thing with functional tasks.
Richard: And if you wanna learn more about how to get good stories, because sometimes the reason you don't have stories is just that it was hard to break larger things down into stories.
So check out our story splitting guide on our website, or join me in a CSPO or Advanced CSPO program, where we go into much more about how to structure a backlog in a useful way so you get increments of value, resolving complexity little by little.
Peter: And if you want help improving your approach to estimation, from the story level up through the initiative and release level, you can schedule a free coaching call with Richard or me on the contact page on our site.
We teach how to use both levels of Reference Class Forecasting, the initiative level and the story level, in our CAPED approach. And if you're interested in learning more, check out the CAPED page on our site. If you're ready to bring some CAPED sanity to your team, register for our upcoming Certified CAPED Consultant training. We'll drop a link to those things in the show notes.
And of course, if you get value from the show and you wanna support it, the best thing you can do if you're watching on YouTube is subscribe, like the episode, and click the bell icon to get notified of new episodes. Drop us a comment with what you've seen work well for estimating with story points.
If you're listening on the podcast, a five star review is really helpful to us. It lets other people know about the show. We'd appreciate your help there, and thanks for tuning in to this episode of The Humanizing Work Show. We'll see you next time.