The art of building and maintaining (computer + people) systems that keep solving real problems

I’m sure you’ve noticed that we have a problem with systems that no longer work, or never worked, and there’s no “obvious” (read this as “easy”) solution. And by systems, I mean the entire organization of people, technology, and processes; it’s easy to point out single flaws, but if the system doesn’t solve the problem, are the flaws the cause, or is it something deeper?

Here’s a first draft of some thoughts on the matter, from the perspective of someone who has worked in many organizations over several decades. Let me know if it resonates with you.

For example, what do you do if the strategic ‘big picture’ isn’t clear? This happens frequently in large institutions with legacy systems, because ‘it has always been that way’ and ‘no one knows.’ It also happens in startups, as they strive to find their niche. How do you find, or discover, that ‘big picture’? I would argue that the answer needs to be a system that can evaluate its circumstances and adapt accordingly.

To that end, here’s what I would consider a checklist from the perspective of building and maintaining all kinds of systems. Be careful about being trapped by the assumption that you’ve covered it with your first answers; you need to reassess periodically, and be open-ended and flexible. I’ve also included some follow-up questions for digging in deeper and questioning quick assumptions.

  1. Why are we doing this? No, really, why this, and not something else?
  2. Is it working, overall? Do we check in on that regularly and learn?
  3. What isn’t working? Does that matter? How do we manage in the meantime? When will it matter or overwhelm us? What should it look like when it works?
  4. How does it fit with the other things we are doing?
  5. Is there a provably better way? If not, how can we make one?
  6. Are we committed to learning from our mistakes, instead of punishing a few people and doing the same thing over and over?
  7. What, exactly, are we doing? Is that clear to everyone, and do we all understand our roles in making progress?
  8. What are the next steps?
  9. What outcomes do we want to see? By when?
  10. How will we know when we are finished? How do we know if it’s working?
  11. What comes next?
  12. Can we adapt to changing circumstances?

In my experience, many legacy systems, governance approaches, and organizations are confused about these issues, and they settle for ‘we have always done it this way.’ They especially miss the last two points, and assume that those somehow come from outside of ‘the system.’ If you cannot monitor, operate, and successfully adapt a ‘finished’ solution, it’s not finished. But you might be finished as an organization. Not now, but soon.

This is the trap that (for example) almost all employment insurance systems have fallen into. It’s fashionable to pick on government, but that’s not what I’m doing here; it’s just a visible example. The code is legacy and barely maintainable; the staff are primarily operations (and truly doing their best!); the laws that caused these systems to be created, and the way that they are operated by people of mixed skills, are antiquated. But if a replacement or supplemental system is built without this checklist in mind, it will suffer the same fate.

This is also true of the laws themselves; if they aren’t working, then they need to be revisited. To pick on employment insurance again: in many parts of the world, if you earn more than a small amount of money while receiving employment insurance, there is a massive and regressive claw-back. This means that the very lowest income bracket is effectively taxed at the *highest* marginal rate, because of an antique Dickensian view that somehow working and receiving any kind of assistance are incompatible. But the results speak for themselves; people eligible for employment insurance payouts always have to think “Is this a full-time job so I can afford to lose my benefits, or do I have to take a gamble and take something smaller and hope I can make it bigger, while losing my benefits?”

As a society, and even as the people who design these systems, we have to think: is it right that we ask the most vulnerable to take the biggest risks? How can we prepare our work so that this can change, if we can get the larger strategy changed? How would we know if it would be better? How can we mitigate undesirable effects? Should we run the legacy system in parallel and cut over suddenly?
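To make the claw-back arithmetic concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the function names, the weekly benefit of 400, the exempt amount of 50, and the claw-back rates are hypothetical and are not taken from any real program’s rules.

```python
def net_income(earnings, benefit, exempt, clawback_rate):
    """Total take-home when the benefit is reduced by `clawback_rate`
    for every unit earned above `exempt` (hypothetical rule)."""
    reduction = min(benefit, max(0.0, earnings - exempt) * clawback_rate)
    return earnings + (benefit - reduction)


def effective_marginal_rate(earnings, extra, **rules):
    """Share of `extra` earnings that is lost to the claw-back."""
    before = net_income(earnings, **rules)
    after = net_income(earnings + extra, **rules)
    return 1 - (after - before) / extra


if __name__ == "__main__":
    # Hypothetical numbers, for illustration only.
    rules = dict(benefit=400, exempt=50, clawback_rate=1.0)
    for earned in (0, 100, 200):
        print(earned, net_income(earned, **rules))
    print(effective_marginal_rate(100, 100, **rules))
```

Under these assumed rules, someone who goes from earning 100 to earning 200 takes home exactly the same total, which is an effective marginal rate of 100% on that extra work. Changing `clawback_rate` to 0.5 halves it. The point of keeping such a rule as a small, testable function is that a policy change becomes a parameter change rather than a rewrite.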

On the software side, your organization, or your outside vendor, should be able to reproduce and update your system on demand, from source code. This was true when Joel Spolsky wrote about this 20 years ago, and it’s even more true now.

Automated build and deployment technology exists, it’s inexpensive, and it’s a game changer. If you can’t do this, then you don’t control your fate. I have seen this many times across my career, working inside large organizations, advising organizations large and small, and working inside small and growing startups.

The same is true of the people and process parts of the system; your software components should help guide the changes that need to happen in people’s behavior, and in how customers or clients interact with the system as a whole.

Here’s a checklist of questions you should be asking about operability and adaptability:

  • What happens when your application or system breaks, or a new impactful requirement emerges?
  • Who are you going to contact?
  • Have you made those arrangements?
  • Are your team and your vendors’ teams staffed and equipped appropriately to respond?
  • How long does it actually take to field a fix or a needed change, and how much instrumentation and automated testing do you have in place so that you know before the users do?
  • Do you and your employees use it regularly to see how it works?
  • Do you know that it’s working *right now*? How, or why not? Why isn’t that visible and automated?
  • Can repairs and changes be automated and accelerated? If no, what would it take?
  • Are all of the changes visibly highlighted, so that people know what to do?
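
As one way to approach the “is it working *right now*?” question above, here is a minimal sketch of an automated check runner in Python. It is an illustration under stated assumptions, not a recommendation of a particular tool: the names `CheckResult`, `run_checks`, and `summarize` are hypothetical, and the individual checks are placeholders that you would replace with real probes of your own system.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class CheckResult:
    name: str
    ok: bool
    detail: str
    seconds: float


def run_checks(checks: Dict[str, Callable[[], bool]]) -> List[CheckResult]:
    """Run every check; an exception counts as a failure, not a crash."""
    results = []
    for name, check in checks.items():
        start = time.monotonic()
        try:
            ok = bool(check())
            detail = "ok" if ok else "check returned a failure"
        except Exception as exc:
            ok = False
            detail = f"{type(exc).__name__}: {exc}"
        results.append(CheckResult(name, ok, detail, time.monotonic() - start))
    return results


def summarize(results: List[CheckResult]) -> str:
    """One line that a dashboard or alert could display."""
    failed = [r.name for r in results if not r.ok]
    return "ALL OK" if not failed else "FAILING: " + ", ".join(failed)


if __name__ == "__main__":
    # Placeholder checks (assumptions); swap in real probes.
    checks = {
        "can_reach_database": lambda: True,
        "latest_deploy_has_passing_tests": lambda: True,
    }
    print(summarize(run_checks(checks)))
```

Run on a schedule, with the summary line sent somewhere visible, a runner like this is a cheap first step toward knowing about a failure before the users do.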

As you can tell, I am hoping that you will conclude that things are not all as they should be, and that work in these areas will pay off handsomely. Don’t accept the status quo. Systems built now can and should be deployable, including changes, in minutes. This means that software is a great bet for helping people and processes to be adaptable. If you have a legacy system where that’s not true, start planning how to fix that by supplementing it and eventually replacing it with a modern system. Your organization’s very survival depends on it, because brittle systems break, and stay broken.

Feedback, agreement, and cogent disagreements are always welcome.