Tech Debt is an Ego Problem
Early in my career, I got brought into a fintech for what I thought was a cloud transformation project. They ran B2B software - a separate hosted instance for each client, something like 8,000 copies. The kind of setup that makes platform engineers twitch.
In the first fortnight, I put together what I thought was a solid proposal. Shared identity platform. Migrate storage to sharded S3. Strangler pattern to incrementally move clients over. Textbook cloud-native modernisation. I was pretty pleased with myself.
I got shut down
The response was something like: “This is not what we are paying you for.” I thought they were getting a deal. They thought they were getting random whiteboard ivory tower nonsense.
The company had been sold a lift-and-shift. That’s what they’d bought, budgeted for, and signed off on. They weren’t looking for someone to redesign their architecture - they were trying to get out of an unstable datacenter before it cost them clients.
The risk of the migration itself was already keeping people up at night. 8,000 client instances, each one a potential outage, each outage a potential lost contract. And here I was, two weeks in, proposing we also rearchitect their identity system and storage layer. I might as well have suggested we rewrite the whole thing in Rust while we’re at it.
I was wrong
The 8,000 instances worked. They were a pain to manage, sure, but they were a known pain. The risk and cost of a major rearchitecture, on top of a datacenter migration, would have been enormous - and for what? To make the architecture prettier? At that moment, for that business, the cost of carrying the “debt” was less than the cost of paying it off.
Sometimes the right answer is to leave it alone. I didn’t get that then. The 8,000 instances were ugly, but they were a known ugly - and my “fix” would have introduced months of uncertainty during a period when they couldn’t afford any. The debt was cheaper to carry than to pay off. I just couldn’t see it because I was too busy being technically correct.
Stop Saying “Tech Debt”
Part of the problem is the phrase itself.
Ward Cunningham coined “technical debt” in 1992 to explain something specific: the conscious decision to ship code you know isn’t ideal, in order to learn faster or hit a deadline. Like financial debt, you take it on deliberately, knowing you’ll pay interest until you pay it down.
That’s not how anyone uses the term anymore. “Tech debt” has become a catch-all for any code someone doesn’t like. Legacy systems. Bad architecture decisions made by people who left years ago. Shortcuts nobody remembers taking. Code that’s just old. It all gets lumped together as “debt” - which makes the word meaningless.
When everything is debt, nothing is. Your manager hears “we have tech debt” and thinks: yeah, everyone says that. It’s like saying “we have meetings.” Tell them something they don’t know.
The other problem: debt implies someone borrowed something. You took on the debt to gain something - speed, learning, market timing. But most of what engineers call tech debt wasn’t a decision at all. Nobody sat down and said “let’s make this worse now to ship faster.” It just happened. Accumulated. Rotted.
That’s not debt. That’s just entropy.
Martin Fowler’s technical debt quadrant is useful here. He breaks it into two dimensions: was it deliberate or inadvertent? Was it reckless or prudent?
Only one quadrant is actual strategic debt - the deliberate, prudent kind. “We know this isn’t ideal, but shipping now is worth the trade-off, and we’ll address it later.” That’s borrowing against the future with eyes open.
The rest isn’t debt in any meaningful sense. Reckless/deliberate is negligence - you knew it was bad and didn’t care. Reckless/inadvertent is a skills gap - you didn’t know enough to know it was bad. Prudent/inadvertent is just learning - you discovered a better approach after the fact.
Calling all of these “debt” muddies the conversation. When you say “we have tech debt,” your manager doesn’t know if you mean “we made a strategic trade-off that’s now due” or “some code is old and I don’t like it.”
Stop using the phrase. It doesn’t communicate anything useful to anyone outside your immediate team. Instead, talk about the actual problem:
- “We’re spending 30% of each sprint on workarounds for the payment system”
- “The last three outages all traced back to the same service”
- “Onboarding a new engineer takes four weeks instead of one because of tribal knowledge”
Those are specific. They have numbers. They imply cost. Your manager can do something with that. “We have tech debt” gives them nothing.
Make It a Number
“This code is frustrating to work with” isn’t a business case. Neither is “this system is unreliable” or “we’re accumulating debt.” Those are complaints. Your manager has heard them before, probably from you.
A business case has numbers. Cost, risk, opportunity. Things that can be compared against other things competing for the same resources. When product comes in asking for three engineers for six weeks to build a new feature, they’ve got revenue projections attached. You need the same.
This is harder than it sounds. Most technical problems don’t come with price tags attached. You have to estimate, and estimation feels uncomfortable - what if you’re wrong? What if someone challenges your numbers?
A rough estimate beats no estimate. “This is probably costing us somewhere between $50K and $150K a year” is infinitely more useful than “this is bad.” Your estimate might be off by a factor of two. It’s still better than nothing, and it gives people something concrete to discuss. “I think your estimate is too high because X” is a productive conversation. “We have tech debt” is not.
Direct Spend
This is usually the smaller bucket - but it’s the easiest to defend because receipts exist.
Pull up your cloud bills. Are you over-provisioning to compensate for inefficient code? Running redundant systems because you can’t trust the primary? Paying for capacity you don’t need because scaling down is too risky? If you’re running 10 instances when 6 would do because the code can’t handle load efficiently, that’s 40% waste. Multiply by your monthly spend.
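A back-of-envelope sketch of that sum, if it helps - the instance counts and monthly spend below are made up, so substitute the figures from your own bills:

```python
# Rough estimate of cloud over-provisioning waste.
# All inputs are illustrative - pull the real numbers from your own bills.

running_instances = 10          # what you actually run
needed_instances = 6            # what efficient code would need
monthly_compute_spend = 24_000  # $ per month on this workload

waste_fraction = (running_instances - needed_instances) / running_instances
annual_waste = waste_fraction * monthly_compute_spend * 12

print(f"Waste fraction: {waste_fraction:.0%}")   # 40%
print(f"Annual waste:   ${annual_waste:,.0f}")   # $115,200
```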
Not all over-provisioning is waste. High availability, burst capacity, geographic redundancy - these are deliberate architecture decisions. The waste is the unintentional kind: capacity you’re paying for because you’re afraid to touch the scaling config.
Customer credits and SLA penalties go here too - actual money leaving the company. Every refund you gave because the system was down, every contractual penalty you paid for missing uptime guarantees.
Opportunity Costs
This is where most of the money hides - but it’s harder to sell, and harder to measure.
You can’t hand someone a cheque for opportunity costs. “We saved $200K in engineer time” doesn’t mean $200K appears in an account somewhere - it means those engineers could have been doing something else. The pitch becomes “your team can do X instead of Y,” which means you need to win three separate arguments:
- Y is actually happening. You need to prove engineers are actually spending time on this, not just assert it. That means data - ticket analysis, time tracking, surveys. The sections below show how.
- The fix will actually work. You’re asking for investment based on a prediction. Point to similar fixes that worked - here or elsewhere. Propose a spike to validate the approach. Reduce the ask to something smaller that proves the concept.
- X is worth doing. Even if you free up capacity, it only matters if there’s something valuable to redirect it toward. Point to the backlog. Show the features that keep slipping. Quantify the revenue waiting on things you can’t ship.
Each of these requires evidence, which is why the measurement matters. “Hard to measure” too often becomes “not measured at all” - and then you’re asking someone to believe all three on faith. That doesn’t work.
Incident response time
Every outage has a paper trail. How many incidents traced back to this system in the last year? How long did each take to resolve? Who was involved?
Multiply engineer hours by loaded cost. Loaded cost is salary plus overhead - benefits, equipment, office space, management time. Ask finance for your company’s actual number; it varies widely (1.3x at lean startups, 2x+ at enterprises with heavy benefits). If you can’t get it, 1.4x is a reasonable starting point for back-of-envelope math.
Three outages last quarter, averaging 15 engineer-hours each to resolve, at $120/hour loaded cost: that’s $5,400 in engineering time. Not huge on its own, but it adds up. And that’s before you count the opportunity cost - whatever those engineers weren’t building while they were firefighting. If you have customer success or support teams, add their time too.
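Here’s that sum as a quick sketch - the figures are the illustrative ones from above, and the annualised line assumes last quarter was typical, which you’d want to check:

```python
# Rough cost of firefighting a flaky service.
# Figures match the example above; the loaded rate is a placeholder -
# ask finance for your company's real multiplier.

incidents_per_quarter = 3
engineer_hours_per_incident = 15
loaded_hourly_cost = 120  # $/hour

quarterly_cost = (incidents_per_quarter
                  * engineer_hours_per_incident
                  * loaded_hourly_cost)

print(f"Quarterly firefighting cost: ${quarterly_cost:,}")      # $5,400
print(f"Annualised (if typical):     ${quarterly_cost * 4:,}")  # $21,600
```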
Delayed revenue
Talk to sales and product. Did we lose deals because of technical limitations? Did we delay launches? Salespeople remember the ones that got away - “we lost Acme because we couldn’t offer SSO” is a real number. These conversations get you data you couldn’t get otherwise, and build allies who are invested in the outcome when you present your proposal.
Velocity drag
The biggest cost, and the hardest to see. Ask your team: “In the last sprint, how much time did you spend on workarounds, legacy issues, or fighting tooling?” They know. Or pull tickets from the last few months and categorise them: new features, bug fixes, maintenance, fire-fighting. If 40% of effort is going to maintenance, that’s a velocity tax you’re paying every sprint.
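If you go the ticket route, the counting itself is trivial - a minimal sketch, with invented tickets and categories, counting tickets as a rough proxy for effort:

```python
from collections import Counter

# Hypothetical ticket export: (ticket_id, category) pairs.
# In practice you'd pull these from your tracker and categorise them
# by hand or by label.
tickets = [
    ("PAY-101", "feature"), ("PAY-102", "workaround"), ("PAY-103", "bugfix"),
    ("PAY-104", "maintenance"), ("PAY-105", "feature"), ("PAY-106", "firefighting"),
    ("PAY-107", "maintenance"), ("PAY-108", "feature"), ("PAY-109", "workaround"),
    ("PAY-110", "feature"),
]

counts = Counter(category for _, category in tickets)
drag = sum(counts[c] for c in ("workaround", "maintenance", "firefighting"))

print(f"Velocity drag: {drag / len(tickets):.0%} of tickets")  # 50%
```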
People costs
Bad systems cost you people’s time. Onboarding takes four weeks instead of two? That’s two weeks of salary per hire, plus everyone else’s time answering questions. Tribal knowledge is expensive - when “just ask Dave” is how your system works, Dave leaving means that knowledge leaves too.
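A rough sketch of what that adds up to over a year - every figure here is an illustrative placeholder:

```python
# Rough annual cost of slow onboarding. All inputs are illustrative.

extra_onboarding_weeks = 2      # four weeks instead of two
weekly_loaded_cost = 3_400      # $ per engineer-week, loaded
mentor_hours_per_hire = 20      # colleagues answering questions
mentor_hourly_cost = 85         # $/hour, loaded
hires_per_year = 6

per_hire = (extra_onboarding_weeks * weekly_loaded_cost
            + mentor_hours_per_hire * mentor_hourly_cost)
annual_cost = per_hire * hires_per_year

print(f"Onboarding drag per hire: ${per_hire:,}")     # $8,500
print(f"Per year:                 ${annual_cost:,}")  # $51,000
```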
People don’t quit because of code - they quit because of people and culture. But a rotting codebase signals both: leadership doesn’t invest, doesn’t listen, prioritises shipping fast over shipping well. A senior engineer leaving costs $50-100K direct (recruiters, interviews, onboarding) plus months of reduced output. If turnover is high, the codebase probably isn’t the cause - but it might be a symptom.
Quantifying Risk
Some costs haven’t happened yet. The SLA penalty you’re trending toward. The outage that will happen when that one engineer who understands the payment system leaves. The compliance fine waiting to land because you can’t patch the legacy system without breaking everything.
You don’t have actuarial tables for “chance this legacy system explodes.” But you can estimate. Probability times impact - a 10% chance of a $500K incident is a $50K expected cost.
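As a sketch - and it’s worth presenting the range rather than a single point estimate:

```python
# Expected annual cost of a risk: probability x impact.
# Both inputs are judgement calls - get them from the engineers who
# live with the system.

prob_low, prob_high = 0.10, 0.20  # chance of a major failure this year
impact = 500_000                  # $ cost if it happens

print(f"Expected cost: ${prob_low * impact:,.0f} - "
      f"${prob_high * impact:,.0f} per year")
# Expected cost: $50,000 - $100,000 per year
```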
Where does the probability come from? Ask the people who work on it. “If we don’t fix this in the next twelve months, what’s the chance something breaks badly?” Engineers who live with a system know where the bodies are buried. They know which near-misses they’ve had, which workarounds are held together with tape, which parts they pray nobody touches. Their estimates won’t be precise, but they’ll be grounded in reality.
You can also look at history. How often has this type of thing failed before? If you’ve had three close calls in the last year, “10-20% chance of major failure” isn’t a guess - it’s extrapolation.
The scary ones are concentration risks. One person who understands how it works. One component everything depends on. Those multiply impact in ways that aren’t obvious until something breaks.
Borrow Someone Else’s Credibility
Sometimes you’re not the right messenger. The exact same argument lands differently from McKinsey than from the team lead who’s been “complaining about tech debt” for two years. That’s how organisations work. Use it.
McKinsey found tech debt amounts to 20-40% of technology estate value, with 10-20% of new-product budget diverted to dealing with it.[1] Stripe’s research shows developers spend 42% of their time on maintenance and debt rather than building.[2] DORA’s research correlates lower technical debt with better deployment frequency and incident recovery.[3]
None of this replaces your own numbers. But when your CTO sees McKinsey saying the same thing you’ve been saying, suddenly it’s worth a meeting.
Why This Matters: Defensibility
You might be thinking: I can’t get numbers this precise. My estimates would be rough. Isn’t this just making things up?
The point isn’t precision - it’s defensibility. You need something you can defend when challenged.
Gut instinct works when you already have trust. If you’ve been at a company for five years and have a track record, people believe you when you say “this is a problem.” But gut instinct doesn’t scale:
- New to the org? You haven’t earned trust yet.
- Talking to someone outside your reporting chain? They don’t know your track record.
- Proposal needs three sign-offs? At least one person doesn’t trust your judgement.
- Communicating through a document instead of a conversation? You can’t rely on presence and charisma.
A rough estimate you can source and defend is infinitely more useful than a confident assertion you can’t back up. When someone asks “where did that number come from?” you need an answer better than “it felt right.”
Worked Example: The Checkout Crash
Let’s make this concrete. Your checkout service is unreliable. It crashes under load, maybe once a month, usually during peak hours. Engineering keeps asking for time to fix it, keeps getting told there are higher priorities. How do you make the case?
The numbers below are illustrative - yours will be different. The method is what matters: source each input, show your working, and be ready to defend every assumption.
Step 1: Quantify the incidents
Pull the data from the last year:
| Metric | Value |
|---|---|
| Total incidents | 14 |
| Average duration | 45 minutes |
| Engineering time per incident | 8 hours (including post-incident review) |
Step 2: Calculate direct costs
| Cost type | Formula | Result |
|---|---|---|
| Engineering time | = 14 × 8 × 120 | $13,440 |
| Customer credits | = 14 × 200 | $2,800 |
| Direct total | = 13440 + 2800 | $16,240 |
Not huge. This is why direct costs alone don’t make the case.
Step 3: Estimate revenue impact
During those 45-minute windows, customers couldn’t check out. How much revenue did you lose?
| Metric | Formula | Result |
|---|---|---|
| Hourly revenue (peak) | | $8,000 |
| Downtime per incident | = 45 / 60 | 0.75 hrs |
| Revenue lost per incident | = 8000 × 0.75 | $6,000 |
| Annual lost revenue | = 6000 × 14 | $84,000 |
Step 4: Factor in customer churn
Some customers who hit an error don’t come back. If your checkout crashes while someone’s trying to pay, they might go to a competitor - permanently.
| Metric | Formula | Result |
|---|---|---|
| Transactions affected per incident (est.) | | 50 |
| % who never return (est.) | | 10% |
| Customers lost per incident | = 50 × 0.10 | 5 |
| Customer lifetime value | | $400 |
| Annual churn cost | = 14 × 5 × 400 | $28,000 |
Step 5: Total it up
| Category | Formula | Result |
|---|---|---|
| Direct costs | | $16,240 |
| Lost revenue | | $84,000 |
| Customer churn | | $28,000 |
| Total annual cost | = 16240 + 84000 + 28000 | $128,240 |
Now you have a number. “Fixing the checkout service would save us roughly $130K a year” is a conversation. “The checkout service is unreliable” is not.
Step 6: Compare to cost of fixing
If engineering estimates 6 weeks of work for two engineers to properly fix the service:
| Investment | Formula | Result |
|---|---|---|
| Engineering time | = 6 × 2 × 40 × 120 | $57,600 |
| Payback period | = 57600 / 128240 × 12 | 5.4 months |
The fix pays for itself in under 6 months. After that, it’s pure savings. That’s a business case.
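If you want to sanity-check the arithmetic, here’s the whole example as one small script. Every input is the illustrative figure from the tables above - substitute your own:

```python
# The checkout-crash example, steps 1-6, as one calculation.
# All inputs are the illustrative figures from the tables above.

incidents_per_year = 14
engineer_hours_per_incident = 8
loaded_hourly_cost = 120           # $/hour
credit_per_incident = 200          # $ in customer credits

peak_hourly_revenue = 8_000        # $/hour
downtime_hours = 45 / 60           # 45 minutes per incident

transactions_affected = 50         # per incident (estimate)
churn_rate = 0.10                  # share who never return (estimate)
customer_lifetime_value = 400      # $

# Step 2: direct costs
direct = incidents_per_year * (engineer_hours_per_incident * loaded_hourly_cost
                               + credit_per_incident)                     # $16,240

# Step 3: lost revenue during downtime
lost_revenue = incidents_per_year * peak_hourly_revenue * downtime_hours  # $84,000

# Step 4: customer churn
churn = (incidents_per_year * transactions_affected
         * churn_rate * customer_lifetime_value)                          # $28,000

# Step 5: total annual cost
total = direct + lost_revenue + churn                                     # $128,240

# Step 6: cost of the fix and payback period
fix_cost = 6 * 2 * 40 * loaded_hourly_cost    # 6 weeks x 2 engineers x 40 hrs/week
payback_months = fix_cost / total * 12

print(f"Total annual cost: ${total:,.0f}")                # $128,240
print(f"Fix cost:          ${fix_cost:,.0f}")             # $57,600
print(f"Payback:           {payback_months:.1f} months")  # 5.4 months
```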
When Should You Fix It?
Not all debt needs fixing immediately. Some debt is cheap to carry. Some is getting more expensive every month. The question isn’t just “should we fix this?” but “when?”
This is where the compound interest metaphor becomes useful - and where you can actually do the maths.
The carrying cost is what you’re paying every month to not fix the problem. Lost revenue, engineering time, customer churn - whatever you calculated above, divided by 12.
The fix cost is the one-time investment to make it go away.
The break-even point is when your accumulated carrying costs exceed the fix cost. Before that point, it’s cheaper to carry the debt. After that point, you’re losing money by not fixing it.
For the checkout example:
- Carrying cost: ~$10,700/month
- Fix cost: $57,600
- Break-even: 5.4 months
After 5.4 months, every month you don’t fix it costs you $10,700. After a year, you’ve paid $128K to not spend $58K.
But some debt works differently. If the carrying cost is $2,000/month and the fix costs $100,000, break-even is over 4 years. If the system might be replaced in 2 years anyway, maybe you just carry it.
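The comparison itself is one line of arithmetic - here it is for both scenarios above, nothing more:

```python
# Months until accumulated carrying cost exceeds the one-time fix cost.

def breakeven_months(fix_cost: float, monthly_carrying_cost: float) -> float:
    return fix_cost / monthly_carrying_cost

print(f"Checkout service: {breakeven_months(57_600, 10_700):.1f} months")      # ~5.4
print(f"Cheap-to-carry:   {breakeven_months(100_000, 2_000) / 12:.1f} years")  # ~4.2
```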
The maths doesn’t make the decision for you. But it tells you what question you’re actually answering: “Is the accumulated interest worth more than the principal?” Sometimes yes, sometimes no.
Where This Breaks Down
I’ve been writing as if the problem is always solvable. It’s not.
Sometimes the organisation is genuinely broken. If leadership doesn’t believe in investing in technical health, no business case will help. If decisions are made politically rather than rationally, rational arguments won’t land. You can tell the difference: if you’ve tried multiple approaches, built allies, presented data, and every conversation ends with “we’ll think about it” - the problem isn’t your pitch.
Sometimes the economics don’t work. The break-even calculator assumes the system will exist long enough to recoup the investment. If the product might be sunset, if you’ll rewrite the whole thing in eighteen months anyway - sometimes carrying the debt really is cheaper. Not every codebase deserves to be rescued.
Sometimes you don’t have standing. A junior engineer telling the CTO about organisational dysfunction won’t land, no matter how right they are. Influence requires credibility, and credibility takes time. You can’t move something you don’t have leverage on.
Sometimes you’re wrong. Certainty feels the same whether you’re right or wrong. The only way to tell the difference is to stay curious and update when the evidence doesn’t match your model.
If you’ve been making the same argument for years and nothing’s changed, the problem isn’t that management doesn’t understand. It’s that you’re presenting complaints instead of business cases. Or you’re talking about “tech debt” when you should be talking about specific costs. Or you’re proposing fixes without understanding what the person across the table is actually worried about.
Or - and this is the uncomfortable one - maybe you’re wrong about whether it should be fixed at all. I was, once. The 8,000 instances taught me that being technically correct isn’t the same as being right.