I Built an AI Code Maintenance Tool and Ran It on Itself
The first real test of any analysis tool is whether it is useful on its own codebase. Not a toy repository. Not a demo. The actual code that runs it.
So I ran Endure on codeslick — the monorepo containing Endure itself — and found a calibration bug in the scorer. The tool flagged its own core file as the highest-debt file in the project, and when I looked at why, I found the scoring was wrong in a specific way I had not noticed during development.
Here is what I found, what I fixed, and what the tool still cannot do.
What Endure measures
Endure calculates a technical debt score for each file in a repository using five components: churn (how often the file changes), complexity (cyclomatic + cognitive, normalized by lines of code), staleness (how long since last commit), duplication, and intent gap (whether developer intent has been captured for that file).
Files score 0–100. The output is a ranked list of files most likely to cause maintenance failures — not just the ones that are complex, but the ones where complexity and change frequency intersect.
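As a sketch, the composition might look like the following. The interface mirrors the five components named above, but the weights are my own illustrative assumption, not Endure's actual values.

```typescript
// Hypothetical sketch of a composite debt score built from the five
// components described above. Weights are illustrative assumptions;
// churn and complexity dominate because the ranking is meant to surface
// files where change frequency and complexity intersect.
interface DebtComponents {
  churn: number;       // 0-100, commit frequency relative to the repo max
  complexity: number;  // 0-100, cyclomatic + cognitive, normalized by LOC
  staleness: number;   // 0-100, time since last commit
  duplication: number; // 0-100, share of duplicated code
  intentGap: number;   // 0-100, where 100 = no developer intent captured
}

const WEIGHTS: DebtComponents = {
  churn: 0.3,
  complexity: 0.3,
  staleness: 0.1,
  duplication: 0.15,
  intentGap: 0.15,
};

function debtScore(c: DebtComponents): number {
  const keys = Object.keys(WEIGHTS) as (keyof DebtComponents)[];
  const raw = keys.reduce((sum, k) => sum + WEIGHTS[k] * c[k], 0);
  return Math.round(raw * 10) / 10; // one decimal place, still 0-100
}
```

Any weighted sum of 0–100 components with weights summing to 1 stays in the 0–100 range, which keeps per-file scores comparable across the repo.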
Finding A: The files it got right
After analyzing 943 files, the two highest-debt files were:
| File | Score | Churn | Complexity | Severity |
|---|---|---|---|---|
| technical-debt-scorer.ts | 61.7 | 95.5 | 88.4 | HIGH |
| server.ts | 51.9 | 100 | 43.8 | HIGH |
Both rankings are correct. technical-debt-scorer.ts is the core file that calculates debt scores. Every calibration improvement, every new scoring rule, every bug fix touches it. It has the highest churn of any logic file in the project because it is the file I am most actively developing. That is exactly what a high-debt score should mean.
server.ts gets modified with every new route, every new middleware, every infrastructure change. Churn of 100 means it has the highest commit frequency relative to everything else in the repo. Correct.
The scorer's ranking matched my own intuition about which files cause the most friction. That is the baseline. The next finding is where it got more interesting.
Finding B: The calibration bug the tool found in itself
Every file with CRITICAL complexity was showing a complexity score of 95 — regardless of size. A 50-line utility and a 2,000-line analyzer looked identical. That is wrong.
A 2,000-line CRITICAL file is categorically harder to change than a 50-line one. It has more surface area, more implicit dependencies, more accumulated context that is not written down anywhere. Treating them as equal understates the risk of the large file and overstates the risk of the small one.
The fix was a LOC-weighted formula per severity bucket:
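One plausible shape for that formula, reconstructed from the numbers in the text (bucket bases, a logarithmic bonus capped at 20, a 95 ceiling). The multiplier of 5 is my own assumption, chosen so a 2,000-line CRITICAL file lands near 90; it is not necessarily the exact constant in Endure.

```typescript
// Sketch of LOC-weighted complexity scoring per severity bucket.
// The multiplier 5 is an assumed constant consistent with the
// worked examples in the text, not confirmed from the source.
type Severity = "CRITICAL" | "HIGH" | "MEDIUM" | "LOW";

const BUCKET_BASE: Record<Severity, number> = {
  CRITICAL: 75,
  HIGH: 70,
  MEDIUM: 40,
  LOW: 15,
};

function complexityScore(loc: number, severity: Severity): number {
  // No bonus under 100 lines; logarithmic growth above, capped at +20.
  const bonus = loc <= 100 ? 0 : Math.min(20, 5 * Math.log(loc / 100));
  return Math.min(95, BUCKET_BASE[severity] + bonus);
}

// complexityScore(50, "CRITICAL")   -> 75
// complexityScore(2000, "CRITICAL") -> ~90
```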
The bucketBase term is 75 for CRITICAL, 70 for HIGH, 40 for MEDIUM, and 15 for LOW. The bonus term adds up to 20 points for files over 100 lines, scaling logarithmically. A 2,000-line CRITICAL file now scores around 90. A 50-line CRITICAL file scores 75. The ceiling stays at 95.
The bug existed because I had originally set the CRITICAL bucket base at 95 — the maximum — leaving no room for LOC differentiation. Running the analysis on a real, heterogeneous codebase made the flatness visible in a way that unit tests on synthetic data had not.
Finding C: High complexity, low churn — the silent risk
The most underrated finding was not in the high-debt list. It was the files that scored MEDIUM but showed complexity above 90 with churn near zero.
python-analyzer.ts: complexity 93.7, churn that barely registers. Endure correctly scores it MEDIUM because it is rarely touched. But that is exactly what makes it dangerous — not now, but the next time it needs to change.
Nobody has touched that file recently. The implicit knowledge of why it is structured the way it is has dispersed. The intent gap score is 100, meaning no developer intent has been captured for it. Stable, yes. But stable and complex with no captured intent is the profile of a file that will take three times longer than expected to modify safely.
The pattern worth naming:
Endure calls these MEDIUM. I call them “deferred high.” The debt is not being paid right now — it is accumulating invisibly.
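One way to surface the "deferred high" profile programmatically is a simple filter over per-file scores. The thresholds here are illustrative choices of mine, not Endure's actual query.

```typescript
// Flags stable-but-complex files with no captured intent: the
// "deferred high" profile. Thresholds are illustrative assumptions.
interface FileScore {
  path: string;
  complexity: number; // 0-100
  churn: number;      // 0-100
  intentGap: number;  // 100 = no developer intent captured
}

function deferredHigh(files: FileScore[]): FileScore[] {
  return files.filter(
    (f) => f.complexity > 90 && f.churn < 10 && f.intentGap === 100
  );
}
```

A file like python-analyzer.ts (complexity 93.7, near-zero churn, intent gap 100) would match; a high-churn file like server.ts would not.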
What I changed after the analysis
Four concrete changes came out of this run:
1. LOC-weighted complexity scoring. The formula above replaced the flat bucket values. Scores now differentiate within severity tiers.
2. Tooling file exclusion. Lock files (package-lock.json, yarn.lock, go.sum, etc.) were appearing with artificially high churn scores. Churn in a lockfile means "dependencies changed" — not "this file needs attention."
3. Documentation file exclusion. Markdown files were showing up in the top-20 debt list. High churn on a dev log means "active documentation" — the opposite of a risk signal. .md, .mdx, .txt, .rst, and Jest snapshots are excluded by default now.
4. Recommendation text. The scorer was outputting "Consider archival" for stable high-complexity files. That implies the file is dying. The correct recommendation is "Schedule modularization — complex and stable, but will resist any future change." The wording matters because it changes what a developer does next.
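The exclusion rules described above could be sketched as a small predicate. The file and extension lists are the ones named in the text; the matching helper itself is illustrative (Jest snapshots use the .snap extension).

```typescript
// Sketch of default exclusion rules for lockfiles, documentation,
// and Jest snapshots. The helper is illustrative, not Endure's code.
const EXCLUDED_FILES = ["package-lock.json", "yarn.lock", "go.sum"];
const EXCLUDED_EXTENSIONS = [".md", ".mdx", ".txt", ".rst", ".snap"];

function isExcluded(path: string): boolean {
  const base = path.split("/").pop() ?? path;
  return (
    EXCLUDED_FILES.includes(base) ||
    EXCLUDED_EXTENSIONS.some((ext) => base.endsWith(ext))
  );
}
```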
What it still cannot do
Intent coverage is 16 of 943 files. The intent gap score is currently binary — a file either has captured intent or it does not. There is no gradient for partial intent, low-confidence intent, or intent that was accurate when written but has since drifted from the implementation.
Churn scores re-normalize on every run relative to the highest-churn file in the repository. A file that scores 80 today might score 65 in six months if a new file becomes the new maximum — even if nothing changed in the original file. The score is a relative ranking, not an absolute measurement.
Semantic drift detection — whether the code still does what the developer intended when they wrote it — is not implemented yet. That is the next hard problem.
Try it on your codebase
Point Endure at your own repository. Public GitHub repos work without an API key. When the analysis finishes, look at your top-10 list and ask whether the ranking matches your intuition about which files cause the most friction.
That gap — between what the scorer thinks is risky and what you know is risky — is the most useful output. It either confirms the signal or tells you where the model is wrong.
Try Endure →

The point
If your tool is not useful on your own codebase, you have not built a tool — you have built a demo.
Running Endure on itself found a real bug, confirmed two real rankings, and surfaced a category of risk I would not have articulated clearly without the data in front of me. The calibration will keep improving. The intent coverage will grow. The silent-risk category will get a better score.
But the first run was useful. That is enough to keep going.
One question I am genuinely curious about:
What is the file in your codebase that everyone is most afraid to touch? Not the most complex one on paper — the one where someone says “don't change that, nobody remembers how it works.” Does it show up at the top of your Endure list, or is it buried in MEDIUM?
I read every reply — send me a note.