I Built an AI Code Maintenance Tool and Ran It on Itself
The first real test of any analysis tool is whether it is useful on its own codebase. Not a toy repository. Not a demo. The actual code that runs it.
So I ran Endure on codeslick — the monorepo containing Endure itself — and found a calibration bug in the scorer. The tool flagged its own core file as the highest-debt file in the project, and when I looked at why, I found the scoring was wrong in a specific way I had not noticed during development.
Here is what I found, what I fixed, and what the tool still cannot do.
What Endure measures
Endure calculates a technical debt score for each file in a repository using five components: churn (how often the file changes), complexity (cyclomatic + cognitive, normalized by lines of code), staleness (how long since last commit), duplication, and intent gap (whether developer intent has been captured for that file).
Files score 0–100. The output is a ranked list of files most likely to cause maintenance failures — not just the ones that are complex, but the ones where complexity and change frequency intersect.
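As a sketch, the composition might look like the following. The interface mirrors the five components named above, but the weights are my own illustrative assumption, not Endure's actual values.

```typescript
// Hypothetical sketch of a composite debt score built from the five
// components described above. Weights are illustrative assumptions;
// churn and complexity dominate because the ranking is meant to surface
// files where change frequency and complexity intersect.
interface DebtComponents {
  churn: number;       // 0-100, commit frequency relative to the repo max
  complexity: number;  // 0-100, cyclomatic + cognitive, normalized by LOC
  staleness: number;   // 0-100, time since last commit
  duplication: number; // 0-100, share of duplicated code
  intentGap: number;   // 0-100, where 100 = no developer intent captured
}

const WEIGHTS: DebtComponents = {
  churn: 0.3,
  complexity: 0.3,
  staleness: 0.1,
  duplication: 0.15,
  intentGap: 0.15,
};

function debtScore(c: DebtComponents): number {
  const keys = Object.keys(WEIGHTS) as (keyof DebtComponents)[];
  const raw = keys.reduce((sum, k) => sum + WEIGHTS[k] * c[k], 0);
  return Math.round(raw * 10) / 10; // one decimal place, still 0-100
}
```

Any weighted sum of 0–100 components with weights summing to 1 stays in the 0–100 range, which keeps per-file scores comparable across the repo.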
Finding A: The files it got right
After analyzing 943 files, the two highest-debt files were:
| File | Score | Churn | Complexity | Severity |
|---|---|---|---|---|
| technical-debt-scorer.ts | 61.7 | 95.5 | 88.4 | HIGH |
| server.ts | 51.9 | 100 | 43.8 | HIGH |
Both rankings are correct. technical-debt-scorer.ts is the core file that calculates debt scores. Every calibration improvement, every new scoring rule, every bug fix touches it. It has the highest churn of any logic file in the project because it is the file I am most actively developing. That is exactly what a high-debt score should mean.
server.ts gets modified with every new route, every new middleware, every infrastructure change. Churn of 100 means it has the highest commit frequency relative to everything else in the repo. Correct.
The scorer's ranking matched my own intuition about which files cause the most friction. That is the baseline. The next finding is where it got more interesting.
Finding B: The calibration bug the tool found in itself
Every file with CRITICAL complexity was showing a complexity score of 95 — regardless of size. A 50-line utility and a 2,000-line analyzer looked identical. That is wrong.
A 2,000-line CRITICAL file is categorically harder to change than a 50-line one. It has more surface area, more implicit dependencies, more accumulated context that is not written down anywhere. Treating them as equal understates the risk of the large file and overstates the risk of the small one.
The fix was a LOC-weighted formula per severity bucket:
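One plausible shape for that formula, reconstructed from the numbers in the text (bucket bases, a logarithmic bonus capped at 20, a 95 ceiling). The multiplier of 5 is my own assumption, chosen so a 2,000-line CRITICAL file lands near 90; it is not necessarily the exact constant in Endure.

```typescript
// Sketch of LOC-weighted complexity scoring per severity bucket.
// The multiplier 5 is an assumed constant consistent with the
// worked examples in the text, not confirmed from the source.
type Severity = "CRITICAL" | "HIGH" | "MEDIUM" | "LOW";

const BUCKET_BASE: Record<Severity, number> = {
  CRITICAL: 75,
  HIGH: 70,
  MEDIUM: 40,
  LOW: 15,
};

function complexityScore(loc: number, severity: Severity): number {
  // No bonus under 100 lines; logarithmic growth above, capped at +20.
  const bonus = loc <= 100 ? 0 : Math.min(20, 5 * Math.log(loc / 100));
  return Math.min(95, BUCKET_BASE[severity] + bonus);
}

// complexityScore(50, "CRITICAL")   -> 75
// complexityScore(2000, "CRITICAL") -> ~90
```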
The bucketBase term is 75 for CRITICAL, 70 for HIGH, 40 for MEDIUM, and 15 for LOW. The bonus term adds up to 20 points for files over 100 lines, scaling logarithmically. A 2,000-line CRITICAL file now scores around 90. A 50-line CRITICAL file scores 75. The ceiling stays at 95.
The bug existed because I had originally set the CRITICAL bucket base at 95 — the maximum — leaving no room for LOC differentiation. Running the analysis on a real, heterogeneous codebase made the flatness visible in a way that unit tests on synthetic data had not.
Finding C: High complexity, low churn — the silent risk
The most underrated finding was not in the high-debt list. It was the files that scored MEDIUM but showed complexity above 90 with churn near zero.
python-analyzer.ts: complexity 93.7, churn that barely registers. Endure correctly scores it MEDIUM because it is rarely touched. But that is exactly what makes it dangerous — not now, but the next time it needs to change.
Nobody has touched that file recently. The implicit knowledge of why it is structured the way it is has dispersed. The intent gap score is 100, meaning no developer intent has been captured for it. Stable, yes. But stable and complex with no captured intent is the profile of a file that will take three times longer than expected to modify safely.
The pattern worth naming:
Endure calls these MEDIUM. I call them “deferred high.” The debt is not being paid right now — it is accumulating invisibly.
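One way to surface the "deferred high" profile programmatically is a simple filter over per-file scores. The thresholds here are illustrative choices of mine, not Endure's actual query.

```typescript
// Flags stable-but-complex files with no captured intent: the
// "deferred high" profile. Thresholds are illustrative assumptions.
interface FileScore {
  path: string;
  complexity: number; // 0-100
  churn: number;      // 0-100
  intentGap: number;  // 100 = no developer intent captured
}

function deferredHigh(files: FileScore[]): FileScore[] {
  return files.filter(
    (f) => f.complexity > 90 && f.churn < 10 && f.intentGap === 100
  );
}
```

A file like python-analyzer.ts (complexity 93.7, near-zero churn, intent gap 100) would match; a high-churn file like server.ts would not.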
What I changed after the analysis
Four concrete changes came out of this run:
1. LOC-weighted complexity scoring. The formula above replaced the flat bucket values. Scores now differentiate within severity tiers.
2. Tooling file exclusion. Lock files (package-lock.json, yarn.lock, go.sum, etc.) were appearing with artificially high churn scores. Churn in a lockfile means "dependencies changed" — not "this file needs attention."
3. Documentation file exclusion. Markdown files were showing up in the top-20 debt list. High churn on a dev log means "active documentation" — the opposite of a risk signal. .md, .mdx, .txt, .rst, and Jest snapshots are excluded by default now.
4. Recommendation text. The scorer was outputting "Consider archival" for stable high-complexity files. That implies the file is dying. The correct recommendation is "Schedule modularization — complex and stable, but will resist any future change." The wording matters because it changes what a developer does next.
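The exclusion rules described above could be sketched as a small predicate. The file and extension lists are the ones named in the text; the matching helper itself is illustrative (Jest snapshots use the .snap extension).

```typescript
// Sketch of default exclusion rules for lockfiles, documentation,
// and Jest snapshots. The helper is illustrative, not Endure's code.
const EXCLUDED_FILES = ["package-lock.json", "yarn.lock", "go.sum"];
const EXCLUDED_EXTENSIONS = [".md", ".mdx", ".txt", ".rst", ".snap"];

function isExcluded(path: string): boolean {
  const base = path.split("/").pop() ?? path;
  return (
    EXCLUDED_FILES.includes(base) ||
    EXCLUDED_EXTENSIONS.some((ext) => base.endsWith(ext))
  );
}
```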
What it still cannot do
Intent coverage is 16 of 943 files. The intent gap score is currently binary — a file either has captured intent or it does not. There is no gradient for partial intent, low-confidence intent, or intent that was accurate when written but has since drifted from the implementation.
Churn scores re-normalize on every run relative to the highest-churn file in the repository. A file that scores 80 today might score 65 in six months if a new file becomes the new maximum — even if nothing changed in the original file. The score is a relative ranking, not an absolute measurement.
Semantic drift detection — whether the code still does what the developer intended when they wrote it — is not implemented yet. That is the next hard problem.
Try it on your codebase
Point Endure at your own repository. Public GitHub repos work without an API key. When the analysis finishes, look at your top-10 list and ask whether the ranking matches your intuition about which files cause the most friction.
That gap — between what the scorer thinks is risky and what you know is risky — is the most useful output. It either confirms the signal or tells you where the model is wrong.
Try Endure →

The point
If your tool is not useful on your own codebase, you have not built a tool — you have built a demo.
Running Endure on itself found a real bug, confirmed two real rankings, and surfaced a category of risk I would not have articulated clearly without the data in front of me. The calibration will keep improving. The intent coverage will grow. The silent-risk category will get a better score.
But the first run was useful. That is enough to keep going.
One question I am genuinely curious about:
What is the file in your codebase that everyone is most afraid to touch? Not the most complex one on paper — the one where someone says “don't change that, nobody remembers how it works.” Does it show up at the top of your Endure list, or is it buried in MEDIUM?
I read every reply — send me a note.