Here is a conjecture: we will soon transition to a world in which software code is as opaque to software engineers as machine code currently is.
If this sounds alarming it should be. But it doesn’t have to be a disaster.
It’s Already Happening
A developer at a meetup in San Francisco said something that stuck with me: “Code review? Nobody does that anymore.” He wasn’t complaining, he was merely stating a fact. The speed pressure on developers is real, the AI output is voluminous, and the humans in the loop simply aren’t reading it anymore. Code is being written, reviewed in name only, and shipped — and this is happening at serious companies with serious engineers.
This is also not entirely irrational. There is a subtler reason too: trying to drive the structure of AI-generated code too heavily tends to confuse the model. LLMs perform best on well-trodden paths — patterns they have seen thousands of times in training. The more you push the model toward your idiosyncratic architectural preferences, the more you steer it into less-travelled territory where it makes more mistakes. Sometimes the best prompt is the one that gets out of the way and lets the model produce idiomatic, conventional code.
The upshot is that code opacity is already the reality for a large and growing fraction of shipped software. The question is not whether we accept it. It is whether we accept it safely.
The Machine Code Parallel
Nobody writes machine code any more. Virtually no working programmer can read it fluently. Yet we build extraordinarily complex and reliable systems on top of it, because we have compilers — tools that translate from a language we understand into one we don’t, with strong guarantees that the translation is correct.
The same transition is coming for source code. LLMs are already writing most of the implementation. What most developers haven’t yet twigged is what will make this safe: automated theorem proving.
Why Theorem Proving Changes Everything
If a theorem prover is powerful enough, you can establish strong guarantees about what a piece of software actually does — with little to no programmer effort. You don’t need to read the code and convince yourself it’s correct. You need to specify what correct means, and let the prover check it.
This is particularly valuable with LLM-generated code, precisely because you can’t fully trust it. The LLM might produce something that looks right, passes your tests, and is subtly wrong in a way that surfaces under specific conditions you didn’t think to test for. A theorem prover doesn’t test. It proves. Either the property holds for all inputs, or it doesn’t.
The P ≠ NP Advantage
The P ≠ NP conjecture — which nearly all computer scientists believe to be true — implies that for many problems, checking a solution is fundamentally easier than finding one. This has historically been a barrier for automated reasoning systems: they get stuck trying to enumerate the right proof path.
What LLMs bring to this is an uncanny capacity to guess effective lemmata — the intermediate steps that bridge from what you know to what you want to prove — in cases that completely stumped deterministic search. It doesn’t work every time. But it works far more often than anything before it, and in places that were previously considered out of reach.
The harder guesses that once required expert human mathematicians to unstick are increasingly within reach of an LLM oracle. And at the same time, LLMs can do the hard work of writing the contracts themselves — adding preconditions, postconditions, and loop invariants to code and iterating until the proof goes through.
What This Means in Practice
At Scidonia we are using our Axiomander MCP to establish properties of Python code in a way that requires nothing beyond plugging into an MCP server. Contracts are plain assert statements or doc-string — no new language, no imports, no decorators. Your LLM writes the code, writes the contracts, and drives the verification loop.
The prover works in three tiers. Level 1 uses structural weakest-precondition reduction plus linear arithmetic (lia) and clears the majority of goals automatically. Level 2 calls an SMT solver (cvc4 or z3) for non-linear arithmetic and division. Level 3 is an LLM oracle running an interactive Coq proof session — the thing that gets unstuck when deterministic search gives up. Under python -O, every assert and every ghost variable is stripped by the bytecode compiler. Zero runtime overhead.
Here is what the contracts look like in practice:
def transfer(balances: dict[str, int], sender: str, receiver: str, amount: int) -> int:
"""
axiomander:
where:
old(sender): int = balances[sender]
old(receiver): int = balances.get(receiver, 0)
requires:
amount >= 0
balances[sender] >= amount
ensures:
balances[sender] >= 0
balances[sender] + balances[receiver] == old(sender) + old(receiver)
"""
balances[sender] -= amount
balances[receiver] = balances.get(receiver, 0) + amount
result = 1
return result
The loop invariant on a sorting function, the conservation law on a financial transaction, the branch-precise guarantee that a clamp never exceeds its bounds — these are the things the LLM writes and the prover checks. You as the programmer only need to agree that the contracts at the interface describe what you actually want.
The practical upshot: what you need to understand is no longer the implementation but the interface contracts. Does money transfer conserve the total balance? Does this sort function always return a list in order? Does this authorisation check never grant access it shouldn’t? The code that implements them can be as opaque as machine code — as long as the contracts are clear and the prover has checked them.
This is not a loss. It is the same bargain we already made with compilers, with operating systems, with every layer of abstraction we use without reading. We are about to make it with application code too. But lets make sure we get it right! The bargain only works if we are not making the outcomes worse.