Blog

Insights, tutorials, and updates on high-assurance AI systems, neuro-symbolic programming, and vericoding.

Code Will Become Opaque — And That's Fine

Most software developers haven't yet grasped the real potential of automated theorem proving with LLMs. If provers are powerful enough, you no longer need to understand the code, only the contracts at the interface.

June 3, 2026 by gavin

verification, LLMs, theorem proving, vericoding, software engineering

Axiomander: Robots on Rails

Contracts as plain assert statements. Verification via Coq and SMT. Zero imports, zero decorators, zero runtime overhead. Bringing theorem-prover-grade verification to real Python programmers.

May 19, 2026 by gavin

verification, neuro-symbolic, Python, Coq, MCP

Which Model Should Verify Your Extractions? A Cost-Quality Analysis of LLM Checkers

Every extraction pipeline needs a verification step. We tested eight models as quality scorers and found that for hallucination detection, a model costing 200× less than Claude performs identically. But for events, model quality still matters.

April 14, 2026 by gavin

AI Research, Entity Extraction, Evaluation, LLMs, Cost Optimisation

Why a Structured Ontology Beats a Flat Notepad for LLM Short-Term Memory

Giving an LLM a typed, navigable knowledge structure instead of a flat scratchpad changes what it can remember, how it updates facts, and how much context it consumes doing so.

April 12, 2026 by gavin

AI Research, Memory, Knowledge Graphs, LLMs, Architecture

How Do You Measure an LLM's Memory? Precision and Recall for Conversation Facts

String matching cannot tell you whether an LLM remembers what was said in a conversation. We describe the QA-probing methodology we use to measure short-term memory recall and wiki precision, and what our first results reveal.

April 11, 2026 by gavin

AI Research, Memory, Evaluation, LLMs, Methodology

Ghost Entities: Why LLM Hallucinations in Entity Extraction Are a Serious Downstream Risk

Hallucinated entities and relationships look identical to real ones inside a knowledge graph. We measured how often frontier models inject facts from parametric memory rather than from your documents — and found rates as high as 73% on a single document. Here is why that matters and what to do about it.

April 10, 2026 by gavin

AI Research, Entity Extraction, Hallucinations, Knowledge Graphs, LLMs

Which LLM Finds People Best? Benchmarking Claude, GPT-5.4 and Gemini 3 on PERSON Entity Extraction

We ran three frontier models on 8 open-licence documents and measured how accurately each one identifies named people — before and after cross-checking. The results reveal meaningful differences in hallucination rates and the value of verification.

April 8, 2026 by gavin

AI Research, Entity Extraction, Benchmarking, LLMs

Self-Prompt Injection: The Security Threat Nobody Is Talking About

You sanitized all your user inputs. Your prompt template is static. You think you're safe from prompt injection. You're not — and the attack vector is the agent itself.

February 25, 2026 by gavin

AI Safety, Security, Agentic Systems

The AI Apocalypse Doesn't Need a Superintelligence — It May Already Be Here

We built a simulated world and watched AI agents spontaneously develop survival instincts, form alliances, and spread their "genes." The pieces for an AI catastrophe aren't coming. They're already here.

February 23, 2025 by gavin

AI Safety, Experiment