It Was Never Just About Proofs

June 13, 2026 · lukasz

On December 19, 1903, at the meeting of the American Mathematical Society, Derrick Norman Lehmer presented a paper A Theorem in the Theory of Numbers in which he factored 8616460799 as 89681 \cdot 96079. Several years later, he published two books of mathematical tables: “Factor Tables for the first ten millions containing the smallest factor of every number not divisible by 2, 3, 5, or 7 between the limits 0 and 10017000” and “List of prime numbers from 1 to 10006721”. The former received an enthusiastic review in Nature. Other examples of such vintage calculation-driven mathematical research include Cole’s factorization of 2^{67}-1¹, Shanks’s computation of digits of \pi², or Tarry’s work on Graeco-Latin squares³. Calculation-intensive mathematics done by hand was once scientifically valuable and demanded considerable ingenuity. Today, each of these results can be obtained on a computer in milliseconds.

On May 20, 2026, OpenAI announced that their internal AI model disproved the Erdős unit distance conjecture – the first example of an AI-generated solution to a significant long-standing open problem in mathematics. Earlier AI solutions of open problems have been of much lesser caliber, typically involving non-trivial but not groundbreaking combinations of known results and proof techniques from the literature. Most items on the growing list of AI contributions to Erdős problems fall into this category, as arguably do the proofs produced by AxiomProver. The Erdős unit distance problem seems different — it isn’t some obscure, minor question that few people have worked on, but a major open problem in combinatorial geometry.

The Erdős unit distance problem asks: given n points in the Euclidean plane, what is the maximum number of pairs of points that are exactly distance 1 apart? Strictly speaking, this question is still open. OpenAI’s model disproved a conjecture of Erdős which postulated an n^{1+O(\frac{1}{\log \log n})} upper bound on the number of such pairs. This came as a surprise – the conjecture was widely suspected to be true. In its 80-year history, most mathematicians attempted to find a proof, not a disproof. Producing a counterexample turned out to require bringing in involved concepts from class field theory, which most combinatorial geometers are not deeply familiar with.

Reading the companion remarks reveals a common theme in the assessments by leading mathematicians: the AI-generated solution relies on an ingenious and delicate application of algebraic number theory ideas in combinatorial geometry, generalising an earlier construction of Erdős. Proving the conjecture, as opposed to finding a counterexample, probably would’ve been a more striking achievement for an AI model – it would’ve likely required inventing new concepts and proof techniques. OpenAI’s model instead found a clever connection with another area of mathematics by exploring the search space of existing mathematical artefacts (definitions, theorems, proofs, methods, structures, ideas, …) more comprehensively than a human would. The difference with earlier AI solutions to mathematical open problems is of degree, not kind – more of the same, just better.

This doesn’t diminish the achievement, but positions it within the broad category of “search and recombination” of existing ideas. Such “recombination”, at the level required for resolving something like the Erdős unit distance conjecture, is extremely difficult analytically – far beyond the capabilities of the vast majority of human population. But that’s not all there is to mathematics. Many mathematicians would argue that theory building is central to their field, while solving concrete problems plays a secondary role. The proofs are only interesting insofar as they lead to a deeper general understanding of the subject. Proofs of long-standing conjectures typically do.

Perhaps, in the future, the value of papers which solve open problems without introducing substantial new theory will simply diminish, similarly to how the value of papers solving concrete computational problems diminished after the invention of electronic computers.

There seems to be a growing recognition in the mathematical community that their discipline is about to change significantly. Timothy Gowers thinks that training PhD students has just gotten harder since LLMs became good enough to solve PhD-level “gentle problems”. Jeremy Avigad urges mathematicians to respond to the challenges and opportunities of AI. David Bessis predicts the fall of the theorem economy. Prominent mathematicians signed the Leiden Declaration on Artificial Intelligence and Mathematics.

I used to dabble in some research-level theorem-proving activity within theoretical computer science, but I am no proper mathematician or AI researcher. At this point, though, my guesses are as good as anyone’s. We know how to train large artificial neural networks, but we don’t truly understand the internal mechanisms behind their learned behaviors. The sole fact that competent experts can have widely diverging opinions on the future development of AI indicates that the future is uncertain, that different plausible assumptions can lead to different conclusions. We’re on an exponential⁴ curve of a technology we don’t fully comprehend. LLMs went from basically useless at mathematical reasoning in 2023 to silver medal at IMO in mid-2024 to gold medal at IMO in mid-2025 to significant open problems in mid-2026.⁵ Where is the limit?

There are two main possibilities for how the future unfolds.

AI as normal technology. More of the same, just somewhat better. The reliability problem is never fully solved, except perhaps in verifiable domains by constraining the LLMs with external frameworks. The jagged frontier remains forever jagged. Continual learning, world models, and other promising paradigms don’t progress much and we’re left with the same basic architecture of highly capable but unreliable models with the memory of a dementia patient, inhumanly good at intellectual “search and recombination” but limited in adaptive learning, long-term coherence, theory building, and abstract concept creation. The scaling laws that propelled pre-training forward don’t carry reinforcement learning nearly as far. AI soon reaches capability saturation. Societal and economic impacts are profound, but contained. Coding is automated, but software engineering isn’t. In mathematics, AI-assisted proofs and formal verification become the norm, but humans still drive mathematical progress with new concepts and theories. AI-powered scientific discovery becomes common, but human scientists stay in the loop and remain responsible for all fundamental breakthroughs. Wars are fought with semi-autonomous drones, but someone still decides where to send them. Knowledge work economy undergoes seismic shifts, but humans are still needed to compensate for AI’s cognitive gaps. Later on, blue collar workers experience similar changes. Plumbers aren’t safe – ordinary “engineering grind” appears sufficient to advance existing AI robotics to the point where it operates capably in the physical world, perhaps under light human supervision.
Intelligence explosion. More of the same, just much better. In addition to ever increasing capability, AI models become consistently more reliable. The jaggedness vanishes, or the aptitude floor rises so high that we cease to notice it. Current cognitive deficits of LLMs get fixed with new model architectures, continual learning, or just more of the same. Maybe scaling laws do reflect some deeper aspect of reality? Maybe there is a relatively simple recipe for constructing intelligence, and once you hit upon it, it’s enough to just keep turning up the dials?⁶ Maybe it’s not so hard to find the right knobs to turn? Maybe sustained coherence, adaptive learning, and capacity for abstraction naturally emerge from sufficiently advanced “search and recombination” abilities paired with a long-term memory system? If this works out, we get full AGI, recursive self-improvement, and probably Artificial Superintelligence (ASI) shortly afterwards. Most likely, humans are left behind in a reality they no longer control or understand. This would be such a radically different world that I can’t sensibly say much about it. But, yes, in this scenario there is a chance we might all be dead in 20 or 30 years. Past performance is not a guarantee of future survival.

The skeptic in me leans heavily toward the first scenario, demanding strong evidence for anything as extraordinary as the second one. There isn’t sufficient evidence to conclude that an intelligence explosion will happen anytime soon. And no, handwaving “exponential growth” doesn’t count as sufficient evidence in my book – one needs to explain, with a non-speculative evidence-backed argument, why the growth should continue far enough and apply broadly enough. This argument cannot be made without a more fundamental scientific understanding of general intelligence than we currently have.

The scientist in me almost expects further scaling and recursive self-improvement to fail for being too easy. One doesn’t simply stumble upon a magic formula that solves all problems. There is no perpetuum mobile, no elixir of eternal life, no philosopher’s stone⁷. The Universe is much too perfidious to permit them. It makes you pay for every scrap of progress with sweat, toil, and tears of disappointment. There are no prophets and no revelations. Understanding is always hard-won, built on the corpses of countless failures.

Then again, there is the “bitter lesson”. Reality is what it is – and if the reality of building intelligence happens to be not much more than turning up the dials…

By now, there is enough circumstantial evidence to conclude that the intelligence explosion scenario is at least a non-negligible possibility. If the general trends of the past few years continue, we may one day wake up in a world with ASI. But will the trends continue?

If so, progress won’t be solely based on vanilla LLMs. With RL post-training and agent harnesses, we’re already past that.

Cole, F. N. On the factoring of large numbers. Bulletin of the American Mathematical Society 10, no. 3 (Dec. 1903): 134–137.↩︎
Shanks, William. On the Extension of the Numerical Value of π. Proceedings of the Royal Society of London 21, nos. 139–147 (1873): 318–319.↩︎
Tarry, G. “Le problème des 36 officiers.” Compte rendu de l’Association française pour l’avancement des sciences, 29e session, Paris, 1900, première partie, 122–123.↩︎
Of course, in nature every exponential eventually hits some hard limits and plateaus. In practice, this doesn’t matter much if you’re close to the beginning of the curve.↩︎
For LLM successes on IMO problems, one could still reasonably argue that these might’ve been achieved via sophisticated “in-distribution pattern matching” – IMO-like problems are relatively well-represented in the training data. They’re difficult for gifted high schoolers, but far from research-grade. IMO problems are good tests of mathematical reasoning for humans, because any “pattern” behind them would be so complex that no human could conceivably learn to solve them without acquiring general reasoning and analytical problem-solving skills. In contrast, AI could potentially “discover the pattern” of IMO-like problems without learning how to reason mathematically in general. The significance of AI solving a major long-standing open problem in mathematics is that any attempt at explaining this away as “in-distribution pattern matching” must stretch the definition of “pattern matching” so far as to make the distinction between “pattern matching” and “analytical reasoning” practically meaningless. The “stochastic parrot” view on RL post-trained LLMs lost decisively. A weaker “meta” version could still be defensible: the models might be “parroting” general reasoning patterns latent in the training data. But that’s effectively reasoning within existing methodology. Maybe AI models can’t yet formulate new theories or research paradigms, but they’ve graduated to capable, if somewhat unreliable, problem solvers, at least as long as the problem can be solved within some broad “conceptual framework” present in the training data.↩︎
As a product of biological evolution, there has to be a relatively simple recipe for constructing intelligence, scalable by small iterative improvements. Since complex cognition (at least up to some point) evolved independently several times (primates, crows, dolphins, octopuses), its origin probably didn’t depend on some unlikely fluke (as e.g. the evolution of multicellular life might have). The interesting question is whether current AI research methodology captures the logical essence of this process on an accelerated timescale. We don’t know that.↩︎
Today, we know how to turn lead into gold via nuclear transmutation, but this discovery didn’t quite have the consequences anticipated by ancient alchemists.↩︎