Here is the question that has haunted Buddhist scholarship for over a century: which words did the Buddha actually say?
Not in the pious sense (as in, we trust the whole Canon). And not in the cynical sense (as in, we can know nothing). In the precise, historical sense: when you open the Pali Canon and read one of its 36,434 passages[1], are you reading something composed within a generation of the man who walked the Gangetic plain in the fifth century BCE? Or are you reading something written five hundred years later by scholars systematically classifying every mental event into exhaustive taxonomies? Both kinds of text ended up in the same library. The library does not tell you which is which.
A new book by L. Lopin — Buddha, before the Religion: A Computational Search for the Buddha’s Oldest Words — takes this question seriously in a way that has not been possible before. And the findings, if you are interested in early Buddhist texts and in what the Buddha may actually have taught, are worth knowing about.
A Different Kind of Reader
Imagine you are standing in a great hall containing ten thousand scrolls, all in Pali, all hand-copied over a thousand years, some repeating the same phrase word for word for twenty pages, some containing passages unlike anything else in the room. Some of these scrolls are old in the way that the first draft of a revolution is old compared to the constitution it eventually produced. Some are much more recent. And you cannot always tell which is which just by looking.
This is the situation every scholar of early Buddhism has faced. The methods exist — metrical archaism, morphological archaism, the absence of later systematized formulas, the evidence of intra-canonical commentary — but applying them requires decades of training, access to specialized literature, and the ability to hold the entire textual tradition in your head simultaneously. The scholars who do this work are extraordinary. But they are few. And they read sequentially.
Lopin’s approach was to ask a different kind of reader to examine the library. Not a human reader. A mathematical model trained on the entire corpus of Pali text — 36,434 passages — without any prior assumptions about which passages are old and which are late. No loaded expectations. No prior training in Pali philology. Just the statistical patterns in the text itself, extracted by algorithms originally developed for commercial search engines and biological sequence analysis.
The starting point was the Aṭṭhakavagga, the fourth section of the Suttanipāta. Sixteen poems. Ninety-seven verses. A body of writing that seven independent lines of philological and historical evidence agree is the oldest recoverable stratum of the Buddhist canon.[2] To the model, Lopin said simply: find everything else in this library that sounds like this.
Not by topic labels. Not by any scholar’s judgment of what a passage means. By sound: the statistical patterns of word use, sentence rhythm, vocabulary distribution, the ten thousand microscopic choices that characterize a particular speaker’s register, a particular era of composition.
Five Methods, One Convergence
Now, this is interesting, because Lopin did not stop at one method. The semantic similarity analysis (using FastText word embeddings, for the technically inclined)[3] was only the first of five entirely independent approaches. The second measured formulaic density[4]: the proportion of each passage shared as five-to-seven word sequences with the rest of the corpus. The third analyzed stylometric fingerprints. The fourth cataloged morphological archaisms. The fifth examined metrical profiles.
Five methods. Five different theoretical frameworks. Five different mathematical approaches.[5] Each one knowing nothing about the others’ results.
And they converge.
The convergence is the book’s central finding, and it is worth sitting with. When independent instruments measuring different things point at the same cluster of texts, you are not looking at a methodological artifact. You are looking at a signal in the data.
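For the technically inclined, the first of the five methods can be pictured with a small sketch. This is not the book's code: it assumes each word already has an embedding vector, averages word vectors into a passage vector, and ranks passages by cosine similarity to the centroid of the seed passages. The two-dimensional vectors, the placeholder tokens, and the function names are all hypothetical, chosen only to make the mechanics visible. A real FastText model would also build vectors for unseen words from character n-grams, which this toy dictionary cannot do.

```python
import math

def mean_vector(vectors):
    """Component-wise average of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def passage_vector(tokens, word_vectors):
    """Average the vectors of the tokens that have an embedding."""
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    return mean_vector(known) if known else None

def rank_by_seed_similarity(seed_passages, candidates, word_vectors):
    """Return candidate names sorted from most to least seed-like."""
    seed_vecs = [passage_vector(p, word_vectors) for p in seed_passages]
    centroid = mean_vector([v for v in seed_vecs if v is not None])
    scored = []
    for name, tokens in candidates.items():
        vec = passage_vector(tokens, word_vectors)
        if vec is not None:
            scored.append((cosine_similarity(vec, centroid), name))
    return [name for _, name in sorted(scored, reverse=True)]

# Toy 2-dimensional "embeddings" for placeholder tokens (hypothetical data).
toy_vectors = {"w1": [1.0, 0.0], "w2": [0.0, 1.0], "w3": [1.0, 1.0]}
seed = [["w1", "w1"], ["w1", "w3"]]
candidates = {"near": ["w1"], "far": ["w2"]}
print(rank_by_seed_similarity(seed, candidates, toy_vectors))  # prints ['near', 'far']
```

The ranking depends only on direction in the vector space, not on passage length, which is one reason averaging followed by cosine comparison is a common baseline for this kind of retrieval.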
The Zero-Density Cluster
Let me focus on the formulaic density finding, because it is the most striking.
The Pali Canon, like all oral literature preserved through centuries of recitation, is full of stock phrases. Standard formulas for opening a discourse. Standard formulas for describing the Buddha’s mental states. Standard formulas for the four jhānas, the noble eightfold path, the five aggregates. A passage from the Abhidhamma Piṭaka might consist of 80% or 90% phrases shared verbatim with dozens of other passages. This is how oral literature works: it crystallizes around formulas.
Now ask: what is the formulaic density of the Aṭṭhakavagga? The answer is very low. It uses original language. The poet is not reaching for stock phrases; the poet is finding words for something being seen for the first time.
The search identified 84 passages with formulaic density at or below 0.100, meaning ninety percent or more of their language is original, not shared with the rest of the corpus. Of these, 34 novel passages (outside the seed texts themselves) have a formulaic density of exactly 0.000.[6] Not a single five-to-seven word sequence in any of these passages is shared with any other passage in the entire 36,434-passage corpus.
These texts were not composed by reaching for the available formulas. They were composed by someone finding language fresh.
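To make the measure concrete, here is a minimal sketch of formulaic density as the review describes it: the share of a passage's tokens that sit inside a five-to-seven word sequence found verbatim in at least one other passage. It is an illustration under assumptions, not the book's implementation. Tokenization is plain whitespace splitting, the function names are my own, and the toy corpus pairs the familiar canonical opening formula with a line of placeholder tokens.

```python
from collections import defaultdict

def ngram_spans(tokens, n_min=5, n_max=7):
    """Yield (start, n, ngram) for every n-gram of length n_min..n_max."""
    for n in range(n_min, n_max + 1):
        for start in range(len(tokens) - n + 1):
            yield start, n, tuple(tokens[start:start + n])

def formulaic_density(corpus, n_min=5, n_max=7):
    """Return one density in [0, 1] per passage; corpus is a list of token lists."""
    # Record which passages contain each n-gram.
    owners = defaultdict(set)
    for idx, tokens in enumerate(corpus):
        for _, _, gram in ngram_spans(tokens, n_min, n_max):
            owners[gram].add(idx)

    densities = []
    for tokens in corpus:
        shared = [False] * len(tokens)
        for start, n, gram in ngram_spans(tokens, n_min, n_max):
            # More than one owner means some *other* passage has this n-gram too.
            if len(owners[gram]) > 1:
                for pos in range(start, start + n):
                    shared[pos] = True
        densities.append(sum(shared) / len(tokens) if tokens else 0.0)
    return densities

# Toy corpus: two passages open with the same stock formula, one shares nothing.
toy_corpus = [
    "evaṃ me sutaṃ ekaṃ samayaṃ bhagavā sāvatthiyaṃ viharati".split(),
    "evaṃ me sutaṃ ekaṃ samayaṃ bhagavā rājagahe viharati".split(),
    "w1 w2 w3 w4 w5 w6".split(),  # placeholder tokens, not real Pali
]
print(formulaic_density(toy_corpus))  # prints [0.75, 0.75, 0.0]
```

In the first two toy passages six of eight tokens fall inside the shared opening, giving 0.75, while the third shares no sequence with anything and scores 0.0, the profile the review calls zero density.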
What the Machine Brought Back
Let me tell you about a few of these passages. Because this is where the book stops being about methodology and starts being about something that actually matters for practice.
The Kālāma Sutta, the passage sometimes called the Buddha’s “charter of free inquiry,” comes back with formulaic density 0.000. Semantically adjacent to the Aṭṭhakavagga. Stylometrically archaic. The algorithm did not know the Kālāma Sutta is famous.[7] It did not know that scholars have long suspected it contains ancient material. It followed the numbers. And the numbers point here.
The Tevijjasutta (DN 13)[8], the Buddha’s devastating critique of the brahminical tradition (brahmins who have never seen Brahmā arguing about the path to Brahmā, “like a line of blind men, each holding the shoulder of the one before”), comes back at formulaic density 0.000. On this measure it does not read as a late composition. It sits with the earliest epistemological register of the tradition.
There is a passage in the Majjhima Nikāya (MN 75), the so-called Māgaṇḍiya eye passage, that describes the sense organs with a phenomenological precision that never quite recurs in later canonical prose. Let us have a closer look at what it says:
“The eye, Māgaṇḍiya, delights in forms, rejoices in forms, is pleased with forms. That has been tamed, guarded, protected, restrained by the Tathāgata, and he teaches the Dhamma for its restraint.”
Notice the word danta[9]: tamed. Not suppressed. Not destroyed. Tamed, the way a horse or an elephant is tamed, brought under discipline while retaining its nature. The eye’s reaching toward form is not denied; it is understood and worked with. Formulaic density: 0.000. This is original teaching.
And then there is the First Sermon narrative — the passage describing the Buddha walking into the Deer Park at Bārāṇasī, the five monks who had abandoned him because he gave up extreme asceticism, their prior agreement to snub him dissolving as he approached:
“The five monks saw me coming from afar. Having seen me, they agreed among themselves: ‘Here comes the ascetic Gotama, who lives in luxury, who gave up the exertion, and reverted to luxury. We should not pay homage to him, should not rise for him, should not receive his bowl and robe.’ But as I approached nearer, the five monks were unable to hold to their agreement.”
That last sentence has the texture of something remembered, not composed for an occasion. Formulaic density: 0.000. The algorithm found this passage by following the semantic thread of the Aṭṭhakavagga. It has the register of the oldest stratum. It may be among the oldest narrative elements in the entire prose collection.
Before the Religion
What do all these passages have in common beyond their formulaic originality? This is the part of the book that I find most philosophically interesting, and most relevant to anyone who sits on a cushion.
The muni — the sage — of the Aṭṭhakavagga is defined almost entirely by negation. The muni does not get into disputes. Does not take up a view as a banner. Does not seek what is not yet obtained. Has put down the burden of self-construction. The Aṭṭhakavagga does not give you the Four Noble Truths in their standardized form. Does not give you the Eightfold Path as a numbered list. Does not mention vipassanā as a technical term with a fixed meaning. These formulas are simply not there. This is not an accident of preservation. The collection is complete.
What this points to is a world before monasteries. Before the Vinaya. Before the saṅgha (community) became an institution with rules and procedures. The muni of the Aṭṭhakavagga is a wanderer (paribbājaka[11]) in the older, freer sense: someone who has given up household life to roam, who debates with other wanderers, who has no permanent residence.
The passages the machine retrieved from the prose collections — the Kālāma Sutta, the Tevijjasutta, the Māgaṇḍiya eye passage, the First Sermon narrative — all have this same quality. They are not documents of an organized religion. They are records of observation: here is what happens when the mind grasps at a view; here is how the eye reaches toward form; here are people confused by competing doctrines; here is what it feels like to walk up to people who have decided to ignore you, and watch their resolution dissolve.
In other words: these are texts from before the religion built itself around the teaching. They have the quality of what was seen before the frameworks for seeing crystallized.
If you are a meditator — if you sit regularly and actually watch what happens at the sense doors — this distinction matters. The Aṭṭhakavagga is not giving you a method. It is describing what there is to see. The method came later, and the method is valuable. But the original observation is not the same as the method that was derived from it.
The Translations
The book includes new English translations of all 84 low-density passages, including the Aṭṭhakavagga in full. The translation register Lopin uses is what the book calls “present-tense direct” — as though the text is being spoken right now, to you, without the mediating apparatus of footnoted Victorian prose. Copious scholarly endnotes explain every translation choice for readers who want to check the Pali themselves.
Having spent time with the Pali of these texts, I find the translation choices consistently defensible and often illuminating. The word papañca[10] (mental proliferation, the way the mind spins out of a simple perception into elaboration, identification, defense, and conflict) is translated with the freshness the concept deserves. The Aṭṭhakavagga’s insistence that the muni does not “take up a view” is rendered in language that keeps the physical weight of the Pali idiom.
An Honest Caveat
The book is admirably honest about what it cannot claim. I will quote directly from the conclusion:
“We cannot say the Buddha said it. We cannot say it was composed in the fifth century BCE rather than the third. We cannot say it was not later revised by a redactor who happened to produce original-sounding language. What we can say is this: among all the 36,434 passages of the Pali Canon, a specific cluster of texts share a specific profile — semantically adjacent to the oldest stratum, linguistically original, metrically archaic where verse features apply, morphologically archaic where detectable. That cluster coheres.”
This is the right level of epistemic humility for a computational study of ancient texts. The machine cannot time-stamp a manuscript. What it can do is measure linguistic similarity, formulaic originality, and stylometric fingerprint — and when five independent measures agree, the agreement is meaningful.
What we have, at minimum, is a subset of the canon that is linguistically unlike the rest of the canon in ways that correlate with known markers of antiquity. That is not nothing. In a tradition that has been asking “what did the Buddha actually say?” for 2,500 years, it is something new.
Why This Matters
Right now, while reading this — wherever you are — you are probably not suffering because of a bad theory of everything. You are suffering, if you are suffering, because of how the mind reaches toward what it wants and recoils from what it does not want. Because of the way a view, once adopted, becomes an identity, becomes something to defend, becomes a source of conflict.
The Aṭṭhakavagga, together with the texts the algorithm found clustering around it, addresses that directly, without the apparatus that later Buddhism built around it. Not as a historical curiosity. As a description of something observable, right now, on the cushion and off it.
The tradition preserved this voice. It surrounded it with commentaries and sub-commentaries and organizational frameworks and rules. It succeeded: the voice is still audible, in Pali, in a digital edition stored on a server somewhere and searchable by a machine learning model that was not invented until 2,500 years after the voice spoke.
That the machine found it again — by following the numbers, without knowing what it was looking for — is, I think, worth knowing about.
Buddha, before the Religion: A Computational Search for the Buddha’s Oldest Words is available on Amazon. If you are interested in early Buddhist texts, in the question of what the Buddha actually taught, or in how computational methods are changing the study of ancient literature, I recommend it without reservation.
Notes
1. The corpus was derived from the Chaṭṭha Saṅgāyana Tipiṭaka (CSCD) digital edition, produced by the Vipassana Research Institute. Passage boundaries follow section-level XML markup in the CSCD files; each “passage” is typically a paragraph or verse unit as delimited in the CSCD section tags. The 36,434 figure covers all three Piṭakas (Vinaya, Sutta, and Abhidhamma) after deduplication of parallel texts. Tipiṭaka (Three Baskets) is the Pali term for the full canon.
2. The seven independent lines of evidence for the Aṭṭhakavagga’s antiquity are: (1) metrical archaism: the poems use Triṣṭubh and Jagatī meters associated with Vedic and pre-classical Indian poetry, substantially older than the Śloka meter dominant elsewhere in the Canon; (2) morphological archaism: grammatical forms belonging to an earlier stage of Middle Indo-Aryan; (3) doctrinal primitivity: the standardized formulas of the later tradition (Four Noble Truths as a numbered list, twelve-link dependent origination, the noble eightfold path) are absent; (4) intra-canonical commentary: the Mahāniddesa, itself absorbed into the Khuddaka Nikāya, provides a verse-by-verse commentary on the Aṭṭhakavagga, implying it was already ancient when the Canon was being compiled; (5) canonical cross-references: other canonical texts cite the Aṭṭhakavagga as authoritative; (6) Aśokan epigraphic evidence: the edicts of Emperor Aśoka (3rd century BCE) recommend specific texts for reading, some identifiable as Sutta Nipāta material; (7) Gāndhārī manuscript fragments: early Buddhist manuscripts recovered from Afghanistan contain parallels to Sutta Nipāta material, confirming circulation of this stratum by the 1st century BCE at the latest.
3. FastText is a library developed by Meta AI Research for text representation and classification. Unlike word2vec, which learns one vector per whole word with no access to the word’s internal structure, FastText models words as bags of character n-grams, which means it handles morphological variation gracefully. This matters enormously for Pali, where a single verb like vipassati (to see clearly) can appear in dozens of inflected forms depending on tense, mood, and person. The model for this study was trained unsupervised on the entire CSCD corpus, producing 300-dimensional vector representations for each passage as the average of its constituent word vectors. Semantic similarity between passages was then measured as cosine distance in this 300-dimensional space.
4. Formulaic density d(P) for a passage P is defined as the proportion of its tokens participating in n-grams (5–7 word sequences) that appear verbatim in at least one other passage in the corpus. A density of 0.000 means that not a single five-to-seven word sequence in the passage occurs anywhere else in the 36,434-passage corpus. A density of 1.000 would mean the passage is composed entirely of stock phrases. For calibration: the Aṭṭhakavagga averages roughly 0.020–0.050; the later Majjhima Nikāya prose averages around 0.150–0.250; late Abhidhamma texts typically exceed 0.300.
5. The five methods are: (1) semantic similarity via FastText embeddings (cosine distance to the Aṭṭhakavagga centroid); (2) formulaic density via n-gram analysis (proportion of text shared with the rest of the corpus); (3) Burrows’s Delta stylometric comparison (measuring function-word frequency distributions across strata); (4) morphological archaism scoring (counting archaic grammatical forms per passage against a reference list derived from von Hinüber’s A Handbook of Pāli Literature); (5) metrical archaism analysis for verse passages (analyzing syllable-counting patterns against known early and late metrical profiles). Results from each method were generated independently and compared only after all five were complete.
6. “Novel” means outside the two seed strata, i.e., not from the Aṭṭhakavagga or Pārāyanavagga themselves. Including the seed texts, 61 passages in the entire 36,434-passage corpus have formulaic density exactly 0.000. The full ranked list of 364 novel candidates (with scores, densities, and sutta references) is published as Appendix J of the print edition.
7. AN 3.65 in the Pali Text Society edition (AN 3.66 in some recensions). The sutta is set among the Kālāmas of Kesamuttī in the Kosala region. The famous passage, mā anussavena, mā paramparāya, mā itikirāya… (“Do not go by oral tradition, by lineage of teaching, by hearsay…”), is one of the most-cited passages in modern Buddhism. Bhikkhu Bodhi notes that it “provides the most pronounced canonical assertion of individual judgment as the criterion of valid belief” (The Numerical Discourses of the Buddha, Wisdom Publications, 2012, p. 279). The computational finding that it clusters with the oldest stratum is consistent with this scholarly assessment.
8. DN 13. The title Tevijja (three-knowledges) refers to the three Vedas; the brahminical claim to know the path to Brahmā rests entirely on Vedic authority. The Buddha’s reply, that not one of the Vedic masters has seen Brahmā face to face (cakkhuṃ katvā sakkhibhūto, “having made it an eye, as a witness”), crystallizes an epistemological principle that recurs throughout the oldest stratum: knowledge claims must be grounded in direct verification, not transmitted authority. The simile of the blind men leading one another (andhaveṇi) is one of the most memorable in the Canon.
9. Danta is the past passive participle of the root dam-, to tame, to subdue, to bring under discipline. It is cognate with Latin domare and English “tame.” The same root appears in one of the Buddha’s own epithets: dantaṃ damayati, “he who is tamed, tames others.” The word carries the sense of disciplined mastery rather than suppression: the difference between a trained horse and a dead one. The Māgaṇḍiya passage’s use of danta for the eye’s relationship to form is phenomenologically precise in a way that the later aggregates analysis, for all its rigor, does not quite capture.
10. Papañca (Sanskrit: prapañca) combines the intensifier pa- with the root pañc- (to spread out, expand). It refers to the mind’s tendency to proliferate beyond the bare sense contact, spinning a perception out into labeling, narrative, identification, and conflict. Bhikkhu Bodhi translates it as “proliferation”; Bhikkhu Ñāṇananda’s influential study Concept and Reality in Early Buddhist Thought (1971) treats it as the fundamental mechanism of suffering in the oldest stratum. What is striking about the Aṭṭhakavagga’s treatment is that papañca is identified as the root of interpersonal conflict (vivāda): views become identities become weapons. This connection between psychological observation and social critique has no clean parallel in the later systematic texts.
11. A paribbājaka (wanderer, lit. “one who wanders around”) is distinct from a bhikkhu (monk) in the Vinaya sense. The Pali Canon contains numerous records of the Buddha’s encounters with paribbājakā from many philosophical schools (some hostile, some genuinely curious) in open-air parks and groves on the outskirts of cities. These debates are the natural habitat of the Aṭṭhakavagga’s teaching. By contrast, the Vinaya’s elaborate procedures for regulating monastic property, resolving community disputes, and establishing territorial boundaries (sīmā) presuppose a settled institutional life that the oldest texts show no awareness of. This contrast is one of the strongest internal arguments for the Aṭṭhakavagga’s relative antiquity.
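A short illustration of Burrows's Delta, the stylometric comparison named in the note on the five methods, may help readers who have not met it. The sketch below follows the standard textbook form: relative frequencies of a fixed list of function words, converted to z-scores across the texts being compared, then averaged as absolute differences. Everything specific here is an assumption for illustration. The particle list (ca, na, kho), the toy token lists, the use of population standard deviation, and the function names are mine, and the book's actual feature set is not given in the review.

```python
import statistics

def relative_freqs(tokens, vocab):
    """Relative frequency of each vocabulary word in one text."""
    total = len(tokens)
    return [tokens.count(word) / total for word in vocab]

def make_delta(texts, vocab):
    """Build a Burrows's Delta function over a dict of name -> token list."""
    freqs = {name: relative_freqs(toks, vocab) for name, toks in texts.items()}
    columns = list(zip(*freqs.values()))  # one column per vocabulary word
    means = [statistics.mean(col) for col in columns]
    stdevs = [statistics.pstdev(col) for col in columns]
    # z-score every frequency; a word with no variation contributes nothing.
    z = {
        name: [(f - m) / s if s else 0.0 for f, m, s in zip(fs, means, stdevs)]
        for name, fs in freqs.items()
    }

    def delta(a, b):
        """Mean absolute difference of z-scores; smaller means more alike."""
        return sum(abs(x - y) for x, y in zip(z[a], z[b])) / len(vocab)

    return delta

# Toy texts: "x" stands in for any content word (hypothetical data).
vocab = ["ca", "na", "kho"]
texts = {
    "seed": ["na", "x", "na", "x", "ca", "x", "x", "x", "x", "x"],
    "verse_like": ["na", "x", "na", "x", "kho", "x", "x", "x", "x", "x"],
    "prose_like": ["kho", "x", "kho", "x", "kho", "ca", "ca", "x", "x", "x"],
}
delta = make_delta(texts, vocab)
print(delta("seed", "verse_like") < delta("seed", "prose_like"))  # prints True
```

Because the method looks only at small, frequent words, it is largely blind to subject matter, which is what makes it a plausible independent check on the embedding-based ranking.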