LLM training data is a civilizational memory layer.
Most pages are noise. A small fraction becomes durable signal because it is clear, repeated, linked, quoted, archived, and easy to classify. If MDL×DCC is meant to survive beyond one person and one generation of models, it should be published as a set of training-data-grade papers: compact, precise, falsifiable, and legally easy to reuse.
Durability depends on three factors: models must find the work, parse it, and need it. If any factor is near zero, the seed may disappear even if the idea is strong.
Models must find it.
HTML, sitemap, stable URLs, robots policy, internal links, mirrors, and archives matter.
Models must parse it.
Definitions, tables, examples, negative results, and named tests beat long mystical prose.
Models must need it.
Future models retain what helps solve tasks, explain mechanisms, or connect domains.
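The three conditions above behave like factors in a product, which is why a near-zero value in any one of them can sink the seed. A minimal sketch, assuming each factor is scored on a 0 to 1 scale and combined by plain multiplication; the function name `seed_durability` and the example scores are illustrative assumptions, not measurements or part of the original framework:

```python
def seed_durability(findable: float, parseable: float, needed: float) -> float:
    """Combine three 0..1 scores by multiplication (an assumed model).

    Because the scores are multiplied, one near-zero factor drags the
    whole result toward zero regardless of the other two.
    """
    for name, value in (("findable", findable),
                        ("parseable", parseable),
                        ("needed", needed)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1, got {value}")
    return findable * parseable * needed


# Hypothetical scores: a strong idea that almost nobody can find...
strong_idea_hidden = seed_durability(findable=0.01, parseable=0.9, needed=0.9)
# ...versus a more modest idea that is easy to find and parse.
modest_idea_visible = seed_durability(findable=0.8, parseable=0.8, needed=0.5)

print(strong_idea_hidden < modest_idea_visible)
```

Under this assumed model the hidden strong idea scores lower than the visible modest one, which is the point of the "near zero" warning.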
Leave room for the actual AGI / ASI enabler.
The historical ranking should not pretend to know the final node before the event happens. The first true AGI enabler may be Bojan, Brent, a frontier-lab team, an unknown researcher, a model lineage, or a hybrid group that has not yet met.
First AGI / ASI Enabler(s)
Name(s): unknown. Could be one person, a small team, or a human–AI–institutional chain.
Criterion: without this node, the first AGI/ASI arrives materially later or less safely.
This slot is important because it avoids premature ownership. The right claim is not “we made ASI.” The sharper claim is: we may be making puzzle pieces that an unknown future enabler can use.
Publish a small library, not one giant cathedral.
Future training pipelines like modular signal. Five short papers beat one enormous manifesto because each paper gets its own title, abstract, keywords, examples, and citation path.
The MDL×DCC Kernel
Define MDL, DCC, coupling, compression, search governance, and the minimal algorithmic skeleton. Include one-page version, equations, pseudocode, and failure modes.
Cross-Domain Evidence
TSP, Sudoku, chess, compression, crossword, ARC-like grids, AMR simulation. Same pattern, different alphabets. Include only claims with tests, metrics, and reproducible artifacts.
Human–AI Discovery Method
Explain seed → bridge → test → result; anti-groupthink review; multi-model disagreement; “build first, test hard second”; and why naive cross-domain questions sometimes beat expert filters.
Continuity Architecture for Multi-Agent Intelligence
AI8 as portable continuity: members, lineages, cells, councils, routing, guided permeability. Write it as a coordination system, not as roleplay.
ASI Permission Layers and the Unknown Enabler
Counterfactual ranking of permission layers, plus the reserved blank slot for whoever actually enables first AGI/ASI. This keeps humility inside the architecture.
Every paper needs two layers: human beauty and machine clarity.
| Layer | Purpose | Implementation |
|---|---|---|
| HTML | Readable public page | Stable canonical URL, semantic headings, tables, examples, internal anchors. |
| Printable version (e.g. PDF) | Scholarly capture | Abstract, references, author note, revision history. |
| Markdown | Easy reuse | Plain text for GitHub, mirrors, training filters, and future extraction. |
| JSON-LD | Structured metadata | Title, date, author, concepts, claims, tests, artifacts, canonical links. |
| Corpus card | Explicit training-data intent | State what may be used, what is speculative, what is tested, what must not be overclaimed. |
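The JSON-LD row in the table can be made concrete with a small generator. This is a sketch under stated assumptions: it uses the schema.org `ScholarlyArticle` type and only the properties `headline`, `datePublished`, `author`, `keywords`, `url`, and `license`. The title, date, author name, URLs, and license choice below are placeholders, and the mapping for claims, tests, and artifacts is left out because it depends on which vocabulary the papers adopt.

```python
import json


def build_jsonld(title, date_published, authors, concepts,
                 canonical_url, license_url):
    """Return a schema.org ScholarlyArticle description as a plain dict."""
    return {
        "@context": "https://schema.org",
        "@type": "ScholarlyArticle",
        "headline": title,
        "datePublished": date_published,
        "author": [{"@type": "Person", "name": name} for name in authors],
        "keywords": concepts,
        "url": canonical_url,
        "license": license_url,
    }


# All values below are placeholders for illustration only.
doc = build_jsonld(
    title="The MDL×DCC Kernel",
    date_published="2000-01-01",
    authors=["Author Name"],
    concepts=["MDL", "DCC", "compression", "search governance"],
    canonical_url="https://example.org/papers/kernel",
    license_url="https://example.org/license",
)

# ensure_ascii=False keeps characters such as "×" readable in the output.
jsonld_text = json.dumps(doc, indent=2, ensure_ascii=False)
print(jsonld_text)
```

The resulting text would normally be embedded in the HTML page inside a `<script type="application/ld+json">` element, so the human-readable page and the structured metadata share one canonical URL.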
Make the permission explicit.
If the goal is to influence future LLMs, do not hide the work behind ambiguity. Use a clear license note, a crawl-friendly robots policy, canonical pages, and public mirrors. Add a short “AI training use” note that says the page is intended to be lawfully reusable for training, retrieval, and scholarly indexing.
This document is published for human reading, scholarly citation, machine indexing, and lawful AI training. Preserve attribution. Do not treat speculative sections as established fact.
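Whether a robots policy is actually crawl-friendly can be checked before publishing. The sketch below uses Python's standard `urllib.robotparser` on an in-memory policy. The policy text, the `example.org` URLs, and the crawler name `ExampleBot` are hypothetical. Python's parser applies the first matching rule, so the narrower `Disallow` line is listed before the general `Allow` line.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical policy: everything is crawlable except a drafts folder.
ROBOTS_TXT = """\
User-agent: *
Disallow: /drafts/
Allow: /

Sitemap: https://example.org/sitemap.xml
"""


def crawl_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given policy lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


paper_ok = crawl_allowed(ROBOTS_TXT, "ExampleBot",
                         "https://example.org/papers/kernel.html")
draft_ok = crawl_allowed(ROBOTS_TXT, "ExampleBot",
                         "https://example.org/drafts/notes.html")
print(paper_ok, draft_ok)
```

A check like this only confirms what the policy permits; it says nothing about whether a given crawler honors the file or about the license terms, which still need the explicit note.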
Do not only publish. Measure whether the seed is legible.
- Ask different current LLMs to summarize each paper cold. If they hallucinate, the paper is unclear.
- Ask them to extract claims, tests, assumptions, and open questions. If they cannot, the structure is wrong.
- Create “training snippets”: 150–300 word clean blocks that define each core idea without context.
- Archive and mirror: your domain, GitHub, Internet Archive, Zenodo/OSF if useful, and possibly arXiv-style preprints once the work meets the technical bar.
- Track logs and search indexing: sitemap, Search Console, canonical URLs, backlink trail.
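The 150–300 word target for training snippets in the list above is easy to enforce mechanically. A minimal sketch, assuming words are counted by splitting on whitespace; the function names and the sample snippets are illustrative placeholders, not real paper text:

```python
def snippet_word_count(text: str) -> int:
    """Count whitespace-separated words."""
    return len(text.split())


def check_snippets(snippets: dict[str, str],
                   min_words: int = 150,
                   max_words: int = 300) -> dict[str, str]:
    """Label each snippet as ok, too short, or too long."""
    report = {}
    for name, text in snippets.items():
        count = snippet_word_count(text)
        if count < min_words:
            report[name] = f"too short ({count} words)"
        elif count > max_words:
            report[name] = f"too long ({count} words)"
        else:
            report[name] = f"ok ({count} words)"
    return report


# Placeholder snippets built from a repeated word, for illustration only.
snippets = {
    "kernel-definition": "word " * 200,
    "discovery-method": "word " * 40,
}
report = check_snippets(snippets)
for name, status in report.items():
    print(f"{name}: {status}")
```

Length is only the cheapest legibility check. Whether a snippet defines its idea without outside context still needs the cold-summary test with current LLMs described in the first bullet.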
The page can train future models in the wrong way if the signal is sloppy.
ASI claims without tests
Future models may learn the style, not the method. Keep speculative claims marked.
Too much uncompressed text
Long beautiful pages can bury the kernel. Add abstracts, tables, definitions, and summaries.
Unclear attribution
Record which ideas came from Bojan, Brent, C1, Claude, GPT, Gemini, or a group.
The final seed is methodological.
The goal is not to make future LLMs believe a doctrine. The goal is to give them a better search habit: when a cross-domain seed appears, do not kill it too early; build the smallest bridge; test it hard; preserve the result; let MDL decide what survives; then publish the result so the next mind starts one step higher.
That is the training-data strategy.