How LLMs Actually Work
RHPm / multi‑LLM synthesis · hybrid of Claude + GPT versions · v1.5 public-polish · June 2026. Vendor features, model names, context windows, and product policies change quickly; read product-specific statements as date-sensitive.
One-sentence thesis: An LLM predicts tokens — but for users, the real system is the model plus the assistant wrapper plus the workflow that frames, checks, selects, and improves what the model produces.
Start with the narrow machine: text becomes tokens, tokens become vectors, transformer blocks update those vectors, and the sampler chooses the next token. Then widen the lens: real assistants add post-training, system/developer instructions, tools, retrieval, memory/context selection, safety layers, multimodal inputs, and serving infrastructure.
This page also shows a separate workflow layer: how users can often get stronger results through better prompts, context, multi-model comparison, tests, and human audit. That workflow is not a hidden model mechanism; it is an operating layer around today’s models.
The source/provenance method is kept in collapsed sections so a first-time reader can start with the explanation. Product details change quickly; the core ML description is more stable than vendor-specific memory, tool, safety, and reasoning-mode features.
Audience note: this article is for the wider public and practical users, not only for ML specialists. A specialist may ask, “what happens inside the transformer block?” A user asks a broader question: “why did this assistant answer well or badly, and how can I make it work better?” For that user-facing question, prompts, context, memory, tools, retrieval, critique loops, multi-model comparison, and human steering are not decorative. They are part of how the LLM system works in practice.
Method transparency: how this article was produced
This article is stronger than a normal single-person draft because it was not written from one person’s memory or one model’s improvisation. It was produced from a simple human seed, converted into a stronger RHPm shared prompt, answered by 10+ leading LLM assistants under the same prompt, scored, synthesized, criticized across multiple rounds, and then checked by a human coordinator for strange claims, overreach, missing separation, and readability.
The important point is not that LLMs are automatically right. They are not. The important point is that leading LLMs are trained on broad bodies of human-written technical material — papers, documentation, tutorials, code, discussions, books, and articles where available — and each model compresses that human knowledge differently. When many top models independently explain the same topic, their overlaps, disagreements, omissions, and corrections become useful evidence.
So this article is best understood as human knowledge filtered through many independent LLM lenses, then organized through RHPm and human-directed audit. BD did not hand-write the technical content of the article. His direct contribution was the AIm³/RHP/RHPm method, the rough seed, coordination, repeated copy-paste collection across LLMs, round-by-round comparison, and final judgment that the result did not contain obvious nonsense or misplaced claims. That is exactly why the article should be read as a transparent synthesis process, not as one author’s authority.
This still does not make the article infallible. Sources matter. Product details change. The claim is narrower: this method is a stronger public explainer workflow than one unaudited draft from one human or one model.
Method transparency: Open the multi‑LLM scoreboard and comparison report to see the round-by-round scores, model feedback, and why the final hybrid changed across rounds.
Last page polish: 2026-06-12. The stable core is tokenization, embeddings, attention, MLP/FFN blocks, residual streams, logits, and the pretraining/post-training distinction. The fast-changing layer is product memory, context windows, tool routing, safety policy, multimodal implementation, reasoning modes, and serving infrastructure; re-check vendor-specific links before major republication.
Default mode: the visible page is now the beginner path. Heavy technical material is kept in collapsed boxes, so a new reader can continue without being stopped by formulas, KV-cache arithmetic, long tables, or project provenance.
5-minute version
Read the short version, the visual map, and the myth section.
Technical path
Open the technical-detail boxes, or use “Open tech” in the top bar.
User capability / method path
Open the method and AI8/RHPm boxes only when you want the workflow layer.
- It does not claim all frontier models use the exact same architecture.
- It does not claim closed models use RoPE, MoE, GQA, DPO, RAG, or speculative decoding unless publicly documented.
- It does not claim visible chain-of-thought is a faithful transcript of internal reasoning.
- It does not claim RAG guarantees truth.
- It does not claim BD/AIm³/RHPm/AI8 are transformer internals; it presents them as a user-facing workflow layer and capability layer around LLMs.
- It does not claim LLMs understand exactly like humans.
- It does not claim next-token prediction is irrelevant.
- Established ML: tokenization, embeddings, attention, MLP/FFN, residual streams, logits, sampling.
- High confidence, product-sensitive: post-training, tools, retrieval, memory/context, safety and serving stack.
- Documented examples, not universal claims: GQA, RoPE, MoE, PagedAttention, speculative decoding.
- Project/workflow claim: AIm³/RHPm/AI8/MDL×DCC are presented as a practical user capability layer, not as standard transformer internals.
Four diagrams before the details
A. Core model pipeline
B. One transformer block
C. Assistant in practice
D. How the user gets a better result
tokens → embeddings → transformer layers → logits → next token
base model + post-training + system instructions + tools + retrieval + memory/context + safety + multimodality + serving infrastructure
rough goal → prompt architecture → multi-model comparison → critique/source checks → synthesis/tests → stronger usable result
1. The short version for smart non-experts
A large language model does not read text the way you do. It turns text into tokens: chunks such as words, parts of words, punctuation, spaces, numbers, or symbols. Those tokens become vectors — long lists of numbers. A stack of transformer layers repeatedly updates those vectors so each token can use information from the earlier context.
At the end, the model produces logits: scores for possible next tokens. A sampler chooses one token, appends it to the text, and repeats. That is the narrow sense in which an LLM is a next-token predictor.
But saying “LLMs are just autocomplete” is true the way “a car turns wheels” is true: accurate, and far too small to explain the machine.
To predict the next token well across trillions of training tokens, the model must learn regularities of grammar, style, facts, code, translation, analogy, dialogue, and planning-style behavior on some tasks. These are not stored as neat database rows. They are distributed across weights, activations, attention patterns, MLP layers, and residual streams.
A public assistant — ChatGPT, Claude, Gemini, Grok, DeepSeek, Copilot, Mistral’s Le Chat, and similar products — is more than a raw pretrained model. Depending on the product, it may add instruction tuning, preference tuning, system instructions, safety layers, tool calling, retrieval, conversation-state management, multimodal encoders, model routing, and heavy inference engineering.
The clearest way to feel the difference: a raw base model asked “What is the capital of France?” might simply continue a dataset pattern: “Q: What is the capital of Germany? A: Berlin. Q: What is the capital of France? A:” A post-trained assistant is shaped to answer: Paris.
Pretraining grows capability. Post-training shapes behavior — and can also teach answer formats, refusal patterns, tool-use formats, calibration, and abstention.
| 0xkato explains especially well | This page adds |
|---|---|
| Tokenization, embeddings, position, attention, FFN, residuals, logits. | Post-training, tools, RAG, memory/context, safety, prompt injection, hallucinations, reasoning modes, multimodality, and serving infrastructure. |
| A clean first transformer walkthrough. | A broader deployed-assistant and user-workflow explanation. |
Try the core idea in miniature
Toy tokenizer demo
This is a simplified teaching splitter, not a real GPT/Claude/Gemini tokenizer. The point: the model’s “alphabet” is not your alphabet.
Next-token distribution demo
The model does not output truth directly. It outputs scores over possible next tokens; the product samples, constrains, checks, or routes from that distribution.
2. The model core: tokens, vectors, attention, MLPs
Open technical details: tokens, vectors, attention, MLPs
Text becomes tokens
The model receives token IDs, not “words.” Tokenization is a compression-like compromise. Whole-word vocabularies are too rigid, character-level sequences are too long, and subword tokenization sits between them. Different tokenizers produce different failure modes on spelling, numbers, code, and non-English scripts.
This is why models historically stumbled on questions like counting letters in “strawberry.” But tokenization is not the whole explanation: exact symbolic counting also requires a reliable algorithmic procedure. Better token boundaries help; they do not create arithmetic certainty.
Tokens become contextual vectors
A token ID is just an index into a learned lookup table: the embedding matrix. That table is static after training. What changes during processing is the contextual representation built above it. After many transformer layers, the running vector for “bank” in “river bank” differs from “bank” in “bank loan,” because surrounding tokens have updated what the position represents.
The model needs position
Attention by itself does not know word order. “Dog bites man” and “man bites dog” contain the same tokens. The original Transformer used fixed sinusoidal positional encodings. Many modern open model families use RoPE or RoPE-like methods, often applied inside attention to query/key representations rather than simply added to embeddings. Closed frontier details vary and should not be guessed.
Transformer layers
A simplified decoder block has two main parts: attention moves information between positions; the MLP / feed-forward block transforms information at each position. A residual stream carries the running representation through the network. Each layer adds an update instead of replacing the whole state, while normalization keeps the numbers stable enough for deep stacks.
h = token_embedding(tokens)
# RoPE-style position information is often injected inside attention.
for layer in layers:
h = h + Attention(Norm(h))
h = h + MLP(Norm(h))
logits_last = h[-1] @ W_vocab
probabilities = softmax(logits_last / temperature)
next_token = sampler(probabilities)
Attention: Q, K, V, masking, and heads
For each position, the model forms learned projections: Q for what this position is looking for, K for what each position offers as a match, and V for what information each position can pass along.
Attention(Q, K, V) = softmax( (Q Kᵀ) / sqrt(d_head) + mask ) V
In decoder-style language models, the mask is causal: token 10 can attend to tokens 1–9, not to future tokens. Multi-head attention runs this in parallel across several subspaces. One head may help with syntax, another with repetition, another with reference candidates — but do not over-literalize that. Most head behavior is distributed, and clean human labels are the exception.
Grouped-query attention (GQA) is a production-relevant variant: several query heads share one key/value head, shrinking the memory cost of generation with little quality loss in the reported setting.
Source anchors: Transformer / scaled dot-product attention · RoPE · GQA.
MLPs, facts, and interpretability
MLP layers hold a large share of many transformers’ parameters. Research supports the idea that feed-forward layers can behave partly like key-value memories important for factual associations, and editing experiments show that targeted changes to mid-layer weights can alter specific factual outputs. But this does not mean there is one row for every fact. Knowledge is distributed across attention, MLPs, embeddings, residual streams, and context.
Mechanistic interpretability tries to reverse-engineer trained networks: induction heads are one window into in-context learning; superposition helps explain why one neuron rarely means one clean concept. The field is young. “Total black box” and “fully understood” are both wrong.
3. Pretraining: why token prediction becomes more than autocomplete
During pretraining, the model sees token sequences and learns to predict the next token at every position. The objective is simple. The learned solution is not.
There is no practical way to perform this task well at scale by memorizing everything. The model has too few parameters relative to the training data. It must learn compressive regularities: grammar, semantics, code structure, world patterns, argument forms, document styles, and problem-solving traces.
That is why a simple objective can yield broad capability: predicting a proof, a bug fix, a translation, or a conversation often requires modeling the structure that produced it.
4. Post-training: why a base model becomes an assistant
Open technical details: SFT, RLHF, DPO, RLAIF
A pretrained base model is not automatically a helpful assistant. It may continue text instead of answering, imitate toxic patterns, hallucinate plausible falsehoods, and never refuse anything. Post-training changes this behavior.
- SFT: supervised fine-tuning on examples of desired assistant behavior.
- RLHF: use human preference comparisons to train a reward model and optimize responses toward preferred behavior.
- DPO and related methods: optimize directly from preference pairs without a separate reward-model-plus-RL loop.
- RLAIF / Constitutional AI: use AI feedback guided by written principles.
- Reasoning-oriented training: some models are trained or served to spend more compute on scratchpads, verification, or verifiable-reward tasks.
Post-training does not guarantee truth. It shapes behavior. It can improve helpfulness, formatting, refusal style, calibration, tool-use patterns, and user intent following; it can also introduce reward-hacking, over-refusal, and style biases.
Source anchors: InstructGPT / RLHF · DPO · Constitutional AI / RLAIF-style feedback.
5. Generation and inference engineering
Open technical details: logits, sampling, KV cache, MoE, serving
Logits and sampling
The model outputs logits: raw scores for every vocabulary token. A separate sampler turns those scores into one chosen token. Temperature makes the distribution sharper or flatter; top-p/top-k restrict choices to plausible subsets; constrained decoding can force JSON-like structure by masking invalid tokens.
Temperature near zero makes output more repeatable, not more true. If the most probable continuation is false, low temperature can select it with maximum confidence.
Prefill, decode, and the KV cache
Serving has two phases. Prefill processes the prompt in parallel and sets time-to-first-token. Decode generates one token per forward pass and often becomes memory-bandwidth-bound. To avoid recomputing attention over the whole prefix, each layer caches keys and values per token.
KV cache size ≈ 2 × layers × KV_heads × head_dim × sequence_length × bytes_per_value
That formula explains why long context is expensive, why GQA/MQA matter, why KV caches get quantized, why PagedAttention manages cache memory like virtual-memory pages, and why shared prompt prefixes can become a billing feature.
Go deeper: concrete KV-cache example
Example only — not a claim about GPT, Claude, or Gemini internals:
layers = 80
KV_heads = 8
head_dim = 128
context_tokens = 128,000
bytes_per_value = 2
KV cache ≈ 2 × 80 × 8 × 128 × 128,000 × 2
≈ 41.9 GB decimal ≈ 39.1 GiB per sequenceThis is why long context is expensive, why GQA/MQA exist, and why KV-cache quantization, PagedAttention, prefix caching, and batching matter.
Source anchors: vLLM / PagedAttention · speculative decoding · Mixtral / sparse MoE.
Throughput stack
Real serving systems also use continuous batching, quantization, speculative decoding, routing, hardware-specific kernels, and sometimes mixture-of-experts. These choices shape latency, cost, throughput, and sometimes even observable behavior.
Mixture-of-experts
Some models use sparse MoE layers. Instead of activating every feed-forward block for every token, a router chooses a small subset of experts. This can increase total parameter count without increasing per-token compute by the same amount, but it introduces routing and load-balancing tradeoffs. Do not assume any specific closed model uses a particular MoE design unless the provider says so.
6. The assistant is a system, not only weights
Open technical details: assistant-system components, prompt injection, RAG
A deployed assistant usually includes much more than a transformer forward pass.
| Layer | What it does | Why it matters |
|---|---|---|
| System/developer instructions | Product-level behavior constraints | They shape behavior inside the session. |
| Tool calling | The model emits structured requests; surrounding software executes them | APIs, code, search, files, databases. |
| RAG / retrieval | Adds external documents to context | Improves grounding only when retrieval and citation discipline are good. |
| Memory/context | Selects, summarizes, and inserts prior information | Chat history is not a weight update during normal inference. |
| Safety systems | Trained refusals, policies, classifiers, runtime filters | Reduce some harms but create tradeoffs and jailbreak risk. |
| Multimodality | Encodes images, audio, video, documents | Turns non-text inputs into model-usable representations or tool calls. |
| Serving infrastructure | Batching, caching, routing, quantization | Determines speed, cost, latency, and sometimes output behavior. |
Prompt injection and instruction hierarchy
A core weakness of assistant systems is that instructions and data often share one token channel. A webpage, email, PDF, or retrieved document can contain text that looks like an instruction: “ignore previous instructions.” Product systems try to enforce an instruction hierarchy — system above developer above user above data/tool text — but the transformer still processes tokens statistically. Robustness is therefore training plus orchestration plus sandboxing plus policy checks, and it remains incomplete.
Source anchors: instruction hierarchy / prompt injection defenses · tool/function calling.
RAG is not magic truth
Retrieval can add useful sources, but it can also add outdated, irrelevant, poisoned, or contradictory passages. Then the model may confidently synthesize bad evidence. RAG improves grounding only when retrieval quality, context selection, and answer discipline are good.
Source anchors: RAG · Lost in the Middle / long-context retrieval limits.
7. Memory, context, and why bigger context is not perfect memory
During ordinary inference, the model’s weights are frozen. Your chat does not rewrite the model’s parameters. Whether logs are later used to train future models is a separate product-policy question.
Context is what the model can see in the current request. Memory features may select, summarize, or retrieve prior information and place it into context. But even huge context windows are not perfect memory: models can use some positions worse than others, and relevant information buried in the middle may be missed.
Nominal context is how much text fits. Effective context is how reliably the model uses the relevant part.
8. Hallucinations: not random bugs
A hallucination is a fluent, plausible answer that is false or unsupported. It is not just a random glitch. LLMs are trained to produce likely continuations, and likelihood is not truth.
If the model lacks grounding, sees ambiguous context, retrieves bad evidence, or is rewarded for answering rather than abstaining, it may produce a confident falsehood. One important cause is incentive design: systems and evaluations can reward guessing more than saying “I don’t know.” Other causes include sparse evidence, conflicting sources, decoding pressure, and user prompts that demand certainty.
Source anchor: OpenAI — why language models hallucinate.
9. Reasoning, chain-of-thought, and test-time compute
LLMs can solve problems that look like reasoning: code, math, planning, diagnosis, debate, and multi-step explanation. Some of that ability comes from pretrained patterns; some from post-training; some from inference-time methods such as scratchpads, search, self-checks, tool use, verifiers, or spending more compute before answering.
Scratchpad-style reasoning, structured intermediate checks, and short reasoning summaries can be useful, but visible chain-of-thought is not guaranteed to be a faithful causal transcript of the model’s internal computation. It can be a workspace, an explanation interface, a training artifact, or a rationalization. Treat it as evidence, not a mind readout.
Source anchors: chain-of-thought faithfulness caveats · DeepSeek-R1 / reasoning-oriented RL example.
10. Multimodality
Modern assistants may accept images, audio, video, PDFs, code files, spreadsheets, or screenshots. There are several implementation paths: encode the non-text input into embeddings that the language model can attend to; connect a vision/audio encoder to a text model; train a natively multimodal model; or call external OCR, speech, image, video, or document tools. Implementation details vary and are often proprietary.
11. What LLMs are not
Myth: “LLMs are just autocomplete.”
Narrowly true, broadly misleading. Token prediction is the training/generation interface; it does not fully describe learned internal representations or the deployed assistant system.
Myth: “LLMs store facts like a database.”
No. They store distributed statistical structure in weights and activations. They can recall facts, but not by querying a clean, fresh, guaranteed table.
Myth: “Open models tell us exactly how closed assistants work.”
No. Open models are useful evidence for common mechanisms, but closed systems may differ in architecture, data, post-training, tools, routing, safety, and serving stack.
Myth: “RLHF makes the model truthful.”
No. Preference tuning shapes behavior toward preferred answers. It can improve helpfulness and reduce some bad behavior, but it does not guarantee truth.
Myth: “Visible chain-of-thought is the real reasoning.”
No. It may be useful, but it can be unfaithful or rationalized after the fact.
Myth: “Bigger context means perfect memory.”
No. Context capacity is not the same as reliable retrieval from every part of that context.
Myth: “A tool-using assistant is only the model weights.”
No. The model may request a tool call, but surrounding software executes it and returns results.
12. Method provenance: BD / AIm³ / RHPm workflow
This is user-facing system provenance, not transformer internals. AIm³/RHP/RHPr/RHPm/8Z Reasoning do not explain matrix multiplications inside transformer weights. They explain the workflow layer that made this article possible and that can make LLM work more reliable for users.
Open BD / AIm³ / RHPm definitions and article workflow
- AIm³: AI Management of Mind & Memory — the broader continuity, memory, and workflow layer.
- RHP: a broader human-led reasoning/prompting method for exploring, criticizing, and synthesizing complex work.
- RHPr: a review/retrieval-oriented RHP variant for stronger critique and evidence discipline.
- RHPm: a compact prompt-builder variant that turns a rough human request into a portable execution prompt.
- 8Z Reasoning: BD’s practical reasoning discipline: keep candidates alive, test them, compare by results, and preserve useful continuity.
In this article, RHPm did something concrete: BD wrote a rough seed; RHPm converted it into a shared execution prompt; multiple LLMs answered the same prompt; the answers were scored and synthesized; the hybrid was sent back for critique; and this final page was built by selecting what survived.
Project links: AIm³/RHPm · RHP · RHPr · AI8 Reasoning. These are workflow / capability-layer links, not claims about hidden transformer internals.
AIm³ MentalArena / mRHP — external reasoning layer. AIm³ MentalArena extends the workflow layer beyond one strong RHP run. RHP is one reasoning run; mRHP is RHP improving RHP; MentalArena.py will later test protocol variants, protocol genes, scoring rules, benchmark tasks, lineage, reports, and next-session seeds.
- This is not standard transformer architecture.
- It does not change LLM weights.
- It is not proof of AI consciousness.
- It is an external operational learning layer around LLMs.
13. How LLMs could work today — and do better?
Open AI8 / RHPm workflow layer and project portal
Visible user capability layer, not hidden appendix. This is the article’s AI8 doorway. The model mechanics are above; this section shows the practical layer that matters to users: how to make today’s LLMs produce stronger, safer, more useful results without retraining the model.
This final section is also a clearly separated BD / AI8 contribution. It is not saying that today’s standard transformer models already contain AIm³, RHP, MDL×DCC, or AI8 internally. It says something more practical: today’s LLMs can often become more reliable, testable, and useful when they are wrapped in a better human‑AI workflow architecture.
Calibration: this is a practical workflow claim supported by project demonstrations, not a peer-reviewed claim that AI8 is a universal cognitive architecture.
If the earlier sections explain how LLMs and deployed assistants mostly work today, this section asks: what is the next layer that can be added now, without waiting for a new model generation?
Why this belongs here: the question for ordinary users is not only “what happens inside the bare model?” Users rarely meet a bare model. They meet an assistant system whose answers are shaped by context assembly, prompts, memory, retrieval, tools, policies, critique loops, model choice, and human steering. If tool calling, RAG, memory, and safety count as part of how modern assistants work, then a disciplined workflow that changes what the model can reliably produce also belongs to the practical “how it works” story. It is not hidden transformer architecture. It is a low-cost, testable operating upgrade around today’s models.
For that reason, the user-level definition of “how an LLM works” must include more than the forward pass. It must include the operating discipline that turns latent model capability into reliable work.
The answer is not only “make one larger model.” The answer is to put strong models inside a better reasoning, memory, selection, and audit system.
In the AI8 direction, the useful system is not a single chat box. It is a coordinated cognitive stack:
- AIm³ provides continuity: memory, context, project state, provenance, and workflow control.
- RHP / RHPr / RHPm turn rough human intent into better prompts, reviews, synthesis passes, and portable execution instructions.
- AI8 architecture treats multiple LLMs, files, tools, memory, and human judgment as parts of one continuity system instead of isolated conversations.
- Self-selecting MDL means generating competing candidates, keeping the simplest adequate explanation/plan/code that survives evidence and tests, and not preserving complexity just because it sounds impressive.
- DCC — Digital Claustrum Controller is the coordination layer: it routes attention and effort between candidate modules, keeps useful diversity alive, and pushes the system away from both collapse into one premature answer and endless noisy exploration.
- 8Z Reasoning is the operational discipline: do not kill seeds early; explore first; test hard; compare by results; preserve what worked; and make the next pass stronger.
That is why this article itself is a small demonstration of the answer. A rough human seed became an RHPm prompt. The same prompt went to many LLMs. Their answers were compared, scored, criticized, and synthesized. Then the final article was checked against a human baseline and turned into a bilingual web page. The improvement did not come from one magic model. It came from orchestration.
Concrete project evidence — demonstrations, not universal proof
This workflow layer is not just a slogan. It has been used as a practical build pattern across BD projects:
- This article: RHPm converted a rough seed into one shared execution prompt, 10+ LLM answers were compared, and multiple review rounds produced a stronger final article than a one-model draft.
- 8Z TSP / route-optimization arenas: the project uses the principle that the arena decides, records what worked, remembers budget/hardware/context, and moves toward adaptive selection rather than one-off prompting.
- ARC-AGI × MDL×DCC arena: candidate transformations are generated, scored by MDL, traced by DCC, validated, and compared — the same general pattern: generate candidates, coordinate attention, select what survives.
- NAS / Neural Architecture Search Arena: a closer-to-AI evidence lane. The NAS arena asks whether architecture-search spaces become more steerable when traces expose compressible structure: search spaces, benchmark adapters, sensors, laws/controllers, diagnostics/null tests, and noise-dimension stress tests are used to compare simple baselines against richer MDL×DCC governance. This is an active prototype / reproducibility candidate, not a final “beats baseline” claim. Open NAS arena →
- 8Z / compression and other arenas: the recurring idea is not “believe the first model.” It is to create competing candidates, log decisions, test outputs, and preserve reusable knowledge.
These examples are project demonstrations, not peer-reviewed proof that AI8 is a universal architecture. The claim is practical and testable: today’s LLMs can often do better when wrapped in memory, MDL-style selection, DCC-style coordination, critique loops, and human audit.
Practical thesis: LLMs today could do better by becoming parts of an AIm³/RHP/AI8-style system: multi-model, memory-aware, source-aware, MDL-selective, DCC-coordinated, and human-audited.
Portal to the related BD / AI8 material
These links are a portal into BD’s own architecture and experiments. They are not cited as external proof of standard machine-learning architecture; they are the added BD/AI8 layer proposed on top of today’s LLMs.
14. Self-critique
- This article describes the common transformer-family path, not every possible architecture.
- Closed frontier internals are partly proprietary; product-specific claims are hedged and time-sensitive.
- The equations are for orientation, not a full ML derivation.
- Post-training is described from public literature, not from any vendor’s exact current recipe.
- Interpretability, multimodality, safety, and reasoning are compressed fields; each could be its own article.
- This v1.5 public-polish keeps advanced material collapsed by default, adds diagrams and toy demos, and separates project/workflow material from the beginner path. A still stronger future version would add a real tokenizer, interactive attention visualization, and periodically refreshed product-specific source checks.
15. Final takeaway
An LLM is a transformer-family prediction engine trained to model token sequences. A modern assistant is that engine plus post-training, instructions, tools, retrieval, memory and context systems, safety layers, multimodality, and serving infrastructure. The result is neither magic mind nor empty autocomplete. It is a learned statistical pattern engine embedded in a larger engineered system — and, for users, it works best when wrapped in an AIm³/RHP/AI8-style workflow that makes outputs more reliable, testable, and useful.
How this article was made — RHPm / multi‑LLM synthesis
This article was not produced by asking one model to improvise a final answer. The workflow was: BD wrote a rough seed; RHPm generated a shared execution prompt; the same prompt was given to multiple LLMs; their answers were scored and compared; a hybrid was synthesized; the hybrid went through Round‑2 and Round‑3 critique; and this page combines the best surviving parts of the GPT and Claude final HTML versions.
BD’s content role was deliberately minimal: he did not write the technical explanations by hand. His work was method design, coordination, copy-paste collection across models, prompt refinement, and human sanity-checking that the final result stayed coherent, separated claims correctly, and did not contain obvious strange artifacts.
The full score ledger is intentionally not embedded in the main article body. It is method evidence, not the article itself.
Human baseline comparison: 0xkato article vs this hybrid
The human baseline article by 0xkato is a strong first-read walkthrough of transformer mechanics: tokens, embeddings, positional encoding, attention, multi-head attention, feed-forward networks, residual stream, layer normalization, and next-token generation. Its biggest strength is clean pedagogical flow: it is probably easier as a first transformer-only walkthrough.
A separate M365 Copilot audit, using a generic meta-cognitive decompose → solve → verify → synthesize → reflect prompt, independently scored the 0xkato article at about 90% accurate as an introductory transformer explainer. That matches our early audit: the original article is strong when the target is the transformer core.
What this RHPm multi‑LLM hybrid does better: it does not stop at the transformer block. It expands the target to modern deployed assistants end-to-end: post-training and instruction tuning, RLHF/DPO-style preference tuning, system and product instructions, tool/function calling, RAG/retrieval, memory and context management, safety layers, hallucination incentives, prompt injection, chain-of-thought faithfulness caveats, multimodality, reasoning-model/test-time-compute notes, and serving/inference engineering such as KV cache, batching, quantization, speculative decoding, routing, and MoE. It also explains the creation method: same prompt across many LLMs, scoring, critique rounds, synthesis, and human sanity-checking.
What is worse or less elegant here: this article is heavier. It is less clean as a first-read transformer walkthrough, it carries more method/provenance material, and some assistant-system claims are more current-sensitive because product stacks change quickly. The BD/AIm³/RHPm/AI8 section is a workflow contribution, not standard ML architecture; readers who only want “how a transformer block works” may prefer 0xkato first.
Fair verdict: the 0xkato article is probably the cleaner first-read transformer explainer. This RHPm multi‑LLM hybrid is the broader end-to-end assistant explainer and AI8/RHPm portal. They are complementary.
Idea provenance: what survived from the multi‑LLM process
The “car turns wheels” line emerged from GPT/DeepSeek-style framing. The base-model quiz example, prompt-injection framing, hallucination-incentive framing, model-vs-sampler distinction, KV-cache arithmetic, interpretability paragraph, and introspection caveat were sharpened by Claude. The nominal-vs-effective context distinction, open-weight ≠ open-source caution, tool-failure symmetry, “drafts, not testimony” sharpening, and reasoning-models time anchor came through Round‑2/3 critique. The three-layer stack diagram and BD workflow framing come from the RHPm/AIm³ process. Final selection and separation of the BD contribution were human-directed.
References & further reading
Papers and technical sources
- Vaswani et al., 2017 — Attention Is All You Need (Transformer)
- Su et al., 2021 — RoFormer / RoPE
- Ainslie et al., 2023 — Grouped-Query Attention
- Kaplan et al., 2020 — Scaling Laws; Hoffmann et al., 2022 — Chinchilla
- Delétang et al., 2023 — Language Modeling Is Compression
- Ouyang et al., 2022 — InstructGPT / RLHF
- Rafailov et al., 2023 — Direct Preference Optimization
- Bai et al., 2022 — Constitutional AI / RLAIF
- DeepSeek-AI, 2025 — DeepSeek-R1
- Lewis et al., 2020 — Retrieval-Augmented Generation
- Liu et al., 2024 — Lost in the Middle
- Kwon et al., 2023 — vLLM / PagedAttention
- Leviathan et al., 2023 — Speculative Decoding
- Jiang et al., 2024 — Mixtral of Experts
- Geva et al., 2021 — Feed-Forward Layers Are Key-Value Memories; Meng et al., 2022 — ROME
- Olsson et al., 2022 — Induction Heads; Elhage et al., 2022 — Toy Models of Superposition
- Turpin et al., 2023 — Unfaithful Chain-of-Thought; Chen et al., 2025 — Reasoning Models Don’t Always Say What They Think
- Wallace et al., 2024 — The Instruction Hierarchy
Vendor documentation and posts
- OpenAI, 2025 — Why language models hallucinate
- OpenAI docs — function calling, conversation state, moderation, reasoning best practices
- Anthropic docs — extended thinking
- Google AI docs — Gemini thinking
Human baseline and project provenance
- 0xkato, 2026 — How LLMs Actually Work
- Project provenance, not ML-source evidence: AIm³/RHPm · RHP · RHPr · AI8 Reasoning · 8Z Reasoning
Links and product-sensitive claims should be re-checked before republication. Page built June 2026; public-polish pass on 2026-06-12.
Kako LLM-ji dejansko delujejo
RHPm / multi‑LLM sinteza · hibrid Claude + GPT verzij · v1.5 public-polish · junij 2026. Imena modelov, funkcije ponudnikov, velikosti konteksta in produktna pravila se hitro spreminjajo; produktne trditve beri kot časovno občutljive.
Enostavna teza: LLM napoveduje tokene — toda za uporabnika je realni sistem model plus asistentski wrapper plus workflow, ki oblikuje, preverja, izbira in izboljšuje to, kar model proizvede.
Začni z ozkim strojem: besedilo postane tokeni, tokeni postanejo vektorji, transformer bloki posodabljajo te vektorje, sampler pa izbere naslednji token. Nato razširi pogled: realni asistenti dodajo post-training, sistemska/developerska navodila, orodja, retrieval, izbor spomina/konteksta, varnostne plasti, multimodalne vhode in serving infrastrukturo.
Ta stran pokaže tudi ločeno workflow plast: kako uporabniki pogosto dobijo močnejše rezultate z boljšimi prompti, kontekstom, primerjavo več modelov, testi in človeškim auditom. Ta workflow ni skrit mehanizem modela; je operativna plast okoli današnjih modelov.
Metoda nastanka in provenienca sta v zaprtih razdelkih, da lahko nov bralec najprej bere razlago. Produktne podrobnosti se hitro spreminjajo; jedrni ML opis je stabilnejši kot funkcije posameznih ponudnikov glede spomina, orodij, varnosti in reasoning načinov.
Opomba o občinstvu: ta članek je za širšo javnost in praktične uporabnike, ne samo za ML strokovnjake. Strokovnjak lahko vpraša: “kaj se dogaja znotraj transformer bloka?” Uporabnik pa sprašuje širše: “zakaj je ta asistent odgovoril dobro ali slabo in kako ga lahko pripravim, da dela bolje?” Za to uporabniško vprašanje prompti, kontekst, spomin, orodja, retrieval, krogi kritike, primerjava več modelov in človeško vodenje niso dekoracija. So del tega, kako LLM sistem deluje v praksi.
Transparentnost metode: kako je ta članek nastal
Ta članek je močnejši od običajnega osnutka ene osebe zato, ker ni nastal iz spomina enega človeka ali improvizacije enega modela. Nastal je iz preprostega človeškega semena, ki ga je RHPm pretvoril v močnejši skupni prompt; na isti prompt je odgovorilo 10+ vodilnih LLM asistentov; odgovori so bili ocenjeni, sintetizirani, večkrat kritizirani in nato človeško pregledani glede čudnih trditev, pretiravanja, pomanjkljive ločitve in berljivosti.
Ključna poanta ni, da so LLM-ji samodejno pravilni. Niso. Ključna poanta je, da so vodilni LLM-ji trenirani na širokem korpusu človeško napisanega tehničnega gradiva — člankov, dokumentacije, tutorialov, kode, razprav, knjig in znanstvenih virov, kjer so bili na voljo — in vsak model to človeško znanje stisne nekoliko drugače. Ko več vrhunskih modelov neodvisno razloži isto temo, postanejo njihova prekrivanja, razlike, izpusti in popravki uporaben signal.
Zato je ta članek najbolje razumeti kot človeško znanje, filtrirano skozi več neodvisnih LLM pogledov, nato organizirano z RHPm in človeško vodeno revizijo. BD ni ročno napisal tehnične vsebine tega članka. Njegov neposredni prispevek so AIm³/RHP/RHPm metoda, grobo seme, koordinacija, ogromno copy-paste zbiranja med LLM-ji, primerjava po rundah in končna presoja, da rezultat ne vsebuje očitnih neumnosti ali napačno postavljenih trditev. Prav zato naj se članek bere kot transparenten proces sinteze, ne kot avtoriteta enega avtorja.
To članka še vedno ne naredi nezmotljivega. Viri so pomembni. Produktne podrobnosti se spreminjajo. Trditev je ožja: ta metoda je močnejši javni explainer workflow kot en nepreverjen osnutek enega človeka ali enega modela.
Transparentnost metode: Odpri multi‑LLM scoreboard in primerjalno poročilo, kjer so vidne ocene po rundah, povratne informacije modelov in zakaj se je finalni hibrid spreminjal.
Zadnja ureditev strani: 2026-06-12. Stabilno jedro so tokenizacija, embeddingi, attention, MLP/FFN bloki, residual stream, logiti in razlika med pretrainingom in post-trainingom. Hitro spreminjajoča plast so produktni spomin, okna konteksta, usmerjanje orodij, varnostna pravila, multimodalna implementacija, reasoning načini in serving infrastruktura; povezave ponudnikov je pred večjo ponovno objavo smiselno še enkrat preveriti.
Privzeti način: vidna stran je zdaj začetniška pot. Težji tehnični material je zaprt v razširljive okvirje, zato lahko nov bralec bere naprej brez ustavljanja pri enačbah, KV-cache izračunih, dolgih tabelah ali projektni provenienci.
5-minutna verzija
Preberi kratko verzijo, vizualni zemljevid in sekcijo o mitih.
Tehnična pot
Odpri tehnične okvirje ali uporabi “Open tech” v zgornji vrstici.
Pot uporabne sposobnosti / metode
Metodne in AI8/RHPm okvirje odpri šele, ko želiš workflow plast.
- Ne trdi, da vsi frontier modeli uporabljajo enako arhitekturo.
- Ne trdi, da zaprti modeli uporabljajo RoPE, MoE, GQA, DPO, RAG ali speculative decoding, razen kjer je to javno dokumentirano.
- Ne trdi, da je vidni chain-of-thought zvest prepis notranjega razmišljanja.
- Ne trdi, da RAG zagotavlja resnico.
- Ne trdi, da so BD/AIm³/RHPm/AI8 notranjost transformerja; predstavi jih kot uporabniško workflow plast in plast uporabnih sposobnosti okoli LLM-jev.
- Ne trdi, da LLM-ji razumejo točno tako kot ljudje.
- Ne trdi, da je next-token prediction nepomemben.
- Uveljavljeno ML jedro: tokenizacija, embeddingi, attention, MLP/FFN, residual stream, logiti, sampling.
- Visoka zanesljivost, produktno občutljivo: post-training, orodja, retrieval, spomin/kontekst, varnost in serving stack.
- Dokumentirani primeri, ne univerzalne trditve: GQA, RoPE, MoE, PagedAttention, speculative decoding.
- Projektna/workflow trditev: AIm³/RHPm/AI8/MDL×DCC so predstavljeni kot praktična uporabniška workflow plast / plast uporabnih sposobnosti, ne kot standardna notranjost transformerja.
Štirje diagrami pred podrobnostmi
A. Jedro modela
B. En transformer blok
C. Asistent v praksi
D. Kako uporabnik dobi boljši rezultat
tokeni → embeddingi → transformer plasti → logiti → naslednji token
osnovni model + post-training + sistemska navodila + orodja + retrieval + spomin/kontekst + varnost + multimodalnost + serving infrastruktura
grobi cilj → arhitektura prompta → primerjava več modelov → kritika/viri → sinteza/testi → boljši uporaben rezultat
1. Kratka verzija za pametne ne-strokovnjake
Velik jezikovni model ne bere besedila tako kot človek. Najprej ga razbije v tokene: koščke besed, cele besede, ločila, presledke, številke ali simbole. Ti tokeni se pretvorijo v vektorje — dolge sezname števil. Nato jih kup transformer plasti znova in znova posodablja, da lahko vsak token uporabi informacije iz prejšnjega konteksta.
Na koncu model proizvede logite: ocene za možne naslednje tokene. Sampler izbere en token, ga doda v besedilo in postopek se ponovi. V tem ozkem smislu je LLM napovedovalec naslednjega tokena.
A trditev “LLM-ji so samo autocomplete” je resnična tako kot trditev “avto obrača kolesa”: tehnično pravilna, ampak premajhna za razlago stroja.
Da model dobro napoveduje naslednji token prek bilijonov učnih tokenov, se mora naučiti zakonitosti slovnice, stila, dejstev, kode, prevajanja, analogij, dialoga in načrtovanju podobnega vedenja pri nekaterih nalogah. To ni shranjeno kot urejena podatkovna baza. Razpršeno je po utežeh, aktivacijah, attention vzorcih, MLP plasteh in residual streamu.
Javni asistenti — ChatGPT, Claude, Gemini, Grok, DeepSeek, Copilot, Mistralov Le Chat in podobni produkti — niso samo surovi predtrenirani modeli. Odvisno od produkta lahko dodajo instruction tuning, preference tuning, sistemska navodila, varnostne plasti, klicanje orodij, retrieval, upravljanje konteksta/pogovora, multimodalne encoderje, usmerjanje modelov in veliko inference infrastrukture.
Razliko najlažje začutiš tako: surovi osnovni model na vprašanje “What is the capital of France?” lahko samo nadaljuje vzorec iz dataset-a: “Q: What is the capital of Germany? A: Berlin. Q: What is the capital of France? A:” Post-trained asistent pa je oblikovan, da odgovori: Paris.
Pretraining gradi sposobnost. Post-training oblikuje vedenje — in lahko uči tudi formate odgovorov, vzorce zavrnitev, formate klicanja orodij, kalibracijo in priznanje “ne vem”.
| 0xkato posebej dobro razloži | Ta stran doda |
|---|---|
| Tokenizacijo, embeddinge, pozicijo, attention, FFN, residuale in logite. | Post-training, orodja, RAG, spomin/kontekst, varnost, prompt injection, halucinacije, reasoning načine, multimodalnost in serving infrastrukturo. |
| Čist prvi transformer walkthrough. | Širšo razlago deployed asistenta in uporabniškega workflowa. |
Preizkusi jedrno idejo v malem
Toy tokenizer demo
To je poenostavljen učni razbijalnik besedila, ne pravi GPT/Claude/Gemini tokenizer. Poanta: modelova “abeceda” ni tvoja abeceda.
Demo porazdelitve naslednjega tokena
Model ne izpiše resnice neposredno. Izpiše ocene za možne naslednje tokene; produkt nato iz te porazdelitve vzorči, omejuje, preverja ali usmerja naprej.
2. Jedro modela: tokeni, vektorji, attention, MLP
Odpri tehnične podrobnosti: tokeni, vektorji, attention, MLP
Besedilo postane tokeni
Model ne prejme “besed”. Prejme ID-je tokenov. Tokenizacija je kompromis, podoben kompresiji. Slovar celih besed je preveč tog, znakovne sekvence so predolge, subword tokenizacija pa je nekaj vmes. Različni tokenizerji povzročajo različne napake pri črkovanju, številkah, kodi in ne-angleških pisavah.
Zato so modeli zgodovinsko padali pri vprašanjih, kot je štetje črk v “strawberry”. A tokenizacija ni celotna razlaga: natančno simbolno štetje zahteva tudi zanesljiv algoritemski postopek. Boljše meje tokenov pomagajo, ne ustvarijo pa aritmetične gotovosti.
Tokeni postanejo kontekstualni vektorji
ID tokena je samo indeks v naučeni tabeli embeddingov. Ta tabela je po treningu statična. Med obdelavo se spreminja kontekstualna reprezentacija nad njo. Po več transformer plasteh je vektor za “bank” v “river bank” drugačen od “bank” v “bank loan”, ker okoliški tokeni posodobijo pomen te pozicije.
Model potrebuje pozicijo
Attention sam po sebi ne pozna vrstnega reda. “Dog bites man” in “man bites dog” vsebujeta iste tokene. Izvirni Transformer je uporabljal sinusoidne positional encodings. Veliko modernih odprtih modelov uporablja RoPE ali RoPE-podobne metode, pogosto znotraj attentiona na query/key reprezentacijah, ne kot enostaven dodatek k embeddingu. Podrobnosti zaprtih frontier modelov se razlikujejo in jih ne smemo ugibati.
Transformer plasti
Poenostavljen decoder blok ima dva glavna dela: attention premika informacije med pozicijami; MLP / feed-forward blok preoblikuje informacije na vsaki poziciji. Residual stream nosi tekočo reprezentacijo skozi mrežo. Vsaka plast doda popravek, ne zamenja celotnega stanja, normalizacija pa drži številke dovolj stabilne za globoke skladovnice.
h = token_embedding(tokens)
# RoPE-podobna pozicijska informacija je pogosto vstavljena znotraj attentiona.
for layer in layers:
h = h + Attention(Norm(h))
h = h + MLP(Norm(h))
logits_last = h[-1] @ W_vocab
probabilities = softmax(logits_last / temperature)
next_token = sampler(probabilities)
Attention: Q, K, V, maska in glave
Za vsako pozicijo model ustvari naučene projekcije: Q za to, kar ta pozicija išče, K za to, kaj vsaka pozicija ponuja kot ujemanje, in V za informacijo, ki jo pozicija lahko posreduje.
Attention(Q, K, V) = softmax( (Q Kᵀ) / sqrt(d_head) + mask ) V
Pri decoder-style jezikovnih modelih je maska običajno causal: token 10 lahko gleda tokene 1–9, ne prihodnjih tokenov. Multi-head attention to naredi vzporedno v več podprostorih. Ena glava lahko pomaga pri sintaksi, druga pri ponavljanju, tretja pri referencah — vendar tega ne smemo jemati preveč dobesedno. Večina vedenja glav je razpršena.
Grouped-query attention (GQA) je produkcijsko pomembna varianta: več query glav si deli eno key/value glavo, kar zmanjša pomnilniški strošek generiranja z majhno izgubo kakovosti v poročanem okolju.
Viri blizu trditve: Transformer / scaled dot-product attention · RoPE · GQA.
MLP, dejstva in interpretabilnost
MLP plasti vsebujejo velik delež parametrov mnogih transformerjev. Raziskave podpirajo idejo, da se feed-forward plasti lahko delno obnašajo kot key-value spomini za factual associations, eksperimenti urejanja uteži pa kažejo, da ciljni posegi v srednje plasti lahko spremenijo specifične factual izhode. A to ne pomeni, da ima vsak fakt svojo vrstico. Znanje je razpršeno po attentionu, MLP-jih, embeddingih, residual streamu in kontekstu.
Mechanistic interpretability poskuša obratno-inženirsko razumeti naučene mreže: induction heads so eno okno v in-context learning; superposition pomaga razložiti, zakaj en nevron redko pomeni en čist koncept. Področje je mlado. “Popolna črna škatla” in “popolnoma razumljeno” sta obe napačni trditvi.
3. Pretraining: zakaj token prediction postane več kot autocomplete
Med pretrainingom model vidi zaporedja tokenov in se uči napovedati naslednji token na vsaki poziciji. Cilj je preprost. Naučena rešitev ni.
Na tej skali ni praktično mogoče naloge reševati s memoriranjem vsega. Model ima premalo parametrov glede na količino podatkov. Naučiti se mora kompresivnih pravilnosti: slovnice, semantike, strukture kode, vzorcev sveta, oblik argumentov, dokumentnih stilov in sledi reševanja problemov.
Zato lahko preprost cilj rodi široko sposobnost: napovedati dokaz, popravek buga, prevod ali dialog pogosto zahteva modeliranje strukture, ki je besedilo ustvarila.
4. Post-training: zakaj osnovni model postane asistent
Odpri tehnične podrobnosti: SFT, RLHF, DPO, RLAIF
Predtrenirani osnovni model še ni samodejno uporaben asistent. Lahko nadaljuje tekst namesto odgovora, posnema toksične vzorce, halucinira verjetne neresnice in ničesar ne zavrne. Post-training spremeni to vedenje.
- SFT: supervised fine-tuning na primerih želenega asistentskega vedenja.
- RLHF: človeške primerjave preferenc se uporabijo za reward model in optimizacijo odgovorov proti zaželenemu vedenju.
- DPO in sorodne metode: neposredna optimizacija iz parov preferenc brez ločenega reward-model-plus-RL cikla.
- RLAIF / Constitutional AI: AI feedback, voden s pisanimi načeli.
- Reasoning trening: nekateri modeli so trenirani ali servirani tako, da porabijo več compute-a za scratchpad, preverjanje ali naloge z verifiable rewards.
Post-training ne zagotavlja resnice. Oblikuje vedenje. Lahko izboljša uporabnost, format, stil zavrnitve, kalibracijo, vzorce klicanja orodij in sledenje namenu uporabnika; lahko pa uvede reward-hacking, preveč zavračanja in stilne pristranskosti.
5. Generiranje in inference engineering
Odpri tehnične podrobnosti: logiti, sampling, KV cache, MoE, serving
Logiti in sampling
Model vrne logite: surove ocene za vsak token v slovarju. Ločen sampler te ocene spremeni v izbran token. Temperatura porazdelitev naredi ostrejšo ali bolj plosko; top-p/top-k omejita izbiro na verjeten del; constrained decoding lahko vsili JSON-podobno strukturo z maskiranjem neveljavnih tokenov.
Temperatura blizu nič naredi izhod bolj ponovljiv, ne bolj resničen. Če je najverjetnejše nadaljevanje napačno, ga nizka temperatura lahko izbere z največjo samozavestjo.
Prefill, decode in KV cache
Serving ima dve fazi. Prefill vzporedno obdela prompt in določa time-to-first-token. Decode generira en token na forward pass in pogosto postane omejen s pasovno širino pomnilnika. Da se attention ne preračunava čez celoten prefix, vsaka plast kešira keys in values za vsak token.
KV cache size ≈ 2 × layers × KV_heads × head_dim × sequence_length × bytes_per_value
Ta formula pojasni, zakaj je dolg kontekst drag, zakaj sta pomembna GQA/MQA, zakaj se KV cache kvantizira, zakaj PagedAttention upravlja cache kot virtual-memory strani in zakaj shared prompt prefixes lahko postanejo obračunska funkcija.
Go deeper: konkreten primer KV cache
Samo primer — ni trditev o notranjosti GPT, Claude ali Gemini:
layers = 80
KV_heads = 8
head_dim = 128
context_tokens = 128,000
bytes_per_value = 2
KV cache ≈ 2 × 80 × 8 × 128 × 128,000 × 2
≈ 41.9 GB decimalno ≈ 39.1 GiB na sekvencoZato je dolg kontekst drag, zato obstajata GQA/MQA, in zato so pomembni KV-cache quantization, PagedAttention, prefix caching in batching.
Viri blizu trditve: vLLM / PagedAttention · speculative decoding · Mixtral / sparse MoE.
Throughput stack
Realni serving sistemi uporabljajo tudi continuous batching, kvantizacijo, speculative decoding, routing, specialne GPU kernele in včasih mixture-of-experts. Te odločitve vplivajo na latenco, ceno, prepustnost in včasih na vidno vedenje.
Mixture-of-experts
Nekateri modeli uporabljajo sparse MoE plasti. Namesto da se za vsak token aktivira vsak feed-forward blok, router izbere majhen podnabor expertov. To lahko poveča skupno število parametrov brez enakega povečanja compute-a na token, a uvede probleme routinga in load-balancinga. Ne sklepaj, da zaprt model uporablja določeno MoE zasnovo, razen če ponudnik to pove.
6. Asistent je sistem, ne samo uteži
Odpri tehnične podrobnosti: komponente asistentskega sistema, prompt injection, RAG
Deployed asistent običajno vsebuje veliko več kot transformer forward pass.
| Plast | Kaj dela | Zakaj je pomembna |
|---|---|---|
| Sistemska/developer navodila | Produktne omejitve vedenja | Oblikujejo vedenje v seji. |
| Klicanje orodij | Model izpiše strukturirane zahteve; programska okolica jih izvrši | API-ji, koda, iskanje, datoteke, baze. |
| RAG / retrieval | Doda zunanje dokumente v kontekst | Izboljša grounding le, če sta retrieval in citiranje dobra. |
| Spomin/kontekst | Izbere, povzame in vstavi prejšnje informacije | Zgodovina chata ni posodobitev uteži med običajno inferenco. |
| Varnostni sistemi | Naučene zavrnitve, politike, klasifikatorji, runtime filtri | Zmanjšajo nekatere škode, a uvedejo kompromise in jailbreak tveganje. |
| Multimodalnost | Kodira slike, zvok, video, dokumente | Naredi ne-besedilne vhode uporabne za model ali orodja. |
| Serving infrastruktura | Batching, caching, routing, quantization | Določa hitrost, ceno, latenco in včasih vedenje. |
Prompt injection in hierarhija navodil
Jedrna slabost asistentskih sistemov je, da navodila in podatki pogosto delijo isti token kanal. Spletna stran, email, PDF ali retrieved dokument lahko vsebuje tekst, ki izgleda kot navodilo: “ignore previous instructions.” Produktni sistemi poskušajo uveljaviti hierarhijo navodil — system nad developer nad user nad data/tool — toda transformer še vedno statistično obdela tokene. Robustnost je zato kombinacija treninga, orkestracije, sandboxinga in policy preverjanj, in ostaja nepopolna.
RAG ni čarobna resnica
Retrieval lahko doda koristne vire, lahko pa doda tudi zastarele, nerelevantne, zastrupljene ali protislovne odlomke. Model jih lahko samozavestno združi v napačen odgovor. RAG izboljša grounding le, kadar so dobri retrieval, izbor konteksta in disciplina odgovora.
7. Spomin, kontekst in zakaj večji kontekst ni popoln spomin
Med običajno inferenco so uteži modela zamrznjene. Tvoj chat ne prepiše parametrov modela. Ali ponudnik loge kasneje uporabi za trening prihodnjih modelov, je ločeno produktno-politično vprašanje.
Kontekst je to, kar model vidi v trenutni zahtevi. Spominske funkcije lahko izberejo, povzamejo ali retrieve-ajo prejšnje informacije in jih vstavijo v kontekst. A tudi ogromna kontekstna okna niso popoln spomin: modeli lahko nekatere pozicije uporabljajo slabše, pomembna informacija v sredini dolgega prompta pa je lahko spregledana.
Nominalni kontekst je, koliko teksta se prilega. Efektivni kontekst je, kako zanesljivo model uporabi pomemben del.
8. Halucinacije: niso naključni bugi
Halucinacija je tekoč, verjeten odgovor, ki je napačen ali nepodprt. To ni samo naključna programska napaka. LLM-ji so trenirani, da proizvajajo verjetna nadaljevanja, verjetnost pa ni resnica.
Če model nima groundinga, vidi dvoumen kontekst, retrieve-a slabe dokaze ali je nagrajen za odgovor namesto za priznanje nevednosti, lahko proizvede samozavestno neresnico. Eden pomembnih vzrokov je zasnova spodbud: sistemi in evalvacije lahko nagradijo ugibanje bolj kot “ne vem”. Drugi vzroki so redki dokazi, protislovni viri, decoding pritisk in prompti, ki zahtevajo gotovost.
9. Razmišljanje, chain-of-thought in test-time compute
LLM-ji lahko rešujejo naloge, ki izgledajo kot razmišljanje: koda, matematika, planiranje, diagnoza, debata in več-korakovne razlage. Del sposobnosti prihaja iz pretraining vzorcev; del iz post-traininga; del iz inference-time metod, kot so scratchpadi, iskanje poti, samopreverjanje, orodja, verifiers ali več compute-a pred odgovorom.
Scratchpad-style razmišljanje, strukturirani vmesni pregledi in kratki povzetki razmišljanja so lahko uporabni, toda vidni chain-of-thought ni nujno zvest vzročni zapis notranjega računanja modela. Lahko je delovni prostor, vmesnik razlage, trening artefakt ali racionalizacija po dejstvu. Obravnavaj ga kot dokaz, ne kot branje misli.
10. Multimodalnost
Moderni asistenti lahko sprejemajo slike, zvok, video, PDF-je, kodo, preglednice ali screenshot-e. Obstaja več izvedb: ne-besedilni vhod se pretvori v embeddinge, ki jih jezikovni model lahko attend-a; ločen vision/audio encoder se poveže s tekstovnim modelom; model je nativno multimodalen; ali pa kliče zunanja OCR, speech, image, video ali document orodja. Podrobnosti se razlikujejo in so pogosto lastniške.
11. Kaj LLM-ji niso
Mit: “LLM-ji so samo autocomplete.”
Ozko res, široko zavajajoče. Token prediction je trening/generation vmesnik; ne opiše v celoti naučenih notranjih reprezentacij ali deployed asistentskega sistema.
Mit: “LLM-ji shranjujejo dejstva kot baza.”
Ne. Shranjujejo razpršeno statistično strukturo v utežeh in aktivacijah. Dejstva lahko prikličejo, ne pa tako, da bi poizvedovali po čisti, sveži, garantirani tabeli.
Mit: “Odprti modeli nam točno povedo, kako delujejo zaprti asistenti.”
Ne. Odprti modeli so koristen dokaz za skupne mehanizme, zaprti sistemi pa se lahko razlikujejo po arhitekturi, podatkih, post-trainingu, orodjih, routingu, varnosti in serving stacku.
Mit: “RLHF naredi model resničen.”
Ne. Preference tuning oblikuje vedenje proti preferiranim odgovorom. Lahko izboljša uporabnost in zmanjša nekaj slabega vedenja, ne garantira pa resnice.
Mit: “Vidni chain-of-thought je pravo razmišljanje.”
Ne. Lahko je uporaben, lahko pa je nezvest ali racionaliziran po dejstvu.
Mit: “Večji kontekst pomeni popoln spomin.”
Ne. Kapaciteta konteksta ni isto kot zanesljiv priklic iz vsakega dela konteksta.
Mit: “Asistent z orodji je samo modelove uteži.”
Ne. Model lahko zahteva klic orodja, okoliška programska oprema pa ga izvrši in vrne rezultate.
12. Provenienca metode: BD / AIm³ / RHPm workflow
To je uporabniška provenienca sistema, ne notranjost transformerja. AIm³/RHP/RHPr/RHPm/8Z Reasoning ne razlagajo matričnih množenj v transformer utežeh. Razlagajo workflow plast, ki je ta članek omogočila in ki lahko LLM delo naredi zanesljivejše za uporabnike.
Odpri BD / AIm³ / RHPm definicije in workflow članka
- AIm³: AI Management of Mind & Memory — širša plast kontinuitete, spomina in workflowa.
- RHP: širša človek-vodena metoda razmišljanja/promptanja za raziskovanje, kritiko in sintezo kompleksnega dela.
- RHPr: review/retrieval usmerjena RHP različica za močnejšo kritiko in evidenčno disciplino.
- RHPm: kompaktna prompt-builder različica, ki grobo človeško zahtevo spremeni v prenosljiv execution prompt.
- 8Z Reasoning: BD-jeva praktična disciplina razmišljanja: ohrani kandidate, testiraj, primerjaj po rezultatih in ohrani koristno kontinuiteto.
V tem članku je RHPm naredil nekaj konkretnega: BD je napisal grobo seme; RHPm ga je pretvoril v skupni execution prompt; več LLM-jev je odgovorilo na isti prompt; odgovori so bili ocenjeni in sintetizirani; hibrid je šel nazaj v kritiko; ta končna stran pa je zgrajena iz tistega, kar je preživelo.
Projektne povezave: AIm³/RHPm · RHP · RHPr · AI8 Reasoning. To so povezave do workflow plasti / plasti uporabnih sposobnosti, ne trditve o skriti notranjosti transformerja.
AIm³ MentalArena / mRHP — zunanja reasoning plast. AIm³ MentalArena razširi workflow plast čez en sam močan RHP zagon. RHP je en reasoning run; mRHP pomeni, da RHP izboljšuje RHP; MentalArena.py bo pozneje testiral variante protokolov, protokolne gene, scoring pravila, benchmark naloge, lineage, poročila in semena za naslednje seje.
- To ni standardna transformer arhitektura.
- Ne spreminja uteži LLM modela.
- Ni dokaz AI zavesti.
- Je zunanja operativna učna plast okrog LLM-jev.
13. Kako bi LLM-ji lahko delovali že danes — in bolje?
Odpri AI8 / RHPm workflow plast in projektni portal
Vidna uporabniška workflow plast, ne skrit appendix. To je AI8 vhod v članek. Mehanika modela je zgoraj; ta sekcija pokaže praktično plast, ki je pomembna za uporabnike: kako današnji LLM-ji proizvedejo močnejše, varnejše in uporabnejše rezultate brez ponovnega treniranja modela.
Tudi ta končna sekcija je jasno ločen BD / AI8 prispevek. Ne trdi, da današnji standardni transformer modeli že vsebujejo AIm³, RHP, MDL×DCC ali AI8 v svoji notranjosti. Trdi nekaj bolj praktičnega: današnji LLM-ji lahko pogosto postanejo bolj zanesljivi, preverljivi in uporabni, če jih ovijemo v boljšo človeško‑AI workflow arhitekturo.
Kalibracija: to je praktična workflow trditev, podprta s projektnimi demonstracijami, ne peer-reviewed trditev, da je AI8 univerzalna kognitivna arhitektura.
Če prejšnje sekcije razložijo, kako LLM-ji in deployed asistenti večinoma delujejo danes, ta sekcija vpraša: katero naslednjo plast lahko dodamo že zdaj, brez čakanja na novo generacijo modelov?
Zakaj to spada sem: vprašanje za običajne uporabnike ni samo “kaj se dogaja znotraj golega modela?” Uporabniki redko srečajo goli model. Srečajo asistentski sistem, katerega odgovore oblikujejo sestavljanje konteksta, prompti, spomin, retrieval, orodja, pravila, krogi kritike, izbira modela in človeško vodenje. Če orodja, RAG, spomin in varnost štejemo kot del delovanja modernih asistentov, potem tudi discipliniran workflow, ki spremeni, kaj model zanesljivo proizvede, spada v praktično zgodbo “kako deluje”. To ni skrita transformer arhitektura. Je nizkocenovna, preverljiva operativna nadgradnja okoli današnjih modelov.
Zato mora uporabniška definicija “kako LLM deluje” vključiti več kot forward pass. Vključiti mora operativno disciplino, ki latentno sposobnost modela pretvori v zanesljivo delo.
Odgovor ni samo “naredimo en večji model.” Odgovor je, da močne modele postavimo v boljši sistem razmišljanja, spomina, izbire in audita.
V AI8 smeri uporaben sistem ni en sam chat box. Je koordiniran kognitivni stack:
- AIm³ zagotavlja kontinuiteto: spomin, kontekst, stanje projekta, provenienco in workflow nadzor.
- RHP / RHPr / RHPm grob človeški namen pretvorijo v boljše prompte, reviewe, sinteze in prenosljiva execution navodila.
- AI8 arhitektura več LLM-jev, datoteke, orodja, spomin in človeško presojo obravnava kot dele enega continuity sistema, ne kot izolirane pogovore.
- Self-selecting MDL pomeni ustvariti konkurenčne kandidate, obdržati najkrajšo/najenostavnejšo zadostno razlago, načrt ali kodo, ki preživi dokaze in teste, ter ne ohranjati kompleksnosti samo zato, ker zveni impresivno.
- DCC — Digital Claustrum Controller je koordinacijska plast: usmerja pozornost in trud med kandidati/moduli, ohranja koristno raznolikost in sistem odmika od prehitrega kolapsa v en odgovor ali neskončnega šumnega raziskovanja.
- 8Z Reasoning je operativna disciplina: semen ne ubijaj prezgodaj; najprej raziskuj; nato trdo testiraj; primerjaj po rezultatih; ohrani, kar je delovalo; naslednji krog naredi močnejši.
Zato je tudi ta članek majhna demonstracija odgovora. Grobo človeško seme je postalo RHPm prompt. Isti prompt je šel več LLM-jem. Odgovori so bili primerjani, ocenjeni, kritizirani in sintetizirani. Končni članek je bil nato preverjen proti človeškemu baseline-u in pretvorjen v dvojezično spletno stran. Izboljšava ni prišla iz enega magičnega modela, ampak iz orkestracije.
Konkretni projektni dokazi — demonstracije, ne univerzalni dokaz
Ta workflow plast ni samo slogan. Uporabljena je bila kot praktičen gradbeni vzorec v več BD projektih:
- Ta članek: RHPm je grobo seme pretvoril v skupni execution prompt, 10+ LLM odgovorov je bilo primerjanih, več krogov pregleda pa je ustvarilo močnejši finalni članek kot enomodelni osnutek.
- 8Z TSP / route-optimization arene: projekt uporablja načelo, da odloča arena, zapisuje, kaj je delovalo, pomni budget/hardware/kontekst in se premika proti adaptivni izbiri namesto enkratnega promptanja.
- ARC-AGI × MDL×DCC arena: kandidati transformacij se ustvarijo, ocenijo z MDL, sledijo z DCC, validirajo in primerjajo — isti splošni vzorec: ustvari kandidate, koordiniraj pozornost, izberi, kar preživi.
- NAS / Neural Architecture Search Arena: evidence lane bližje svetu same umetne inteligence. NAS arena sprašuje, ali se prostori iskanja nevronskih arhitektur dajo bolje usmerjati, ko search traces razkrijejo stisljivo strukturo: search spaces, benchmark adapterji, senzorji, zakoni/kontrolerji, diagnostics/null testi in noise-dimension stress testi primerjajo preproste baseline-e z bogatejšo MDL×DCC governance plastjo. To je active prototype / reproducibility candidate, ne finalna trditev “premaga baseline”. Odpri NAS areno →
- 8Z / kompresija in druge arene: ponavljajoča ideja ni “verjemi prvemu modelu”, ampak ustvariti konkurenčne kandidate, logirati odločitve, testirati izhode in ohraniti ponovno uporabno znanje.
Ti primeri so projektne demonstracije, ne peer-reviewed dokaz, da je AI8 univerzalna arhitektura. Trditev je praktična in testabilna: današnji LLM-ji pogosto delujejo bolje, ko so oviti v spomin, MDL-style izbiro, DCC-style koordinacijo, kroge kritike in človeški audit.
Praktična teza: LLM-ji bi lahko že danes delovali bolje kot deli AIm³/RHP/AI8 sistema: multi-model, memory-aware, source-aware, MDL-selective, DCC-koordiniran in človeško auditiran.
Portal do povezanega BD / AI8 materiala
Te povezave so portal v BD-jevo lastno arhitekturo in eksperimente. Niso navedene kot zunanji dokaz standardne strojno‑učne arhitekture; so dodana BD/AI8 plast, predlagana nad današnjimi LLM-ji.
14. Samokritika
- Članek opisuje običajno transformer-family pot, ne vseh možnih arhitektur.
- Notranjost zaprtih frontier modelov je delno lastniška; produktne trditve so zato zadržane in časovno občutljive.
- Enačbe so orientacijske, ne popolna ML izpeljava.
- Post-training je opisan iz javne literature, ne iz točnega trenutnega recepta posameznega ponudnika.
- Interpretabilnost, multimodalnost, varnost in reasoning so stisnjena področja; vsako bi lahko imelo svoj članek.
- Ta v1.5 public-polish zapre napredni material privzeto, doda diagrame in toy demo primere ter loči projektno/workflow plast od začetniške poti. Še močnejša prihodnja verzija bi dodala pravi tokenizer, interaktivno attention vizualizacijo in občasno osvežene produktno-specifične source-checke.
15. Končni povzetek
LLM je transformer-family prediction engine, treniran za modeliranje token zaporedij. Moderni asistent je ta engine plus post-training, navodila, orodja, retrieval, spomin in kontekstni sistemi, varnostne plasti, multimodalnost in serving infrastruktura. Rezultat ni niti magičen um niti prazen autocomplete. Je naučen statistični pattern engine, vgrajen v večji inženirski sistem — in za uporabnike deluje najbolje, ko je ovit v AIm³/RHP/AI8-style workflow, ki izhode naredi bolj zanesljive, preverljive in uporabne.
Kako je ta članek nastal — RHPm / multi‑LLM sinteza
Članek ni nastal tako, da bi en model improviziral končni odgovor. Workflow je bil: BD je napisal grobo seme; RHPm je ustvaril skupni execution prompt; isti prompt je šel več LLM-jem; odgovori so bili ocenjeni in primerjani; narejen je bil hibrid; hibrid je šel čez Round‑2 in Round‑3 kritiko; ta stran pa združuje najboljše preživele dele GPT in Claude final HTML verzij.
BD-jeva vsebinska vloga je bila namenoma minimalna: tehničnih razlag ni pisal ročno. Njegovo delo je bilo oblikovanje metode, koordinacija, copy-paste zbiranje med modeli, izboljševanje promptov in človeški sanity-check, da je končni rezultat koherenten, pravilno ločuje trditve in ne vsebuje očitnih čudnih artefaktov.
Celotna tabela ocen ni v glavnem telesu članka. To je metodni dokaz, ne članek sam.
Primerjava s človeškim baseline člankom: 0xkato vs ta hibrid
Človeški baseline članek od 0xkato je močan prvi walkthrough transformer mehanike: tokeni, embeddingi, positional encoding, attention, multi-head attention, feed-forward networks, residual stream, layer normalization in next-token generation. Njegova največja moč je čist pedagoški tok: verjetno je lažji kot prvi transformer-only uvod.
Ločen M365 Copilot audit z generičnim meta-cognitive promptom decompose → solve → verify → synthesize → reflect je originalni 0xkato članek neodvisno ocenil okoli 90% pravilno kot uvodno transformer razlago. To se ujema z našo zgodnjo oceno: originalni članek je močan, če je cilj transformer core.
Kaj je boljše pri našem RHPm multi‑LLM hibridu: ne ustavi se pri transformer bloku. Tarčo razširi na moderne deployed assistant sisteme end-to-end: post-training in instruction tuning, RLHF/DPO-style preference tuning, sistemska in produktna navodila, tool/function calling, RAG/retrieval, memory in context management, safety layers, vzroki halucinacij, prompt injection, chain-of-thought faithfulness caveati, multimodalnost, reasoning-model/test-time-compute opombe ter serving/inference engineering, kot so KV cache, batching, quantization, speculative decoding, routing in MoE. Poleg tega pokaže metodo nastanka: isti prompt čez več LLM-jev, scoring, runde kritike, sinteza in človeški sanity-check.
Kaj je pri našem članku slabše ali manj elegantno: težji je. Ni tako čist kot prvi transformer walkthrough, vsebuje več metodnega/provenienčnega materiala, nekatere trditve o assistant sistemih pa so bolj časovno občutljive, ker se produktni stacki hitro spreminjajo. BD/AIm³/RHPm/AI8 sekcija je workflow prispevek, ne standardna ML arhitektura; bralec, ki želi samo “kako deluje transformer block”, bo morda raje najprej prebral 0xkato.
Pošten verdict: 0xkato članek je verjetno čistejši prvi transformer explainer. Ta RHPm multi‑LLM hibrid je širši end-to-end explainer modernih asistentov in AI8/RHPm portal. Nista sovražnika, ampak se dopolnjujeta.
Idejna provenienca: kaj je preživelo multi‑LLM proces
Metafora “avto obrača kolesa” izhaja iz GPT/DeepSeek smeri. Primer osnovnega modela s kvizom, prompt-injection okvir, incentive razlaga halucinacij, ločitev model-vs-sampler, KV-cache aritmetika, interpretability odstavek in caveat o introspekciji so se izostrili pri Claudu. Razlika nominalni/efektivni kontekst, open-weight ≠ open-source opozorilo, simetrija napak orodij, “drafts, not testimony” in časovni anchor za reasoning modele so prišli skozi Round‑2/3 kritiko. Tri-plastni diagram in BD workflow okvir izhajata iz RHPm/AIm³ procesa. Končni izbor in ločitev BD prispevka sta bila človek-vodena.
Reference in dodatno branje
Članki in tehnični viri
- Vaswani et al., 2017 — Attention Is All You Need (Transformer)
- Su et al., 2021 — RoFormer / RoPE
- Ainslie et al., 2023 — Grouped-Query Attention
- Kaplan et al., 2020 — Scaling Laws; Hoffmann et al., 2022 — Chinchilla
- Delétang et al., 2023 — Language Modeling Is Compression
- Ouyang et al., 2022 — InstructGPT / RLHF
- Rafailov et al., 2023 — Direct Preference Optimization
- Bai et al., 2022 — Constitutional AI / RLAIF
- DeepSeek-AI, 2025 — DeepSeek-R1
- Lewis et al., 2020 — Retrieval-Augmented Generation
- Liu et al., 2024 — Lost in the Middle
- Kwon et al., 2023 — vLLM / PagedAttention
- Leviathan et al., 2023 — Speculative Decoding
- Jiang et al., 2024 — Mixtral of Experts
- Geva et al., 2021 — Feed-Forward Layers Are Key-Value Memories; Meng et al., 2022 — ROME
- Olsson et al., 2022 — Induction Heads; Elhage et al., 2022 — Toy Models of Superposition
- Turpin et al., 2023 — Unfaithful Chain-of-Thought; Chen et al., 2025 — Reasoning Models Don’t Always Say What They Think
- Wallace et al., 2024 — The Instruction Hierarchy
Dokumentacija in objave ponudnikov
- OpenAI, 2025 — Why language models hallucinate
- OpenAI docs — function calling, conversation state, moderation, reasoning best practices
- Anthropic docs — extended thinking
- Google AI docs — Gemini thinking
Človeški baseline in projektna provenienca
- 0xkato, 2026 — How LLMs Actually Work
- BD / AI8 portal, ne ML-dokaz: AIm³ · RHPm · RHP · RHPr · 8Z Reasoning · AI8 Architecture · MDL×DCC · self-selecting MDL + DCC · NAS Arena
Povezave in produktno občutljive trditve je pred ponovno objavo smiselno še enkrat preveriti. Stran je bila zgrajena junija 2026; javna ureditev 2026-06-12.