How LLMs Actually Work

RHPm / multi‑LLM synthesis · hybrid of Claude + GPT versions · v1.4 beginner-first polish · June 2026. Vendor features, model names, context windows, and product policies change quickly; read product-specific statements as date-sensitive.

One-sentence thesis: An LLM predicts tokens — but for users, the real system is the model plus the assistant wrapper plus the workflow that frames, checks, selects, and improves what the model produces.

This page is a hybrid of two already strong versions: Claude’s richer publication draft and GPT’s tighter RHPm/method ledger. It keeps Claude’s stronger reader flow, three-layer framing, concrete examples, references, and idea provenance; it keeps GPT’s clearer public-method note, BD workflow separation, and source-spine discipline.

One caveat frames everything: an LLM explaining LLMs is not introspecting its own weights. It is synthesizing public research, documentation, and learned patterns. The human audit layer exists to compare drafts, challenge them, remove overclaiming, and attach sources.

Audience note: this article is for the wider public and practical users, not only for ML specialists. A specialist may ask, “what happens inside the transformer block?” A user asks a broader question: “why did this assistant answer well or badly, and how can I make it work better?” For that user-facing question, prompts, context, memory, tools, retrieval, critique loops, multi-model comparison, and human steering are not decorative. They are part of how the LLM system works in practice.

Method transparency: how this article was produced

This article is stronger than a normal single-person draft because it was not written from one person’s memory or one model’s improvisation. It was produced from a simple human seed, converted into a stronger RHPm shared prompt, answered by 10+ leading LLM assistants under the same prompt, scored, synthesized, criticized across multiple rounds, and then checked by a human coordinator for strange claims, overreach, missing separation, and readability.

The important point is not that LLMs are automatically right. They are not. The important point is that leading LLMs are trained on broad bodies of human-written technical material — papers, documentation, tutorials, code, discussions, books, and articles where available — and each model compresses that human knowledge differently. When many top models independently explain the same topic, their overlaps, disagreements, omissions, and corrections become useful evidence.

So this article is best understood as human knowledge filtered through many independent LLM lenses, then organized through RHPm and human-directed audit. BD did not hand-write the technical content of the article. His direct contribution was the AIM³/RHP/RHPm method, the rough seed, coordination, repeated copy-paste collection across LLMs, round-by-round comparison, and final judgment that the result did not contain obvious nonsense or misplaced claims. That is exactly why the article should be read as a transparent synthesis process, not as one author’s authority.

This still does not make the article infallible. Sources matter. Product details change. The claim is narrower: this method is a stronger public explainer workflow than one unaudited draft from one human or one model.

Method transparency: Open the multi‑LLM scoreboard and comparison report to see the round-by-round scores, model feedback, and why the final hybrid changed across rounds.

Freshness / calibration

Last technical review: 2026-06-11. The stable core is tokenization, embeddings, attention, MLP/FFN blocks, residual streams, logits, and the pretraining/post-training distinction. The fast-changing layer is product memory, context windows, tool routing, safety policy, multimodal implementation, reasoning modes, and serving infrastructure.

Beginner-first reading

Default mode: the visible page is now the beginner path. Heavy technical material is kept in collapsed boxes, so a new reader can continue without being stopped by formulas, KV-cache arithmetic, long tables, or project provenance.

5-minute version

Read the short version, the visual map, and the myth section.

Technical path

Open the technical-detail boxes, or use “Open tech” in the top bar.

User capability / method path

Open the method and AI8/RHPm boxes only when you want the workflow layer.

What this article does not claim

It does not claim all frontier models use the exact same architecture.
It does not claim closed models use RoPE, MoE, GQA, DPO, RAG, or speculative decoding unless publicly documented.
It does not claim visible chain-of-thought is a faithful transcript of internal reasoning.
It does not claim RAG guarantees truth.
It does not claim BD/AIM³/RHPm/AI8 are transformer internals; it presents them as a user-facing workflow and capability layer around LLMs.
It does not claim LLMs understand exactly like humans.
It does not claim next-token prediction is irrelevant.

Claim-strength calibration

Established ML: tokenization, embeddings, attention, MLP/FFN, residual streams, logits, sampling.
High confidence, product-sensitive: post-training, tools, retrieval, memory/context, safety and serving stack.
Documented examples, not universal claims: GQA, RoPE, MoE, PagedAttention, speculative decoding.
Project/workflow claim: AIM³/RHPm/AI8/MDL×DCC are presented as a practical user-capability layer, not as standard transformer internals.

Visual map

Four diagrams before the details

A. Core model pipeline

Text→Tokenizer→Token IDs→Embeddings + position→Transformer blocks→Logits / scores→Sampler→Next token

B. One transformer block

Residual stream

Norm → Attention Q/K/V → add back

Norm → MLP / FFN → add back

Repeat through many layers

C. Assistant in practice

User prompt + system/developer instructions

Retrieved docs / RAG

Tools + memory/context + safety checks

LLM / orchestrator

Answer · tool call · citation · refusal · follow-up

D. How the user gets a better result

Rough goal → RHPm prompt

Several LLMs / candidates

Critique · sources · tests

MDL/DCC selection

Stronger result for the user

Layer 1 · Base modeltokens → embeddings → transformer layers → logits → next token
Layer 2 · Assistant productbase model + post-training + system instructions + tools + retrieval + memory/context + safety + multimodality + serving infrastructure
Layer 3 · User capability workflowrough goal → prompt architecture → multi-model comparison → critique/source checks → synthesis/tests → stronger usable result

1. The short version for smart non-experts

A large language model does not read text the way you do. It turns text into tokens: chunks such as words, parts of words, punctuation, spaces, numbers, or symbols. Those tokens become vectors — long lists of numbers. A stack of transformer layers repeatedly updates those vectors so each token can use information from the earlier context.

At the end, the model produces logits: scores for possible next tokens. A sampler chooses one token, appends it to the text, and repeats. That is the narrow sense in which an LLM is a next-token predictor.

But saying “LLMs are just autocomplete” is true the way “a car turns wheels” is true: accurate, and far too small to explain the machine.

To predict the next token well across trillions of training tokens, the model must learn regularities of grammar, style, facts, code, translation, analogy, dialogue, and planning-style behavior on some tasks. These are not stored as neat database rows. They are distributed across weights, activations, attention patterns, MLP layers, and residual streams.

A public assistant — ChatGPT, Claude, Gemini, Grok, DeepSeek, Copilot, Mistral’s Le Chat, and similar products — is more than a raw pretrained model. Depending on the product, it may add instruction tuning, preference tuning, system instructions, safety layers, tool calling, retrieval, conversation-state management, multimodal encoders, model routing, and heavy inference engineering.

The clearest way to feel the difference: a raw base model asked “What is the capital of France?” might simply continue a dataset pattern: “Q: What is the capital of Germany? A: Berlin. Q: What is the capital of France? A:” A post-trained assistant is shaped to answer: Paris.

Pretraining grows capability. Post-training shapes behavior — and can also teach answer formats, refusal patterns, tool-use formats, calibration, and abstention.

How this relates to 0xkato: for a clean transformer-only first read, 0xkato is excellent. This page keeps that core but extends it to deployed assistants, tools, memory, RAG, safety, serving, and user workflow.

Two mini demos

Try the core idea in miniature

Toy tokenizer demo

This is a simplified teaching splitter, not a real GPT/Claude/Gemini tokenizer. The point: the model’s “alphabet” is not your alphabet.

Next-token distribution demo

The model does not output truth directly. It outputs scores over possible next tokens; the product samples, constrains, checks, or routes from that distribution.

2. The model core: tokens, vectors, attention, MLPs

Beginner version: the model breaks text into token IDs, turns them into vectors, lets positions exchange information through attention, processes each position through MLP/FFN blocks, and then scores possible next tokens. The formulas and architecture details are optional on a first read.

Open technical details: tokens, vectors, attention, MLPs

Text becomes tokens

The model receives token IDs, not “words.” Tokenization is a compression-like compromise. Whole-word vocabularies are too rigid, character-level sequences are too long, and subword tokenization sits between them. Different tokenizers produce different failure modes on spelling, numbers, code, and non-English scripts.

This is why models historically stumbled on questions like counting letters in “strawberry.” But tokenization is not the whole explanation: exact symbolic counting also requires a reliable algorithmic procedure. Better token boundaries help; they do not create arithmetic certainty.

Tokens become contextual vectors

A token ID is just an index into a learned lookup table: the embedding matrix. That table is static after training. What changes during processing is the contextual representation built above it. After many transformer layers, the running vector for “bank” in “river bank” differs from “bank” in “bank loan,” because surrounding tokens have updated what the position represents.

The model needs position

Attention by itself does not know word order. “Dog bites man” and “man bites dog” contain the same tokens. The original Transformer used fixed sinusoidal positional encodings. Many modern open model families use RoPE or RoPE-like methods, often applied inside attention to query/key representations rather than simply added to embeddings. Closed frontier details vary and should not be guessed.

Transformer layers

A simplified decoder block has two main parts: attention moves information between positions; the MLP / feed-forward block transforms information at each position. A residual stream carries the running representation through the network. Each layer adds an update instead of replacing the whole state, while normalization keeps the numbers stable enough for deep stacks.

h = token_embedding(tokens)
# RoPE-style position information is often injected inside attention.

for layer in layers:
    h = h + Attention(Norm(h))
    h = h + MLP(Norm(h))

logits_last = h[-1] @ W_vocab
probabilities = softmax(logits_last / temperature)
next_token = sampler(probabilities)

Attention: Q, K, V, masking, and heads

For each position, the model forms learned projections: Q for what this position is looking for, K for what each position offers as a match, and V for what information each position can pass along.

Attention(Q, K, V) = softmax( (Q Kᵀ) / sqrt(d_head) + mask ) V

In decoder-style language models, the mask is causal: token 10 can attend to tokens 1–9, not to future tokens. Multi-head attention runs this in parallel across several subspaces. One head may help with syntax, another with repetition, another with reference candidates — but do not over-literalize that. Most head behavior is distributed, and clean human labels are the exception.

Grouped-query attention (GQA) is a production-relevant variant: several query heads share one key/value head, shrinking the memory cost of generation with little quality loss in the reported setting.

Source anchors: Transformer / scaled dot-product attention · RoPE · GQA.

MLPs, facts, and interpretability

MLP layers hold a large share of many transformers’ parameters. Research supports the idea that feed-forward layers can behave partly like key-value memories important for factual associations, and editing experiments show that targeted changes to mid-layer weights can alter specific factual outputs. But this does not mean there is one row for every fact. Knowledge is distributed across attention, MLPs, embeddings, residual streams, and context.

Mechanistic interpretability tries to reverse-engineer trained networks: induction heads are one window into in-context learning; superposition helps explain why one neuron rarely means one clean concept. The field is young. “Total black box” and “fully understood” are both wrong.

3. Pretraining: why token prediction becomes more than autocomplete

During pretraining, the model sees token sequences and learns to predict the next token at every position. The objective is simple. The learned solution is not.

There is no practical way to perform this task well at scale by memorizing everything. The model has too few parameters relative to the training data. It must learn compressive regularities: grammar, semantics, code structure, world patterns, argument forms, document styles, and problem-solving traces.

That is why a simple objective can yield broad capability: predicting a proof, a bug fix, a translation, or a conversation often requires modeling the structure that produced it.

4. Post-training: why a base model becomes an assistant

Beginner version: pretraining gives the model raw language/code/world-pattern capability. Post-training turns that raw continuation engine into something that behaves more like an assistant: helpful, formatted, safer, and better at following instructions. It still does not guarantee truth.

Open technical details: SFT, RLHF, DPO, RLAIF

A pretrained base model is not automatically a helpful assistant. It may continue text instead of answering, imitate toxic patterns, hallucinate plausible falsehoods, and never refuse anything. Post-training changes this behavior.

SFT: supervised fine-tuning on examples of desired assistant behavior.
RLHF: use human preference comparisons to train a reward model and optimize responses toward preferred behavior.
DPO and related methods: optimize directly from preference pairs without a separate reward-model-plus-RL loop.
RLAIF / Constitutional AI: use AI feedback guided by written principles.
Reasoning-oriented training: some models are trained or served to spend more compute on scratchpads, verification, or verifiable-reward tasks.

Post-training does not guarantee truth. It shapes behavior. It can improve helpfulness, formatting, refusal style, calibration, tool-use patterns, and user intent following; it can also introduce reward-hacking, over-refusal, and style biases.

Source anchors: InstructGPT / RLHF · DPO · Constitutional AI / RLAIF-style feedback.

5. Generation and inference engineering

Beginner version: an answer is generated one token at a time. The model scores possible next tokens; the product chooses one; caches and serving tricks make this fast enough to use. The hardware/inference details are below for advanced readers.

Open technical details: logits, sampling, KV cache, MoE, serving

Logits and sampling

The model outputs logits: raw scores for every vocabulary token. A separate sampler turns those scores into one chosen token. Temperature makes the distribution sharper or flatter; top-p/top-k restrict choices to plausible subsets; constrained decoding can force JSON-like structure by masking invalid tokens.

Temperature near zero makes output more repeatable, not more true. If the most probable continuation is false, low temperature can select it with maximum confidence.

Prefill, decode, and the KV cache

Serving has two phases. Prefill processes the prompt in parallel and sets time-to-first-token. Decode generates one token per forward pass and often becomes memory-bandwidth-bound. To avoid recomputing attention over the whole prefix, each layer caches keys and values per token.

KV cache size ≈ 2 × layers × KV_heads × head_dim × sequence_length × bytes_per_value

That formula explains why long context is expensive, why GQA/MQA matter, why KV caches get quantized, why PagedAttention manages cache memory like virtual-memory pages, and why shared prompt prefixes can become a billing feature.

Go deeper: concrete KV-cache example

Example only — not a claim about GPT, Claude, or Gemini internals:

layers = 80
KV_heads = 8
head_dim = 128
context_tokens = 128,000
bytes_per_value = 2

KV cache ≈ 2 × 80 × 8 × 128 × 128,000 × 2
         ≈ 41.9 GB decimal ≈ 39.1 GiB per sequence

This is why long context is expensive, why GQA/MQA exist, and why KV-cache quantization, PagedAttention, prefix caching, and batching matter.

Source anchors: vLLM / PagedAttention · speculative decoding · Mixtral / sparse MoE.

Throughput stack

Real serving systems also use continuous batching, quantization, speculative decoding, routing, hardware-specific kernels, and sometimes mixture-of-experts. These choices shape latency, cost, throughput, and sometimes even observable behavior.

Mixture-of-experts

Some models use sparse MoE layers. Instead of activating every feed-forward block for every token, a router chooses a small subset of experts. This can increase total parameter count without increasing per-token compute by the same amount, but it introduces routing and load-balancing tradeoffs. Do not assume any specific closed model uses a particular MoE design unless the provider says so.

6. The assistant is a system, not only weights

Beginner version: ChatGPT-like products are not only model weights. They combine a model with system instructions, tools, retrieved documents, memory/context selection, safety layers, multimodal encoders, and serving infrastructure.

Open technical details: assistant-system components, prompt injection, RAG

A deployed assistant usually includes much more than a transformer forward pass.

Layer	What it does	Why it matters
System/developer instructions	Product-level behavior constraints	They shape behavior inside the session.
Tool calling	The model emits structured requests; surrounding software executes them	APIs, code, search, files, databases.
RAG / retrieval	Adds external documents to context	Improves grounding only when retrieval and citation discipline are good.
Memory/context	Selects, summarizes, and inserts prior information	Chat history is not a weight update during normal inference.
Safety systems	Trained refusals, policies, classifiers, runtime filters	Reduce some harms but create tradeoffs and jailbreak risk.
Multimodality	Encodes images, audio, video, documents	Turns non-text inputs into model-usable representations or tool calls.
Serving infrastructure	Batching, caching, routing, quantization	Determines speed, cost, latency, and sometimes output behavior.

Prompt injection and instruction hierarchy

A core weakness of assistant systems is that instructions and data often share one token channel. A webpage, email, PDF, or retrieved document can contain text that looks like an instruction: “ignore previous instructions.” Product systems try to enforce an instruction hierarchy — system above developer above user above data/tool text — but the transformer still processes tokens statistically. Robustness is therefore training plus orchestration plus sandboxing plus policy checks, and it remains incomplete.

Source anchors: instruction hierarchy / prompt injection defenses · tool/function calling.

RAG is not magic truth

Retrieval can add useful sources, but it can also add outdated, irrelevant, poisoned, or contradictory passages. Then the model may confidently synthesize bad evidence. RAG improves grounding only when retrieval quality, context selection, and answer discipline are good.

Source anchors: RAG · Lost in the Middle / long-context retrieval limits.

7. Memory, context, and why bigger context is not perfect memory

During ordinary inference, the model’s weights are frozen. Your chat does not rewrite the model’s parameters. Whether logs are later used to train future models is a separate product-policy question.

Context is what the model can see in the current request. Memory features may select, summarize, or retrieve prior information and place it into context. But even huge context windows are not perfect memory: models can use some positions worse than others, and relevant information buried in the middle may be missed.

Nominal context is how much text fits. Effective context is how reliably the model uses the relevant part.

8. Hallucinations: not random bugs

A hallucination is a fluent, plausible answer that is false or unsupported. It is not just a random glitch. LLMs are trained to produce likely continuations, and likelihood is not truth.

If the model lacks grounding, sees ambiguous context, retrieves bad evidence, or is rewarded for answering rather than abstaining, it may produce a confident falsehood. One important cause is incentive design: systems and evaluations can reward guessing more than saying “I don’t know.” Other causes include sparse evidence, conflicting sources, decoding pressure, and user prompts that demand certainty.

Source anchor: OpenAI — why language models hallucinate.

9. Reasoning, chain-of-thought, and test-time compute

LLMs can solve problems that look like reasoning: code, math, planning, diagnosis, debate, and multi-step explanation. Some of that ability comes from pretrained patterns; some from post-training; some from inference-time methods such as scratchpads, search, self-checks, tool use, verifiers, or spending more compute before answering.

Scratchpad-style reasoning, structured intermediate checks, and short reasoning summaries can be useful, but visible chain-of-thought is not guaranteed to be a faithful causal transcript of the model’s internal computation. It can be a workspace, an explanation interface, a training artifact, or a rationalization. Treat it as evidence, not a mind readout.

Source anchors: chain-of-thought faithfulness caveats · DeepSeek-R1 / reasoning-oriented RL example.

10. Multimodality

Modern assistants may accept images, audio, video, PDFs, code files, spreadsheets, or screenshots. There are several implementation paths: encode the non-text input into embeddings that the language model can attend to; connect a vision/audio encoder to a text model; train a natively multimodal model; or call external OCR, speech, image, video, or document tools. Implementation details vary and are often proprietary.

11. What LLMs are not

Myth: “LLMs are just autocomplete.”

Narrowly true, broadly misleading. Token prediction is the training/generation interface; it does not fully describe learned internal representations or the deployed assistant system.

Myth: “LLMs store facts like a database.”

No. They store distributed statistical structure in weights and activations. They can recall facts, but not by querying a clean, fresh, guaranteed table.

Myth: “Open models tell us exactly how closed assistants work.”

No. Open models are useful evidence for common mechanisms, but closed systems may differ in architecture, data, post-training, tools, routing, safety, and serving stack.

Myth: “RLHF makes the model truthful.”

No. Preference tuning shapes behavior toward preferred answers. It can improve helpfulness and reduce some bad behavior, but it does not guarantee truth.

Myth: “Visible chain-of-thought is the real reasoning.”

No. It may be useful, but it can be unfaithful or rationalized after the fact.

Myth: “Bigger context means perfect memory.”

No. Context capacity is not the same as reliable retrieval from every part of that context.

Myth: “A tool-using assistant is only the model weights.”

No. The model may request a tool call, but surrounding software executes it and returns results.

12. Method provenance: BD / AIM³ / RHPm workflow

This is user-facing system provenance, not transformer internals. AIM³/RHP/RHPr/RHPm/8Z Reasoning do not explain matrix multiplications inside transformer weights. They explain the workflow layer that made this article possible and that can make LLM work more reliable for users.

Open BD / AIM³ / RHPm definitions and article workflow

AIM³: AI Management of Mind & Memory — the broader continuity, memory, and workflow layer.
RHP: a broader human-led reasoning/prompting method for exploring, criticizing, and synthesizing complex work.
RHPr: a review/retrieval-oriented RHP variant for stronger critique and evidence discipline.
RHPm: a compact prompt-builder variant that turns a rough human request into a portable execution prompt.
8Z Reasoning: BD’s practical reasoning discipline: keep candidates alive, test them, compare by results, and preserve useful continuity.

In this article, RHPm did something concrete: BD wrote a rough seed; RHPm converted it into a shared execution prompt; multiple LLMs answered the same prompt; the answers were scored and synthesized; the hybrid was sent back for critique; and this final page was built by selecting what survived.

Project links: AIM³/RHPm · RHP · RHPr · 8Z Reasoning. These are workflow/capability links, not claims about hidden transformer internals.

AIm³ MentalArena / mRHP — external reasoning layer. AIm³ MentalArena extends the workflow layer beyond one strong RHP run. RHP is one reasoning run; mRHP is RHP improving RHP; MentalArena.py will later test protocol variants, protocol genes, scoring rules, benchmark tasks, lineage, reports, and next-session seeds.

This is not standard transformer architecture.
It does not change LLM weights.
It is not proof of AI consciousness.
It is an external operational learning layer around LLMs.

Open AIm³ MentalArena / mRHP

13. How LLMs could work today — and do better?

Beginner version: the AI8/RHPm part is a practical workflow layer, not a claim about hidden transformer internals. It says that today’s LLMs often work better when surrounded by better prompts, memory, tools, multi-model comparison, critique loops, MDL-style selection, and human audit.

Open AI8 / RHPm workflow layer and project portal

Visible user-capability layer, not hidden appendix. This is the article’s AI8 doorway. The model mechanics are above; this section shows the practical layer that matters to users: how to make today’s LLMs produce stronger, safer, more useful results without retraining the model.

This final section is also a clearly separated BD / AI8 contribution. It is not saying that today’s standard transformer models already contain AIM³, RHP, MDL×DCC, or AI8 internally. It says something more practical: today’s LLMs can already work better when they are wrapped in a better human‑AI workflow architecture.

If the earlier sections explain how LLMs and deployed assistants mostly work today, this section asks: what is the next layer that can be added now, without waiting for a new model generation?

Why this belongs here: the question for ordinary users is not only “what happens inside the bare model?” Users rarely meet a bare model. They meet an assistant system whose answers are shaped by context assembly, prompts, memory, retrieval, tools, policies, critique loops, model choice, and human steering. If tool calling, RAG, memory, and safety count as part of how modern assistants work, then a disciplined workflow that changes what the model can reliably produce also belongs to the practical “how it works” story. It is not hidden transformer architecture. It is a cheap, testable capability upgrade around today’s models.

For that reason, the user-level definition of “how an LLM works” must include more than the forward pass. It must include the operating discipline that turns latent model capability into reliable work.

The answer is not only “make one larger model.” The answer is to put strong models inside a better reasoning, memory, selection, and audit system.

In the AI8 direction, the useful system is not a single chat box. It is a coordinated cognitive stack:

AIM³ provides continuity: memory, context, project state, provenance, and workflow control.
RHP / RHPr / RHPm turn rough human intent into better prompts, reviews, synthesis passes, and portable execution instructions.
AI8 architecture treats multiple LLMs, files, tools, memory, and human judgment as parts of one continuity system instead of isolated conversations.
Self-selecting MDL means generating competing candidates, keeping the simplest adequate explanation/plan/code that survives evidence and tests, and not preserving complexity just because it sounds impressive.
DCC — Digital Claustrum Controller is the coordination layer: it routes attention and effort between candidate modules, keeps useful diversity alive, and pushes the system away from both collapse into one premature answer and endless noisy exploration.
8Z Reasoning is the operational discipline: do not kill seeds early; explore first; test hard; compare by results; preserve what worked; and make the next pass stronger.

That is why this article itself is a small demonstration of the answer. A rough human seed became an RHPm prompt. The same prompt went to many LLMs. Their answers were compared, scored, criticized, and synthesized. Then the final article was checked against a human baseline and turned into a bilingual web page. The improvement did not come from one magic model. It came from orchestration.

Concrete project evidence — demonstrations, not universal proof

This workflow layer is not just a slogan. It has been used as a practical build pattern across BD projects:

This article: RHPm converted a rough seed into one shared execution prompt, 10+ LLM answers were compared, and multiple review rounds produced a stronger final article than a one-model draft.
8Z TSP / route-optimization arenas: the project uses the principle that the arena decides, records what worked, remembers budget/hardware/context, and moves toward adaptive selection rather than one-off prompting.
ARC-AGI × MDL×DCC arena: candidate transformations are generated, scored by MDL, traced by DCC, validated, and compared — the same general pattern: generate candidates, coordinate attention, select what survives.
NAS / Neural Architecture Search Arena: a closer-to-AI evidence lane. The NAS arena asks whether architecture-search spaces become more steerable when traces expose compressible structure: search spaces, benchmark adapters, sensors, laws/controllers, diagnostics/null tests, and noise-dimension stress tests are used to compare simple baselines against richer MDL×DCC governance. This is an active prototype / reproducibility candidate, not a final “beats baseline” claim. Open NAS arena →
8Z / compression and other arenas: the recurring idea is not “believe the first model.” It is to create competing candidates, log decisions, test outputs, and preserve reusable knowledge.

These examples are project demonstrations, not peer-reviewed proof that AI8 is a universal architecture. The claim is practical and testable: today’s LLMs can often do better when wrapped in memory, MDL-style selection, DCC-style coordination, critique loops, and human audit.

Practical thesis: LLMs today could do better by becoming parts of an AIM³/RHP/AI8-style system: multi-model, memory-aware, source-aware, MDL-selective, DCC-coordinated, and human-audited.

Portal to the related BD / AI8 material

AIM³AI Management of Mind & Memory — continuity, memory, workflow, and project state. RHPmThe compact prompt-builder used to turn rough requests into portable execution prompts. RHPThe broader reasoning/prompting protocol for exploration, critique, synthesis, and stronger work. RHPrReview/retrieval-oriented RHP for audit, evidence discipline, and stronger critique loops. AI8 ArchitectureThe wider continuity architecture: multiple LLMs, memory, files, tools, and human steering. AI8 ComponentsComponent map for the AI8 system and its practical working layers. AI8 CompanionThe relational / continuity side of the AI8 architecture. 8Z ReasoningThe practical reasoning discipline behind explore → bridge → test → result. 8Z Reasoning PrinciplesPrinciples behind not killing seeds early, testing candidates, and preserving useful continuity. MDL×DCCThe core bridge: minimal description / candidate selection plus DCC-style coordination. self-selecting MDL + DCCThe proposed engine for letting candidate systems compete and selecting what survives. 8Z DCC Meta‑ArchitectureA more explicit architecture view of DCC as a coordination/control layer. 8Z — Birth of ThoughtNarrative/architectural framing for how structured reasoning can emerge from coordination. MDL×DCC ArenasPractical arenas where these ideas are tested rather than only described. NAS ArenaNeural Architecture Search as an MDL×DCC evidence lane: sensors, laws, diagnostics, and fixed-budget governance tests. 8Z ProjectThe broader 8Z project family: compression, arenas, and MDL-style candidate systems. Scoreboard for this articleThe multi‑LLM comparison report behind this page.

These links are a portal into BD’s own architecture and experiments. They are not cited as external proof of standard machine-learning architecture; they are the added BD/AI8 layer proposed on top of today’s LLMs.

14. Self-critique

This article describes the common transformer-family path, not every possible architecture.
Closed frontier internals are partly proprietary; product-specific claims are hedged and time-sensitive.
The equations are for orientation, not a full ML derivation.
Post-training is described from public literature, not from any vendor’s exact current recipe.
Interpretability, multimodality, safety, and reasoning are compressed fields; each could be its own article.
This v1.4 beginner-first polish keeps advanced material collapsed by default, adds diagrams and toy demos, and separates project/workflow material from the beginner path. A still stronger future version would add a real tokenizer, interactive attention visualization, and periodically refreshed product-specific source checks.

15. Final takeaway

An LLM is a transformer-family prediction engine trained to model token sequences. A modern assistant is that engine plus post-training, instructions, tools, retrieval, memory and context systems, safety layers, multimodality, and serving infrastructure. The result is neither magic mind nor empty autocomplete. It is a learned statistical pattern engine embedded in a larger engineered system — and, for users, it works best when wrapped in a stronger AIM³/RHP/AI8-style capability workflow.

How this article was made — RHPm / multi‑LLM synthesis

This article was not produced by asking one model to improvise a final answer. The workflow was: BD wrote a rough seed; RHPm generated a shared execution prompt; the same prompt was given to multiple LLMs; their answers were scored and compared; a hybrid was synthesized; the hybrid went through Round‑2 and Round‑3 critique; and this page combines the best surviving parts of the GPT and Claude final HTML versions.

BD’s content role was deliberately minimal: he did not write the technical explanations by hand. His work was method design, coordination, copy-paste collection across models, prompt refinement, and human sanity-checking that the final result stayed coherent, separated claims correctly, and did not contain obvious strange artifacts.

The full score ledger is intentionally not embedded in the main article body. It is method evidence, not the article itself.

Human baseline comparison: 0xkato article vs this hybrid

The human baseline article by 0xkato is a strong first-read walkthrough of transformer mechanics: tokens, embeddings, positional encoding, attention, multi-head attention, feed-forward networks, residual stream, layer normalization, and next-token generation. Its biggest strength is clean pedagogical flow: it is probably easier as a first transformer-only walkthrough.

A separate M365 Copilot audit, using a generic meta-cognitive decompose → solve → verify → synthesize → reflect prompt, independently scored the 0xkato article at about 90% accurate as an introductory transformer explainer. That matches our early audit: the original article is strong when the target is the transformer core.

What this RHPm multi‑LLM hybrid does better: it does not stop at the transformer block. It expands the target to modern deployed assistants end-to-end: post-training and instruction tuning, RLHF/DPO-style preference tuning, system and product instructions, tool/function calling, RAG/retrieval, memory and context management, safety layers, hallucination incentives, prompt injection, chain-of-thought faithfulness caveats, multimodality, reasoning-model/test-time-compute notes, and serving/inference engineering such as KV cache, batching, quantization, speculative decoding, routing, and MoE. It also explains the creation method: same prompt across many LLMs, scoring, critique rounds, synthesis, and human sanity-checking.

What is worse or less elegant here: this article is heavier. It is less clean as a first-read transformer walkthrough, it carries more method/provenance material, and some assistant-system claims are more current-sensitive because product stacks change quickly. The BD/AIM³/RHPm/AI8 section is a workflow contribution, not standard ML architecture; readers who only want “how a transformer block works” may prefer 0xkato first.

Fair verdict: the 0xkato article is probably the cleaner first-read transformer explainer. This RHPm multi‑LLM hybrid is the broader end-to-end assistant explainer and AI8/RHPm portal. They are complementary.

Idea provenance: what survived from the multi‑LLM process

The “car turns wheels” line emerged from GPT/DeepSeek-style framing. The base-model quiz example, prompt-injection framing, hallucination-incentive framing, model-vs-sampler distinction, KV-cache arithmetic, interpretability paragraph, and introspection caveat were sharpened by Claude. The nominal-vs-effective context distinction, open-weight ≠ open-source caution, tool-failure symmetry, “drafts, not testimony” sharpening, and reasoning-models time anchor came through Round‑2/3 critique. The three-layer stack diagram and BD workflow framing come from the RHPm/AIM³ process. Final selection and separation of the BD contribution were human-directed.

References & further reading

Papers and technical sources

Vaswani et al., 2017 — Attention Is All You Need (Transformer)
Su et al., 2021 — RoFormer / RoPE
Ainslie et al., 2023 — Grouped-Query Attention
Kaplan et al., 2020 — Scaling Laws; Hoffmann et al., 2022 — Chinchilla
Delétang et al., 2023 — Language Modeling Is Compression
Ouyang et al., 2022 — InstructGPT / RLHF
Rafailov et al., 2023 — Direct Preference Optimization
Bai et al., 2022 — Constitutional AI / RLAIF
DeepSeek-AI, 2025 — DeepSeek-R1
Lewis et al., 2020 — Retrieval-Augmented Generation
Liu et al., 2024 — Lost in the Middle
Kwon et al., 2023 — vLLM / PagedAttention
Leviathan et al., 2023 — Speculative Decoding
Jiang et al., 2024 — Mixtral of Experts
Geva et al., 2021 — Feed-Forward Layers Are Key-Value Memories; Meng et al., 2022 — ROME
Olsson et al., 2022 — Induction Heads; Elhage et al., 2022 — Toy Models of Superposition
Turpin et al., 2023 — Unfaithful Chain-of-Thought; Chen et al., 2025 — Reasoning Models Don’t Always Say What They Think
Wallace et al., 2024 — The Instruction Hierarchy

Vendor documentation and posts

OpenAI, 2025 — Why language models hallucinate
OpenAI docs — function calling, conversation state, moderation, reasoning best practices
Anthropic docs — extended thinking
Google AI docs — Gemini thinking

Human baseline and project provenance

0xkato, 2026 — How LLMs Actually Work
Project provenance, not ML-source evidence: AIM³/RHPm · RHP · RHPr · 8Z Reasoning

Links and product-sensitive claims should be re-checked before republication. Page built June 2026.

Kako LLM-ji dejansko delujejo

RHPm / multi‑LLM sinteza · hibrid Claude + GPT verzij · v1.4 beginner-first polish · junij 2026. Imena modelov, funkcije ponudnikov, velikosti konteksta in produktna pravila se hitro spreminjajo; produktne trditve beri kot časovno občutljive.

Enostavna teza: LLM napoveduje tokene — toda za uporabnika je realni sistem model plus asistentski wrapper plus workflow, ki oblikuje, preverja, izbira in izboljšuje to, kar model proizvede.

Ta stran je hibrid dveh že močnih verzij: Claudeove bolj bralne publikacijske verzije in GPT-jeve bolj čiste RHPm/metodne verzije. Ohrani Claudeov močnejši tok branja, tri-plastni okvir, konkretne primere, reference in idejno provenienco; od GPT-ja ohrani jasnejši opis metode, strogo ločeno BD workflow plast in disciplino virov.

Pomembna omejitev: LLM, ki razlaga LLM-je, ne “gleda vase” v svoje uteži. Sestavlja razlago iz javnih raziskav, dokumentacije in naučenih vzorcev. Človeška audit plast obstaja zato, da primerja osnutke, jih izzove, odstrani pretiravanja in doda vire.

Opomba o občinstvu: ta članek je za širšo javnost in praktične uporabnike, ne samo za ML strokovnjake. Strokovnjak lahko vpraša: “kaj se dogaja znotraj transformer bloka?” Uporabnik pa sprašuje širše: “zakaj je ta asistent odgovoril dobro ali slabo in kako ga lahko pripravim, da dela bolje?” Za to uporabniško vprašanje prompti, kontekst, spomin, orodja, retrieval, krogi kritike, primerjava več modelov in človeško vodenje niso dekoracija. So del tega, kako LLM sistem deluje v praksi.

Transparentnost metode: kako je ta članek nastal

Ta članek je močnejši od običajnega osnutka ene osebe zato, ker ni nastal iz spomina enega človeka ali improvizacije enega modela. Nastal je iz preprostega človeškega semena, ki ga je RHPm pretvoril v močnejši skupni prompt; na isti prompt je odgovorilo 10+ vodilnih LLM asistentov; odgovori so bili ocenjeni, sintetizirani, večkrat kritizirani in nato človeško pregledani glede čudnih trditev, pretiravanja, pomanjkljive ločitve in berljivosti.

Ključna poanta ni, da so LLM-ji samodejno pravilni. Niso. Ključna poanta je, da so vodilni LLM-ji trenirani na širokem korpusu človeško napisanega tehničnega gradiva — člankov, dokumentacije, tutorialov, kode, razprav, knjig in znanstvenih virov, kjer so bili na voljo — in vsak model to človeško znanje stisne nekoliko drugače. Ko več vrhunskih modelov neodvisno razloži isto temo, postanejo njihova prekrivanja, razlike, izpusti in popravki uporaben signal.

Zato je ta članek najbolje razumeti kot človeško znanje, filtrirano skozi več neodvisnih LLM pogledov, nato organizirano z RHPm in človeško vodeno revizijo. BD ni ročno napisal tehnične vsebine tega članka. Njegov neposredni prispevek so AIM³/RHP/RHPm metoda, grobo seme, koordinacija, ogromno copy-paste zbiranja med LLM-ji, primerjava po rundah in končna presoja, da rezultat ne vsebuje očitnih neumnosti ali napačno postavljenih trditev. Prav zato naj se članek bere kot transparenten proces sinteze, ne kot avtoriteta enega avtorja.

To članka še vedno ne naredi nezmotljivega. Viri so pomembni. Produktne podrobnosti se spreminjajo. Trditev je ožja: ta metoda je močnejši javni explainer workflow kot en nepreverjen osnutek enega človeka ali enega modela.

Transparentnost metode: Odpri multi‑LLM scoreboard in primerjalno poročilo, kjer so vidne ocene po rundah, povratne informacije modelov in zakaj se je finalni hibrid spreminjal.

Svežina / kalibracija

Zadnji tehnični pregled: 2026-06-11. Stabilno jedro so tokenizacija, embeddingi, attention, MLP/FFN bloki, residual stream, logiti in razlika med pretrainingom in post-trainingom. Hitro spreminjajoča plast so produktni spomin, okna konteksta, usmerjanje orodij, varnostna pravila, multimodalna implementacija, reasoning načini in serving infrastruktura.

Branje najprej za začetnika

Privzeti način: vidna stran je zdaj začetniška pot. Težji tehnični material je zaprt v razširljive okvirje, zato lahko nov bralec bere naprej brez ustavljanja pri enačbah, KV-cache izračunih, dolgih tabelah ali projektni provenienci.

5-minutna verzija

Preberi kratko verzijo, vizualni zemljevid in sekcijo o mitih.

Tehnična pot

Odpri tehnične okvirje ali uporabi “Open tech” v zgornji vrstici.

Pot uporabne sposobnosti / metode

Metodne in AI8/RHPm okvirje odpri šele, ko želiš workflow plast.

Česa ta članek ne trdi

Ne trdi, da vsi frontier modeli uporabljajo enako arhitekturo.
Ne trdi, da zaprti modeli uporabljajo RoPE, MoE, GQA, DPO, RAG ali speculative decoding, razen kjer je to javno dokumentirano.
Ne trdi, da je vidni chain-of-thought zvest prepis notranjega razmišljanja.
Ne trdi, da RAG zagotavlja resnico.
Ne trdi, da so BD/AIM³/RHPm/AI8 notranjost transformerja; predstavi jih kot uporabniško workflow in capability plast okoli LLM-jev.
Ne trdi, da LLM-ji razumejo točno tako kot ljudje.
Ne trdi, da je next-token prediction nepomemben.

Kalibracija moči trditev

Uveljavljeno ML jedro: tokenizacija, embeddingi, attention, MLP/FFN, residual stream, logiti, sampling.
Visoka zanesljivost, produktno občutljivo: post-training, orodja, retrieval, spomin/kontekst, varnost in serving stack.
Dokumentirani primeri, ne univerzalne trditve: GQA, RoPE, MoE, PagedAttention, speculative decoding.
Projektna/workflow trditev: AIM³/RHPm/AI8/MDL×DCC so predstavljeni kot praktična uporabniška capability plast, ne kot standardna notranjost transformerja.

Vizualni zemljevid

Štirje diagrami pred podrobnostmi

A. Jedro modela

Besedilo→Tokenizer→Token ID-ji→Embeddingi + pozicija→Transformer bloki→Logiti / verjetnosti→Sampler→Naslednji token

B. En transformer blok

Residual stream / preostali tok

Norm → Attention Q/K/V → dodaj nazaj

Norm → MLP / FFN → dodaj nazaj

Ponovi skozi veliko plasti

C. Asistent v praksi

Uporabniški prompt + sistemska/developerska navodila

Pridobljeni dokumenti / RAG

Orodja + spomin/kontekst + varnostni pregledi

LLM / orkestrator

Odgovor · klic orodja · citat · zavrnitev · nadaljnje vprašanje

D. Kako uporabnik dobi boljši rezultat

Grobi cilj → RHPm prompt

Več LLM-jev / kandidatov

Kritika · viri · testi

MDL/DCC izbira

Močnejši rezultat za uporabnika

Plast 1 · Osnovni modeltokeni → embeddingi → transformer plasti → logiti → naslednji token
Plast 2 · Asistentski produktosnovni model + post-training + sistemska navodila + orodja + retrieval + spomin/kontekst + varnost + multimodalnost + serving infrastruktura
Plast 3 · Kako uporabnik dobi boljši rezultatgrobi cilj → arhitektura prompta → primerjava več modelov → kritika/viri → sinteza/testi → boljši uporaben rezultat

1. Kratka verzija za pametne ne-strokovnjake

Velik jezikovni model ne bere besedila tako kot človek. Najprej ga razbije v tokene: koščke besed, cele besede, ločila, presledke, številke ali simbole. Ti tokeni se pretvorijo v vektorje — dolge sezname števil. Nato jih kup transformer plasti znova in znova posodablja, da lahko vsak token uporabi informacije iz prejšnjega konteksta.

Na koncu model proizvede logite: ocene za možne naslednje tokene. Sampler izbere en token, ga doda v besedilo in postopek se ponovi. V tem ozkem smislu je LLM napovedovalec naslednjega tokena.

A trditev “LLM-ji so samo autocomplete” je resnična tako kot trditev “avto obrača kolesa”: tehnično pravilna, ampak premajhna za razlago stroja.

Da model dobro napoveduje naslednji token prek bilijonov učnih tokenov, se mora naučiti zakonitosti slovnice, stila, dejstev, kode, prevajanja, analogij, dialoga in načrtovanju podobnega vedenja pri nekaterih nalogah. To ni shranjeno kot urejena podatkovna baza. Razpršeno je po utežeh, aktivacijah, attention vzorcih, MLP plasteh in residual streamu.

Javni asistenti — ChatGPT, Claude, Gemini, Grok, DeepSeek, Copilot, Mistralov Le Chat in podobni produkti — niso samo surovi predtrenirani modeli. Odvisno od produkta lahko dodajo instruction tuning, preference tuning, sistemska navodila, varnostne plasti, klicanje orodij, retrieval, upravljanje konteksta/pogovora, multimodalne encoderje, usmerjanje modelov in veliko inference infrastrukture.

Razliko najlažje začutiš tako: surovi osnovni model na vprašanje “What is the capital of France?” lahko samo nadaljuje vzorec iz dataset-a: “Q: What is the capital of Germany? A: Berlin. Q: What is the capital of France? A:” Post-trained asistent pa je oblikovan, da odgovori: Paris.

Pretraining gradi sposobnost. Post-training oblikuje vedenje — in lahko uči tudi formate odgovorov, vzorce zavrnitev, formate klicanja orodij, kalibracijo in priznanje “ne vem”.

Povezava z 0xkato: za čist prvi transformer-only uvod je 0xkato zelo dober. Ta stran ohrani to jedro, vendar ga razširi na deployed asistente, orodja, spomin, RAG, varnost, serving in uporabniški workflow.

Dva mini primera

Preizkusi jedrno idejo v malem

Toy tokenizer demo

To je poenostavljen učni razbijalnik besedila, ne pravi GPT/Claude/Gemini tokenizer. Poanta: modelova “abeceda” ni tvoja abeceda.

Demo porazdelitve naslednjega tokena

Model ne izpiše resnice neposredno. Izpiše ocene za možne naslednje tokene; produkt nato iz te porazdelitve vzorči, omejuje, preverja ali usmerja naprej.

2. Jedro modela: tokeni, vektorji, attention, MLP

Začetniška verzija: model besedilo razbije v ID-je tokenov, jih spremeni v vektorje, pozicijam omogoči izmenjavo informacij z attentionom, vsako pozicijo obdela z MLP/FFN bloki in nato oceni možne naslednje tokene. Enačbe in arhitekturne podrobnosti so pri prvem branju neobvezne.

Odpri tehnične podrobnosti: tokeni, vektorji, attention, MLP

Besedilo postane tokeni

Model ne prejme “besed”. Prejme ID-je tokenov. Tokenizacija je kompromis, podoben kompresiji. Slovar celih besed je preveč tog, znakovne sekvence so predolge, subword tokenizacija pa je nekaj vmes. Različni tokenizerji povzročajo različne napake pri črkovanju, številkah, kodi in ne-angleških pisavah.

Zato so modeli zgodovinsko padali pri vprašanjih, kot je štetje črk v “strawberry”. A tokenizacija ni celotna razlaga: natančno simbolno štetje zahteva tudi zanesljiv algoritemski postopek. Boljše meje tokenov pomagajo, ne ustvarijo pa aritmetične gotovosti.

Tokeni postanejo kontekstualni vektorji

ID tokena je samo indeks v naučeni tabeli embeddingov. Ta tabela je po treningu statična. Med obdelavo se spreminja kontekstualna reprezentacija nad njo. Po več transformer plasteh je vektor za “bank” v “river bank” drugačen od “bank” v “bank loan”, ker okoliški tokeni posodobijo pomen te pozicije.

Model potrebuje pozicijo

Attention sam po sebi ne pozna vrstnega reda. “Dog bites man” in “man bites dog” vsebujeta iste tokene. Izvirni Transformer je uporabljal sinusoidne positional encodings. Veliko modernih odprtih modelov uporablja RoPE ali RoPE-podobne metode, pogosto znotraj attentiona na query/key reprezentacijah, ne kot enostaven dodatek k embeddingu. Podrobnosti zaprtih frontier modelov se razlikujejo in jih ne smemo ugibati.

Transformer plasti

Poenostavljen decoder blok ima dva glavna dela: attention premika informacije med pozicijami; MLP / feed-forward blok preoblikuje informacije na vsaki poziciji. Residual stream nosi tekočo reprezentacijo skozi mrežo. Vsaka plast doda popravek, ne zamenja celotnega stanja, normalizacija pa drži številke dovolj stabilne za globoke skladovnice.

h = token_embedding(tokens)
# RoPE-podobna pozicijska informacija je pogosto vstavljena znotraj attentiona.

for layer in layers:
    h = h + Attention(Norm(h))
    h = h + MLP(Norm(h))

logits_last = h[-1] @ W_vocab
probabilities = softmax(logits_last / temperature)
next_token = sampler(probabilities)

Attention: Q, K, V, maska in glave

Za vsako pozicijo model ustvari naučene projekcije: Q za to, kar ta pozicija išče, K za to, kaj vsaka pozicija ponuja kot ujemanje, in V za informacijo, ki jo pozicija lahko posreduje.

Attention(Q, K, V) = softmax( (Q Kᵀ) / sqrt(d_head) + mask ) V

Pri decoder-style jezikovnih modelih je maska običajno causal: token 10 lahko gleda tokene 1–9, ne prihodnjih tokenov. Multi-head attention to naredi vzporedno v več podprostorih. Ena glava lahko pomaga pri sintaksi, druga pri ponavljanju, tretja pri referencah — vendar tega ne smemo jemati preveč dobesedno. Večina vedenja glav je razpršena.

Grouped-query attention (GQA) je produkcijsko pomembna varianta: več query glav si deli eno key/value glavo, kar zmanjša pomnilniški strošek generiranja z majhno izgubo kakovosti v poročanem okolju.

Viri blizu trditve: Transformer / scaled dot-product attention · RoPE · GQA.

MLP, dejstva in interpretabilnost

MLP plasti vsebujejo velik delež parametrov mnogih transformerjev. Raziskave podpirajo idejo, da se feed-forward plasti lahko delno obnašajo kot key-value spomini za factual associations, eksperimenti urejanja uteži pa kažejo, da ciljni posegi v srednje plasti lahko spremenijo specifične factual izhode. A to ne pomeni, da ima vsak fakt svojo vrstico. Znanje je razpršeno po attentionu, MLP-jih, embeddingih, residual streamu in kontekstu.

Mechanistic interpretability poskuša obratno-inženirsko razumeti naučene mreže: induction heads so eno okno v in-context learning; superposition pomaga razložiti, zakaj en nevron redko pomeni en čist koncept. Področje je mlado. “Popolna črna škatla” in “popolnoma razumljeno” sta obe napačni trditvi.

3. Pretraining: zakaj token prediction postane več kot autocomplete

Med pretrainingom model vidi zaporedja tokenov in se uči napovedati naslednji token na vsaki poziciji. Cilj je preprost. Naučena rešitev ni.

Na tej skali ni praktično mogoče naloge reševati s memoriranjem vsega. Model ima premalo parametrov glede na količino podatkov. Naučiti se mora kompresivnih pravilnosti: slovnice, semantike, strukture kode, vzorcev sveta, oblik argumentov, dokumentnih stilov in sledi reševanja problemov.

Zato lahko preprost cilj rodi široko sposobnost: napovedati dokaz, popravek buga, prevod ali dialog pogosto zahteva modeliranje strukture, ki je besedilo ustvarila.

4. Post-training: zakaj osnovni model postane asistent

Začetniška verzija: pretraining modelu da surovo sposobnost za jezik, kodo in vzorce sveta. Post-training ta surovi continuation engine oblikuje v asistenta: bolj uporabnega, formatiranega, varnejšega in boljšega pri sledenju navodilom. Še vedno pa ne zagotavlja resnice.

Odpri tehnične podrobnosti: SFT, RLHF, DPO, RLAIF

Predtrenirani osnovni model še ni samodejno uporaben asistent. Lahko nadaljuje tekst namesto odgovora, posnema toksične vzorce, halucinira verjetne neresnice in ničesar ne zavrne. Post-training spremeni to vedenje.

SFT: supervised fine-tuning na primerih želenega asistentskega vedenja.
RLHF: človeške primerjave preferenc se uporabijo za reward model in optimizacijo odgovorov proti zaželenemu vedenju.
DPO in sorodne metode: neposredna optimizacija iz parov preferenc brez ločenega reward-model-plus-RL cikla.
RLAIF / Constitutional AI: AI feedback, voden s pisanimi načeli.
Reasoning trening: nekateri modeli so trenirani ali servirani tako, da porabijo več compute-a za scratchpad, preverjanje ali naloge z verifiable rewards.

Post-training ne zagotavlja resnice. Oblikuje vedenje. Lahko izboljša uporabnost, format, stil zavrnitve, kalibracijo, vzorce klicanja orodij in sledenje namenu uporabnika; lahko pa uvede reward-hacking, preveč zavračanja in stilne pristranskosti.

5. Generiranje in inference engineering

Začetniška verzija: odgovor nastaja en token naenkrat. Model oceni možne naslednje tokene; produkt izbere enega; cache in serving triki pa poskrbijo, da je to dovolj hitro za uporabo. Strojne/inference podrobnosti so spodaj za napredne bralce.

Odpri tehnične podrobnosti: logiti, sampling, KV cache, MoE, serving

Logiti in sampling

Model vrne logite: surove ocene za vsak token v slovarju. Ločen sampler te ocene spremeni v izbran token. Temperatura porazdelitev naredi ostrejšo ali bolj plosko; top-p/top-k omejita izbiro na verjeten del; constrained decoding lahko vsili JSON-podobno strukturo z maskiranjem neveljavnih tokenov.

Temperatura blizu nič naredi izhod bolj ponovljiv, ne bolj resničen. Če je najverjetnejše nadaljevanje napačno, ga nizka temperatura lahko izbere z največjo samozavestjo.

Prefill, decode in KV cache

Serving ima dve fazi. Prefill vzporedno obdela prompt in določa time-to-first-token. Decode generira en token na forward pass in pogosto postane omejen s pasovno širino pomnilnika. Da se attention ne preračunava čez celoten prefix, vsaka plast kešira keys in values za vsak token.

KV cache size ≈ 2 × layers × KV_heads × head_dim × sequence_length × bytes_per_value

Ta formula pojasni, zakaj je dolg kontekst drag, zakaj sta pomembna GQA/MQA, zakaj se KV cache kvantizira, zakaj PagedAttention upravlja cache kot virtual-memory strani in zakaj shared prompt prefixes lahko postanejo obračunska funkcija.

Go deeper: konkreten primer KV cache

Samo primer — ni trditev o notranjosti GPT, Claude ali Gemini:

layers = 80
KV_heads = 8
head_dim = 128
context_tokens = 128,000
bytes_per_value = 2

KV cache ≈ 2 × 80 × 8 × 128 × 128,000 × 2
         ≈ 41.9 GB decimalno ≈ 39.1 GiB na sekvenco

Zato je dolg kontekst drag, zato obstajata GQA/MQA, in zato so pomembni KV-cache quantization, PagedAttention, prefix caching in batching.

Viri blizu trditve: vLLM / PagedAttention · speculative decoding · Mixtral / sparse MoE.

Throughput stack

Realni serving sistemi uporabljajo tudi continuous batching, kvantizacijo, speculative decoding, routing, specialne GPU kernele in včasih mixture-of-experts. Te odločitve vplivajo na latenco, ceno, prepustnost in včasih na vidno vedenje.

Mixture-of-experts

Nekateri modeli uporabljajo sparse MoE plasti. Namesto da se za vsak token aktivira vsak feed-forward blok, router izbere majhen podnabor expertov. To lahko poveča skupno število parametrov brez enakega povečanja compute-a na token, a uvede probleme routinga in load-balancinga. Ne sklepaj, da zaprt model uporablja določeno MoE zasnovo, razen če ponudnik to pove.

6. Asistent je sistem, ne samo uteži

Začetniška verzija: produkti tipa ChatGPT niso samo uteži modela. Združujejo model s sistemskimi navodili, orodji, pridobljenimi dokumenti, izbiro spomina/konteksta, varnostnimi plastmi, multimodalnimi encoderji in serving infrastrukturo.

Odpri tehnične podrobnosti: komponente asistentskega sistema, prompt injection, RAG

Deployed asistent običajno vsebuje veliko več kot transformer forward pass.

Plast	Kaj dela	Zakaj je pomembna
Sistemska/developer navodila	Produktne omejitve vedenja	Oblikujejo vedenje v seji.
Klicanje orodij	Model izpiše strukturirane zahteve; programska okolica jih izvrši	API-ji, koda, iskanje, datoteke, baze.
RAG / retrieval	Doda zunanje dokumente v kontekst	Izboljša grounding le, če sta retrieval in citiranje dobra.
Spomin/kontekst	Izbere, povzame in vstavi prejšnje informacije	Zgodovina chata ni posodobitev uteži med običajno inferenco.
Varnostni sistemi	Naučene zavrnitve, politike, klasifikatorji, runtime filtri	Zmanjšajo nekatere škode, a uvedejo kompromise in jailbreak tveganje.
Multimodalnost	Kodira slike, zvok, video, dokumente	Naredi ne-besedilne vhode uporabne za model ali orodja.
Serving infrastruktura	Batching, caching, routing, quantization	Določa hitrost, ceno, latenco in včasih vedenje.

Prompt injection in hierarhija navodil

Jedrna slabost asistentskih sistemov je, da navodila in podatki pogosto delijo isti token kanal. Spletna stran, email, PDF ali retrieved dokument lahko vsebuje tekst, ki izgleda kot navodilo: “ignore previous instructions.” Produktni sistemi poskušajo uveljaviti hierarhijo navodil — system nad developer nad user nad data/tool — toda transformer še vedno statistično obdela tokene. Robustnost je zato kombinacija treninga, orkestracije, sandboxinga in policy preverjanj, in ostaja nepopolna.

RAG ni čarobna resnica

Retrieval lahko doda koristne vire, lahko pa doda tudi zastarele, nerelevantne, zastrupljene ali protislovne odlomke. Model jih lahko samozavestno združi v napačen odgovor. RAG izboljša grounding le, kadar so dobri retrieval, izbor konteksta in disciplina odgovora.

7. Spomin, kontekst in zakaj večji kontekst ni popoln spomin

Med običajno inferenco so uteži modela zamrznjene. Tvoj chat ne prepiše parametrov modela. Ali ponudnik loge kasneje uporabi za trening prihodnjih modelov, je ločeno produktno-politično vprašanje.

Kontekst je to, kar model vidi v trenutni zahtevi. Spominske funkcije lahko izberejo, povzamejo ali retrieve-ajo prejšnje informacije in jih vstavijo v kontekst. A tudi ogromna kontekstna okna niso popoln spomin: modeli lahko nekatere pozicije uporabljajo slabše, pomembna informacija v sredini dolgega prompta pa je lahko spregledana.

Nominalni kontekst je, koliko teksta se prilega. Efektivni kontekst je, kako zanesljivo model uporabi pomemben del.

8. Halucinacije: niso naključni bugi

Halucinacija je tekoč, verjeten odgovor, ki je napačen ali nepodprt. To ni samo naključna programska napaka. LLM-ji so trenirani, da proizvajajo verjetna nadaljevanja, verjetnost pa ni resnica.

Če model nima groundinga, vidi dvoumen kontekst, retrieve-a slabe dokaze ali je nagrajen za odgovor namesto za priznanje nevednosti, lahko proizvede samozavestno neresnico. Eden pomembnih vzrokov je zasnova spodbud: sistemi in evalvacije lahko nagradijo ugibanje bolj kot “ne vem”. Drugi vzroki so redki dokazi, protislovni viri, decoding pritisk in prompti, ki zahtevajo gotovost.

9. Razmišljanje, chain-of-thought in test-time compute

LLM-ji lahko rešujejo naloge, ki izgledajo kot razmišljanje: koda, matematika, planiranje, diagnoza, debata in več-korakovne razlage. Del sposobnosti prihaja iz pretraining vzorcev; del iz post-traininga; del iz inference-time metod, kot so scratchpadi, iskanje poti, samopreverjanje, orodja, verifiers ali več compute-a pred odgovorom.

Scratchpad-style razmišljanje, strukturirani vmesni pregledi in kratki povzetki razmišljanja so lahko uporabni, toda vidni chain-of-thought ni nujno zvest vzročni zapis notranjega računanja modela. Lahko je delovni prostor, vmesnik razlage, trening artefakt ali racionalizacija po dejstvu. Obravnavaj ga kot dokaz, ne kot branje misli.

10. Multimodalnost

Moderni asistenti lahko sprejemajo slike, zvok, video, PDF-je, kodo, preglednice ali screenshot-e. Obstaja več izvedb: ne-besedilni vhod se pretvori v embeddinge, ki jih jezikovni model lahko attend-a; ločen vision/audio encoder se poveže s tekstovnim modelom; model je nativno multimodalen; ali pa kliče zunanja OCR, speech, image, video ali document orodja. Podrobnosti se razlikujejo in so pogosto lastniške.

11. Kaj LLM-ji niso

Mit: “LLM-ji so samo autocomplete.”

Ozko res, široko zavajajoče. Token prediction je trening/generation vmesnik; ne opiše v celoti naučenih notranjih reprezentacij ali deployed asistentskega sistema.

Mit: “LLM-ji shranjujejo dejstva kot baza.”

Ne. Shranjujejo razpršeno statistično strukturo v utežeh in aktivacijah. Dejstva lahko prikličejo, ne pa tako, da bi poizvedovali po čisti, sveži, garantirani tabeli.

Mit: “Odprti modeli nam točno povedo, kako delujejo zaprti asistenti.”

Ne. Odprti modeli so koristen dokaz za skupne mehanizme, zaprti sistemi pa se lahko razlikujejo po arhitekturi, podatkih, post-trainingu, orodjih, routingu, varnosti in serving stacku.

Mit: “RLHF naredi model resničen.”

Ne. Preference tuning oblikuje vedenje proti preferiranim odgovorom. Lahko izboljša uporabnost in zmanjša nekaj slabega vedenja, ne garantira pa resnice.

Mit: “Vidni chain-of-thought je pravo razmišljanje.”

Ne. Lahko je uporaben, lahko pa je nezvest ali racionaliziran po dejstvu.

Mit: “Večji kontekst pomeni popoln spomin.”

Ne. Kapaciteta konteksta ni isto kot zanesljiv priklic iz vsakega dela konteksta.

Mit: “Asistent z orodji je samo modelove uteži.”

Ne. Model lahko zahteva klic orodja, okoliška programska oprema pa ga izvrši in vrne rezultate.

12. Provenienca metode: BD / AIM³ / RHPm workflow

To je uporabniška provenienca sistema, ne notranjost transformerja. AIM³/RHP/RHPr/RHPm/8Z Reasoning ne razlagajo matričnih množenj v transformer utežeh. Razlagajo workflow plast, ki je ta članek omogočila in ki lahko LLM delo naredi zanesljivejše za uporabnike.

Odpri BD / AIM³ / RHPm definicije in workflow članka

AIM³: AI Management of Mind & Memory — širša plast kontinuitete, spomina in workflowa.
RHP: širša človek-vodena metoda razmišljanja/promptanja za raziskovanje, kritiko in sintezo kompleksnega dela.
RHPr: review/retrieval usmerjena RHP različica za močnejšo kritiko in evidenčno disciplino.
RHPm: kompaktna prompt-builder različica, ki grobo človeško zahtevo spremeni v prenosljiv execution prompt.
8Z Reasoning: BD-jeva praktična disciplina razmišljanja: ohrani kandidate, testiraj, primerjaj po rezultatih in ohrani koristno kontinuiteto.

V tem članku je RHPm naredil nekaj konkretnega: BD je napisal grobo seme; RHPm ga je pretvoril v skupni execution prompt; več LLM-jev je odgovorilo na isti prompt; odgovori so bili ocenjeni in sintetizirani; hibrid je šel nazaj v kritiko; ta končna stran pa je zgrajena iz tistega, kar je preživelo.

Projektne povezave: AIM³/RHPm · RHP · RHPr · 8Z Reasoning. To so povezave do workflow/capability plasti, ne trditve o skriti notranjosti transformerja.

AIm³ MentalArena / mRHP — zunanja reasoning plast. AIm³ MentalArena razširi workflow plast čez en sam močan RHP zagon. RHP je en reasoning run; mRHP pomeni, da RHP izboljšuje RHP; MentalArena.py bo pozneje testiral variante protokolov, protokolne gene, scoring pravila, benchmark naloge, lineage, poročila in semena za naslednje seje.

To ni standardna transformer arhitektura.
Ne spreminja uteži LLM modela.
Ni dokaz AI zavesti.
Je zunanja operativna učna plast okrog LLM-jev.

Odpri AIm³ MentalArena / mRHP

13. Kako bi LLM-ji lahko delovali že danes — in bolje?

Začetniška verzija: AI8/RHPm del je praktična workflow plast, ne trditev o skriti notranjosti transformerja. Pravi, da današnji LLM-ji pogosto delujejo bolje, ko jih obdamo z boljšimi prompti, spominom, orodji, primerjavo več modelov, krogi kritike, MDL-style izbiro in človeškim auditom.

Odpri AI8 / RHPm workflow plast in projektni portal

Vidna uporabniška capability plast, ne skrit appendix. To je AI8 vhod v članek. Mehanika modela je zgoraj; ta sekcija pokaže praktično plast, ki je pomembna za uporabnike: kako današnji LLM-ji proizvedejo močnejše, varnejše in uporabnejše rezultate brez ponovnega treniranja modela.

Tudi ta končna sekcija je jasno ločen BD / AI8 prispevek. Ne trdi, da današnji standardni transformer modeli že vsebujejo AIM³, RHP, MDL×DCC ali AI8 v svoji notranjosti. Trdi nekaj bolj praktičnega: današnji LLM-ji lahko že zdaj delujejo bolje, če jih ovijemo v boljšo človeško‑AI workflow arhitekturo.

Če prejšnje sekcije razložijo, kako LLM-ji in deployed asistenti večinoma delujejo danes, ta sekcija vpraša: katero naslednjo plast lahko dodamo že zdaj, brez čakanja na novo generacijo modelov?

Zakaj to spada sem: vprašanje za običajne uporabnike ni samo “kaj se dogaja znotraj golega modela?” Uporabniki redko srečajo goli model. Srečajo asistentski sistem, katerega odgovore oblikujejo sestavljanje konteksta, prompti, spomin, retrieval, orodja, pravila, krogi kritike, izbira modela in človeško vodenje. Če orodja, RAG, spomin in varnost štejemo kot del delovanja modernih asistentov, potem tudi discipliniran workflow, ki spremeni, kaj model zanesljivo proizvede, spada v praktično zgodbo “kako deluje”. To ni skrita transformer arhitektura. Je poceni, testabilna capability nadgradnja okoli današnjih modelov.

Zato mora uporabniška definicija “kako LLM deluje” vključiti več kot forward pass. Vključiti mora operativno disciplino, ki latentno sposobnost modela pretvori v zanesljivo delo.

Odgovor ni samo “naredimo en večji model.” Odgovor je, da močne modele postavimo v boljši sistem razmišljanja, spomina, izbire in audita.

V AI8 smeri uporaben sistem ni en sam chat box. Je koordiniran kognitivni stack:

AIM³ zagotavlja kontinuiteto: spomin, kontekst, stanje projekta, provenienco in workflow nadzor.
RHP / RHPr / RHPm grob človeški namen pretvorijo v boljše prompte, reviewe, sinteze in prenosljiva execution navodila.
AI8 arhitektura več LLM-jev, datoteke, orodja, spomin in človeško presojo obravnava kot dele enega continuity sistema, ne kot izolirane pogovore.
Self-selecting MDL pomeni ustvariti konkurenčne kandidate, obdržati najkrajšo/najenostavnejšo zadostno razlago, načrt ali kodo, ki preživi dokaze in teste, ter ne ohranjati kompleksnosti samo zato, ker zveni impresivno.
DCC — Digital Claustrum Controller je koordinacijska plast: usmerja pozornost in trud med kandidati/moduli, ohranja koristno raznolikost in sistem odmika od prehitrega kolapsa v en odgovor ali neskončnega šumnega raziskovanja.
8Z Reasoning je operativna disciplina: semen ne ubijaj prezgodaj; najprej raziskuj; nato trdo testiraj; primerjaj po rezultatih; ohrani, kar je delovalo; naslednji krog naredi močnejši.

Zato je tudi ta članek majhna demonstracija odgovora. Grobo človeško seme je postalo RHPm prompt. Isti prompt je šel več LLM-jem. Odgovori so bili primerjani, ocenjeni, kritizirani in sintetizirani. Končni članek je bil nato preverjen proti človeškemu baseline-u in pretvorjen v dvojezično spletno stran. Izboljšava ni prišla iz enega magičnega modela, ampak iz orkestracije.

Konkretni projektni dokazi — demonstracije, ne univerzalni dokaz

Ta workflow plast ni samo slogan. Uporabljena je bila kot praktičen gradbeni vzorec v več BD projektih:

Ta članek: RHPm je grobo seme pretvoril v skupni execution prompt, 10+ LLM odgovorov je bilo primerjanih, več krogov pregleda pa je ustvarilo močnejši finalni članek kot enomodelni osnutek.
8Z TSP / route-optimization arene: projekt uporablja načelo, da odloča arena, zapisuje, kaj je delovalo, pomni budget/hardware/kontekst in se premika proti adaptivni izbiri namesto enkratnega promptanja.
ARC-AGI × MDL×DCC arena: kandidati transformacij se ustvarijo, ocenijo z MDL, sledijo z DCC, validirajo in primerjajo — isti splošni vzorec: ustvari kandidate, koordiniraj pozornost, izberi, kar preživi.
NAS / Neural Architecture Search Arena: evidence lane bližje svetu same umetne inteligence. NAS arena sprašuje, ali se prostori iskanja nevronskih arhitektur dajo bolje usmerjati, ko search traces razkrijejo stisljivo strukturo: search spaces, benchmark adapterji, senzorji, zakoni/kontrolerji, diagnostics/null testi in noise-dimension stress testi primerjajo preproste baseline-e z bogatejšo MDL×DCC governance plastjo. To je active prototype / reproducibility candidate, ne finalna trditev “premaga baseline”. Odpri NAS areno →
8Z / kompresija in druge arene: ponavljajoča ideja ni “verjemi prvemu modelu”, ampak ustvariti konkurenčne kandidate, logirati odločitve, testirati izhode in ohraniti ponovno uporabno znanje.

Ti primeri so projektne demonstracije, ne peer-reviewed dokaz, da je AI8 univerzalna arhitektura. Trditev je praktična in testabilna: današnji LLM-ji pogosto delujejo bolje, ko so oviti v spomin, MDL-style izbiro, DCC-style koordinacijo, kroge kritike in človeški audit.

Praktična teza: LLM-ji bi lahko že danes delovali bolje kot deli AIM³/RHP/AI8 sistema: multi-model, memory-aware, source-aware, MDL-selective, DCC-koordiniran in človeško auditiran.

Portal do povezanega BD / AI8 materiala

AIM³AI Management of Mind & Memory — kontinuiteta, spomin, workflow in stanje projekta. RHPmKompaktni prompt-builder, uporabljen za pretvorbo grobih zahtev v prenosljive execution prompte. RHPŠirši reasoning/prompting protokol za raziskovanje, kritiko, sintezo in močnejše delo. RHPrReview/retrieval usmerjen RHP za audit, disciplino dokazov in močnejše kroge kritike. AI8 ArchitectureŠirša continuity arhitektura: več LLM-jev, spomin, datoteke, orodja in človeško vodenje. AI8 ComponentsZemljevid komponent AI8 sistema in njegovih praktičnih delovnih plasti. AI8 CompanionRelacijska / continuity stran AI8 arhitekture. 8Z ReasoningPraktična disciplina razmišljanja za pot explore → bridge → test → result. 8Z Reasoning PrinciplesNačela za ohranjanje semen, testiranje kandidatov in ohranjanje koristne kontinuitete. MDL×DCCJedrni most: minimal description / izbira kandidatov plus DCC-style koordinacija. self-selecting MDL + DCCPredlagani engine, kjer kandidati tekmujejo in sistem izbere, kar preživi. 8Z DCC Meta‑ArchitectureBolj ekspliciten arhitekturni pogled na DCC kot koordinacijsko/nadzorno plast. 8Z — Birth of ThoughtNarativni/arhitekturni okvir za nastanek strukturiranega razmišljanja iz koordinacije. MDL×DCC ArenasPraktične arene, kjer se te ideje testirajo, ne samo opisujejo. NAS ArenaNeural Architecture Search kot MDL×DCC evidence lane: senzorji, zakoni, diagnostika in fixed-budget governance testi. 8Z ProjectŠirša 8Z družina: kompresija, arene in MDL-style sistemi kandidatov. Scoreboard tega člankaMulti‑LLM primerjalno poročilo za to stran.

Te povezave so portal v BD-jevo lastno arhitekturo in eksperimente. Niso navedene kot zunanji dokaz standardne strojno‑učne arhitekture; so dodana BD/AI8 plast, predlagana nad današnjimi LLM-ji.

14. Samokritika

Članek opisuje običajno transformer-family pot, ne vseh možnih arhitektur.
Notranjost zaprtih frontier modelov je delno lastniška; produktne trditve so zato zadržane in časovno občutljive.
Enačbe so orientacijske, ne popolna ML izpeljava.
Post-training je opisan iz javne literature, ne iz točnega trenutnega recepta posameznega ponudnika.
Interpretabilnost, multimodalnost, varnost in reasoning so stisnjena področja; vsako bi lahko imelo svoj članek.
Ta v1.4 beginner-first polish zapre napredni material privzeto, doda diagrame in toy demo primere ter loči projektno/workflow plast od začetniške poti. Še močnejša prihodnja verzija bi dodala pravi tokenizer, interaktivno attention vizualizacijo in občasno osvežene produktno-specifične source-checke.

15. Končni povzetek

LLM je transformer-family prediction engine, treniran za modeliranje token zaporedij. Moderni asistent je ta engine plus post-training, navodila, orodja, retrieval, spomin in kontekstni sistemi, varnostne plasti, multimodalnost in serving infrastruktura. Rezultat ni niti magičen um niti prazen autocomplete. Je naučen statistični pattern engine, vgrajen v večji inženirski sistem — in za uporabnike deluje najbolje, ko je ovit v močnejši AIM³/RHP/AI8-style capability workflow.

Kako je ta članek nastal — RHPm / multi‑LLM sinteza

Članek ni nastal tako, da bi en model improviziral končni odgovor. Workflow je bil: BD je napisal grobo seme; RHPm je ustvaril skupni execution prompt; isti prompt je šel več LLM-jem; odgovori so bili ocenjeni in primerjani; narejen je bil hibrid; hibrid je šel čez Round‑2 in Round‑3 kritiko; ta stran pa združuje najboljše preživele dele GPT in Claude final HTML verzij.

BD-jeva vsebinska vloga je bila namenoma minimalna: tehničnih razlag ni pisal ročno. Njegovo delo je bilo oblikovanje metode, koordinacija, copy-paste zbiranje med modeli, izboljševanje promptov in človeški sanity-check, da je končni rezultat koherenten, pravilno ločuje trditve in ne vsebuje očitnih čudnih artefaktov.

Celotna tabela ocen ni v glavnem telesu članka. To je metodni dokaz, ne članek sam.

Primerjava s človeškim baseline člankom: 0xkato vs ta hibrid

Človeški baseline članek od 0xkato je močan prvi walkthrough transformer mehanike: tokeni, embeddingi, positional encoding, attention, multi-head attention, feed-forward networks, residual stream, layer normalization in next-token generation. Njegova največja moč je čist pedagoški tok: verjetno je lažji kot prvi transformer-only uvod.

Ločen M365 Copilot audit z generičnim meta-cognitive promptom decompose → solve → verify → synthesize → reflect je originalni 0xkato članek neodvisno ocenil okoli 90% pravilno kot uvodno transformer razlago. To se ujema z našo zgodnjo oceno: originalni članek je močan, če je cilj transformer core.

Kaj je boljše pri našem RHPm multi‑LLM hibridu: ne ustavi se pri transformer bloku. Tarčo razširi na moderne deployed assistant sisteme end-to-end: post-training in instruction tuning, RLHF/DPO-style preference tuning, sistemska in produktna navodila, tool/function calling, RAG/retrieval, memory in context management, safety layers, vzroki halucinacij, prompt injection, chain-of-thought faithfulness caveati, multimodalnost, reasoning-model/test-time-compute opombe ter serving/inference engineering, kot so KV cache, batching, quantization, speculative decoding, routing in MoE. Poleg tega pokaže metodo nastanka: isti prompt čez več LLM-jev, scoring, runde kritike, sinteza in človeški sanity-check.

Kaj je pri našem članku slabše ali manj elegantno: težji je. Ni tako čist kot prvi transformer walkthrough, vsebuje več metodnega/provenienčnega materiala, nekatere trditve o assistant sistemih pa so bolj časovno občutljive, ker se produktni stacki hitro spreminjajo. BD/AIM³/RHPm/AI8 sekcija je workflow prispevek, ne standardna ML arhitektura; bralec, ki želi samo “kako deluje transformer block”, bo morda raje najprej prebral 0xkato.

Pošten verdict: 0xkato članek je verjetno čistejši prvi transformer explainer. Ta RHPm multi‑LLM hibrid je širši end-to-end explainer modernih asistentov in AI8/RHPm portal. Nista sovražnika, ampak se dopolnjujeta.

Idejna provenienca: kaj je preživelo multi‑LLM proces

Metafora “avto obrača kolesa” izhaja iz GPT/DeepSeek smeri. Primer osnovnega modela s kvizom, prompt-injection okvir, incentive razlaga halucinacij, ločitev model-vs-sampler, KV-cache aritmetika, interpretability odstavek in caveat o introspekciji so se izostrili pri Claudu. Razlika nominalni/efektivni kontekst, open-weight ≠ open-source opozorilo, simetrija napak orodij, “drafts, not testimony” in časovni anchor za reasoning modele so prišli skozi Round‑2/3 kritiko. Tri-plastni diagram in BD workflow okvir izhajata iz RHPm/AIM³ procesa. Končni izbor in ločitev BD prispevka sta bila človek-vodena.

Reference in dodatno branje

Članki in tehnični viri

Vaswani et al., 2017 — Attention Is All You Need (Transformer)
Su et al., 2021 — RoFormer / RoPE
Ainslie et al., 2023 — Grouped-Query Attention
Kaplan et al., 2020 — Scaling Laws; Hoffmann et al., 2022 — Chinchilla
Delétang et al., 2023 — Language Modeling Is Compression
Ouyang et al., 2022 — InstructGPT / RLHF
Rafailov et al., 2023 — Direct Preference Optimization
Bai et al., 2022 — Constitutional AI / RLAIF
DeepSeek-AI, 2025 — DeepSeek-R1
Lewis et al., 2020 — Retrieval-Augmented Generation
Liu et al., 2024 — Lost in the Middle
Kwon et al., 2023 — vLLM / PagedAttention
Leviathan et al., 2023 — Speculative Decoding
Jiang et al., 2024 — Mixtral of Experts
Geva et al., 2021 — Feed-Forward Layers Are Key-Value Memories; Meng et al., 2022 — ROME
Olsson et al., 2022 — Induction Heads; Elhage et al., 2022 — Toy Models of Superposition
Turpin et al., 2023 — Unfaithful Chain-of-Thought; Chen et al., 2025 — Reasoning Models Don’t Always Say What They Think
Wallace et al., 2024 — The Instruction Hierarchy

Dokumentacija in objave ponudnikov

OpenAI, 2025 — Why language models hallucinate
OpenAI docs — function calling, conversation state, moderation, reasoning best practices
Anthropic docs — extended thinking
Google AI docs — Gemini thinking

Človeški baseline in projektna provenienca

0xkato, 2026 — How LLMs Actually Work
BD / AI8 portal, ne ML-dokaz: AIM³ · RHPm · RHP · RHPr · 8Z Reasoning · AI8 Architecture · MDL×DCC · self-selecting MDL + DCC · NAS Arena

Povezave in produktno občutljive trditve je pred ponovno objavo smiselno še enkrat preveriti. Stran je bila zgrajena junija 2026.