RHPm · real-world prompt benchmark · full source

How LLMs Actually Work
prompt benchmark

Full dark HTML rendering of How_LLMs_work.md: one rough question, one RHPm-generated audit prompt, and three answer paths showing what changes when context and prompt discipline change.

Core result: warm project context can rescue a weak prompt; RHPm makes the missing method portable; a cold weak prompt drifts generic.

Why this page exists

The compact RHPm page should stay fast and practical. This page carries the full source text, prompts, answers, and comparison material so the main page can link here without becoming heavy.

A

Warm simple

Weak prompt, good answer, because the AI8 project context already carries BD’s audit style.

B

RHPm prompt

Strong portable audit prompt: read first, cite sources, classify claims, score carefully.

C

Fresh RHPm

Best practical result: structured, sourced, bounded, and reusable in a new session.

D

Cold simple

Generic drift: partially useful, but weaker as a true audit of the exact article.

Scores from the comparison

| Path | Prompt score | Answer score | Meaning |
| --- | --- | --- | --- |
| A: simple prompt inside warm AI8/GPT context | 40/100 standalone · 60/100 warm-context | 87/100 | Weak prompt, strong answer because the surrounding project context already carries the missing method. |
| B: RHPm-generated audit prompt | 95/100 | 94/100 as prompt-builder output | Excellent portable prompt that forces reading, sources, claim grouping, severity, percentage range, and reusable verdict. |
| C: fresh GPT session using RHPm prompt | 95/100 | 92/100 | Best practical result. The audit method survives the move into a fresh session. |
| D: same simple prompt inside cold GPT account | 35/100 | 52/100 | Useful generic intuition, but weak audit behavior: too broad, under-sourced, and not anchored enough to the exact article. |

Main lesson

RHPm does not add intelligence to the model. It transfers the working discipline into the prompt so another LLM session can reproduce the method without already knowing BD, AI8, or the collaboration style.
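
The comparison can be reduced to simple arithmetic on the answer scores. The sketch below is illustrative only: the dictionary layout and the `answer_gap` helper are assumptions introduced here, and path A uses its standalone prompt score of 40. The numbers themselves are the ones reported in the score table.

```python
# Scores copied from the comparison table; the data layout is an assumption.
scores = {
    "A": {"prompt": 40, "answer": 87},  # simple prompt, warm AI8/GPT context
    "B": {"prompt": 95, "answer": 94},  # RHPm-generated audit prompt
    "C": {"prompt": 95, "answer": 92},  # fresh session using the RHPm prompt
    "D": {"prompt": 35, "answer": 52},  # same simple prompt, cold account
}


def answer_gap(path_a: str, path_b: str) -> int:
    """Return how many points path_a's answer scored above path_b's."""
    return scores[path_a]["answer"] - scores[path_b]["answer"]


# Same weak prompt, with and without warm project context.
context_effect = answer_gap("A", "D")

# Cold session, with and without the RHPm prompt.
prompt_effect = answer_gap("C", "D")

print(f"warm context adds {context_effect} points, RHPm prompt adds {prompt_effect}")
```

On these numbers, warm context lifts the weak prompt by 35 points (A vs D) and the RHPm prompt lifts a fresh session by 40 points (C vs D). Each figure compares single scored answers, not averages over repeated runs.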

Full transcript from How_LLMs_work.md

The blocks below preserve the full source content, converted into readable dark HTML sections. The exact raw Markdown is also included at the end.

A1 · Warm AI8/GPT context: original simple prompt
The short prompt as written in the warm project environment.

In my own GPT environment, where GPT has my working instructions, I wrote this prompt:

Read this and tell me whether it holds up, or roughly what percentage of it holds? https://www.0xkato.xyz/how-llms-actually-work/ If you can't see the article, I'll attach it.

A2 · Warm AI8/GPT context: GPT answer
The answer produced when the model already had BD / AI8 working context.

GPT's answer:

Yes, it mostly holds.

My estimate: ~88% correct as an introductory, popular-level article on how modern transformer-based LLMs work. If you read it under the headline “this is the whole truth about all LLMs and frontier AI systems”, it drops to ~70–75%, because quite a few outer-layer pieces are missing: post-training, RLHF/DPO, tool use, retrieval, memory systems, multimodality, reasoning training, system wrappers, inference optimizations, the safety pipeline, and so on.

Score by section

| Article section | How much holds | Comment |
| --- | --- | --- |
| Tokenization | 90–95% | It is right: models do not read words “directly” but token IDs. It slightly oversimplifies the “strawberry/R” problem: tokenization is a major reason, but not the only one. |
| Embeddings | 85–90% | Well explained. “King − man + woman ≈ queen” is a classic but somewhat dated, romanticized explanation; it works sometimes, not as a magic law. |
| Positional encoding / RoPE | 85–90% | The RoPE description is good. It is not universal across all models, though, and “long context generalization” is not automatic magic; it often needs scaling, training, or fine-tuning. |
| Attention / QKV / softmax | 92–95% | Very solid. The original transformer paper does define attention as a query–key–value mapping and a weighted sum of values, with scaled dot-product + softmax. (arXiv) |
| Causal masking | 95% | Correct for GPT-style decoder-only models: tokens must not look at future tokens. The original Transformer decoder uses masking so that a position cannot attend to later positions. (arXiv) |
| Multi-head attention | 90–95% | It corrects a common mistake well: heads are not necessarily “fixed slices” of the original vector but learned projections. |
| Induction heads | 80–85% | The idea is right, but the sentence “one of the clearest mechanisms for in-context learning” is a little strong. Anthropic's text supports induction heads as an important mechanism, but not as the whole explanation of every form of in-context learning. (arXiv) |
| FFN / MLP as “stored knowledge” | 75–85% | The direction is right: ROME and causal tracing show that some factual associations are strongly tied to MLP/FFN modules. But knowledge is not only there; it is also spread across the residual stream, attention, embeddings, and interactions between layers. (Rome) |
| Residual stream + layer norm | 90–95% | This is good. The Transformer does use residual connections and layer normalization around sub-layers; the modern pre-norm/RMSNorm story is correct as a trend. (arXiv) |
| Next-token prediction | 85–90% | The core holds: a base LLM is trained to predict the next token. It simplifies a little: during training, logits are often computed for all positions in parallel, not just for the last token; “last token only” applies mainly to autoregressive generation. |
| GQA / MoE / Mixtral | 85–90% | Direction correct. Mistral 7B uses GQA for faster inference and lower memory requirements; Mixtral is indeed a sparse MoE with 8 experts, 2 of which are selected for each token, with 46.7B total parameters and 12.9B active per token. (arXiv) (Mistral AI) |
| “Modern LLMs share same skeleton” | 80–85% | At a high level it holds: tokens, embeddings, positional info, transformer layers, attention, FFN, residuals, normalization, next-token loop. The article itself says this well. (0xkato) But frontier systems today are more than just a bare transformer. |

The main things the article gets right

The article correctly presents the “core mechanics”: tokenization → embeddings → positional signal → attention → FFN → residual stream → norm → logits → sampling loop. That really is the backbone of most modern LLMs. The article itself also says clearly that it covers the “core mechanisms” of transformer-based LLMs and that models differ in weights, configuration, data, and post-training. (0xkato)

It is also right about “lost in the middle”: research by Liu et al. showed that models often use information at the beginning and end of the context better than information in the middle. (ACL Anthology)

Where it is misleading or too narrow

The biggest weakness is not that it says anything foolish. The weakness is the title. “How LLMs Actually Work” sounds like a complete explanation, but in reality it is a well-written transformer explainer.

It is missing at least the following:

  1. Training data + scaling: why size, data quality, and compute change capabilities so strongly.
  2. Post-training: SFT, RLHF, DPO, constitutional/safety tuning, reasoning tuning.
  3. Inference systems: batching, KV-cache optimizations, quantization, speculative decoding, routing, server-side orchestration.
  4. Reasoning behavior: how chain-of-thought, scratchpads, self-consistency, verifiers, and agentic loops arise.
  5. Tool/RAG/memory wrappers: ChatGPT/Claude/Gemini as products are not just model weights; they often include search, tools, system instructions, memory, and safety classifiers.
  6. Multimodality: today's frontier models are no longer just text-only decoder-only transformers.

My final score

As an introductory technical article: 88/100. As a complete explanation of “how LLMs actually work”: 72/100. As a source for understanding the basic transformer core: very useful.

BD-style brutal: it is not bullshit. It is a good article. You just must not read it as “now I understand the entire modern AI system.” You understand its core, not the whole organism.

B0 · RHPm bridge note
The source note before the RHPm-generated prompt.

I put my previous prompt into "https://www.mdlxdcc.org/bd/bd_aim3_rhpm" and got this prompt:

B1 · RHPm-generated audit prompt
The portable prompt generated by BD_AIM3_RHPm.html.

You are an expert technical reviewer of LLM/AI explainer articles.

Your task is to read and evaluate this article:

https://www.0xkato.xyz/how-llms-actually-work/

The user wants to know whether the article is actually correct, and roughly what percentage of it is true / accurate.

Important:

  • Do NOT answer from memory.
  • First open/read the article.
  • If you cannot access the article, clearly say so and ask the user to attach or paste it.
  • Do not pretend you saw the article if you did not.
  • Use web browsing and cite sources.
  • Prefer primary or high-quality sources: OpenAI/Anthropic/Google DeepMind docs or papers, transformer/attention papers, ML textbooks, Distill-style technical explainers, or well-known research papers.
  • The user is not asking for a generic explanation of LLMs. They are asking for a truth/accuracy audit of this specific article.

Goal: Give a practical, honest assessment:

  1. What the article gets right.
  2. What it simplifies but is still acceptable.
  3. What is misleading, incomplete, outdated, or wrong.
  4. A rough percentage score for “how much it holds.”
  5. A short answer the user can reuse when explaining the verdict to someone else.

Use this scoring style:

  • 90–100%: very accurate, only minor simplifications.
  • 75–89%: mostly accurate, but with meaningful simplifications or missing nuance.
  • 60–74%: useful but mixed; several claims need correction.
  • 40–59%: partly useful, but misleading in important ways.
  • Below 40%: mostly wrong or unreliable.

When giving a percentage:

  • Do not make it fake-precise.
  • Use a range if needed, e.g. “about 80–85% accurate.”
  • Explain what the percentage means.
  • Separate factual correctness from pedagogical usefulness.

Suggested workflow:

  1. Read the article and identify its main claims.
  2. Group claims into themes, for example:
    • tokens and embeddings
    • transformer architecture
    • attention mechanism
    • training / pretraining
    • next-token prediction
    • reasoning and “understanding”
    • hallucinations
    • RLHF / instruction tuning
    • memory / context window
    • agents or tool use, if mentioned
  3. For each major claim, classify it:
    • Correct
    • Mostly correct but simplified
    • Misleading
    • Wrong
    • Unclear / unsupported
  4. Compare claims against reliable sources.
  5. Produce an overall score and practical verdict.

Output format:

Verdict on “How LLMs Actually Work”

Overall score

Give a percentage or percentage range, with confidence level.

Example: “Approx. 82–88% accurate, with medium confidence.”

One-paragraph answer

Give the direct human answer first.

What the article gets right

Use concise bullets.

What needs correction or nuance

Use concise bullets. Be specific.

Claim-by-claim audit

Use a table:

| Article claim / theme | Verdict | Why | Severity |

Severity options:

  • Low: harmless simplification
  • Medium: could mislead readers
  • High: technically wrong or strongly misleading

Best corrected explanation

Give a short, clean explanation of how LLMs actually work, correcting the article’s weak points.

Reusable short answer

Write 3–5 sentences the user can copy-paste to someone else.

Sources

Cite sources inline throughout the answer, not only at the end.

Tone:

  • Direct and honest.
  • Not overly academic.
  • Assume the user is intelligent but may not want dense ML jargon.
  • Do not be smug.
  • Do not dismiss the article just because it is simplified.
  • Reward good simplification when it helps understanding.
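
The scoring bands and the range-style verdict requested in the prompt map naturally onto a small lookup. This is a minimal sketch, not part of RHPm itself: `BANDS`, `band_for`, and `format_verdict` are hypothetical names, and placing a non-integer score such as 89.5 in the lower band is an assumption the prompt does not settle.

```python
# Lower bound of each band, highest first, with the prompt's own wording.
BANDS = [
    (90, "very accurate, only minor simplifications"),
    (75, "mostly accurate, but with meaningful simplifications or missing nuance"),
    (60, "useful but mixed; several claims need correction"),
    (40, "partly useful, but misleading in important ways"),
    (0, "mostly wrong or unreliable"),
]


def band_for(percent: float) -> str:
    """Return the band description for a score between 0 and 100."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    for floor, label in BANDS:
        if percent >= floor:
            return label
    raise AssertionError("unreachable: the last band starts at 0")


def format_verdict(low: int, high: int, confidence: str) -> str:
    """Format a score as a range with a confidence level, avoiding fake precision."""
    return f"Approx. {low}–{high}% accurate, with {confidence} confidence."


print(format_verdict(82, 88, "medium"))
print(band_for(85))
```

Reporting a range plus a confidence label, rather than a single number, is what the prompt asks for when it says not to make the percentage fake-precise.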

C0 · Fresh GPT session bridge note
The source note before the fresh-session result.

Based on the prompt above, a new GPT session wrote this:

C1 · Fresh GPT session: answer from RHPm prompt
The structured audit produced from the RHPm prompt in a new session.

Verdict on “How LLMs Actually Work”

Overall score

Approx. 84–88% accurate, medium-high confidence.

For the narrow subject it mostly explains — decoder-only transformer architecture — I’d score it closer to 88–92%. For the broader promise implied by the title — how modern deployed LLM assistants actually work end-to-end — I’d score it closer to 75–82%, because it under-covers training data, post-training, hallucinations, retrieval/tools, safety layers, evaluation, and serving/inference systems. Overall: about 85% accurate.

The article is pedagogically useful and mostly technically sound, but it is more accurately titled “How Transformer LLM Architecture Works” than “How LLMs Actually Work.” The article itself focuses on tokens, embeddings, positional encodings, attention, FFNs, residual streams, next-token prediction, KV cache, GQA, MoE, and related architecture details. (0xkato)

One-paragraph answer

The article mostly holds. It gives a good, practical explanation of the internal transformer stack: text becomes tokens, tokens become vectors, masked self-attention mixes information from prior context, feed-forward layers transform token representations, and the model generates by predicting the next token. Those points match standard explanations of tokenization, transformer attention, and autoregressive decoding. (OpenAI Help Center) The weak points are that it slightly overstates some intuitions — especially tokenization as the reason for counting failures, embedding-vector analogies, induction heads, and “facts in FFNs” — and it does not give enough space to post-training, hallucinations, tool use, retrieval, safety policies, and product-level systems, which strongly shape ChatGPT/Claude/Gemini-style assistants. (arXiv)

What the article gets right

  • Tokens first, text second. LLMs process token IDs, not raw human text directly. OpenAI’s own docs describe tokens as chunks such as characters, words, punctuation, or partial words, and tiktoken is OpenAI’s BPE tokenizer library. (OpenAI Help Center)

  • Subword tokenization is real and important. The article’s BPE/SentencePiece discussion is broadly correct. SentencePiece is a standard subword tokenizer/detokenizer system used in neural text processing. (ACL Anthology)

  • The transformer explanation is mostly correct. The article’s Q/K/V, scaled dot-product attention, softmax-weighted value mixing, and causal masking are aligned with standard transformer explanations. (nlp.seas.harvard.edu)

  • Multi-head attention is explained well. It correctly says heads use learned projections rather than simply fixed slices of the vector. That is an important detail many beginner explainers get wrong. (nlp.seas.harvard.edu)

  • RoPE is described in the right direction. Rotary positional embeddings encode position through rotations and make relative-position information available inside self-attention. (arXiv)

  • The FFN/MLP section is one of the stronger parts. There is real research support for feed-forward layers acting partly like key-value memories and for mid-layer FF modules playing an important role in factual associations. (ACL Anthology)

  • The article correctly separates architecture from weights. The transformer skeleton is not the intelligence by itself; the learned weights, training data, objective, scale, post-training, and inference setup matter enormously. The article states this in simplified but valid form. (0xkato)

  • GQA, KV cache, MoE, and speculative decoding are real modern inference/architecture ideas. The article’s inclusion of these topics is a strength, not fluff. Mistral 7B, for example, explicitly uses grouped-query attention to speed inference and reduce memory during decoding. (arXiv) Mixtral uses sparse mixture-of-experts routing, where each token is routed to selected experts rather than all FFN experts. (arXiv) Speculative decoding is also a real method for accelerating autoregressive generation. (arXiv)

What needs correction or nuance

  • It is not a full “how LLMs actually work” article. It is mainly an architecture explainer. It does not deeply explain data curation, pretraining loss, scaling, optimizer dynamics, instruction tuning, RLHF/RLAIF/DPO, safety policies, system prompts, retrieval, tool use, evaluation, or deployment.

  • Tokenization is overused as an explanation for mistakes. Tokenization helps explain why letter-level tasks like counting “r” in “strawberry” can be unintuitive, but it is not the whole story. Exact character counting also requires reliable symbolic/algorithmic behavior, not just better tokens. OpenAI’s docs support the token-level framing, but not the stronger claim that tokenization alone explains these failures. (OpenAI Help Center)

  • The embedding analogy is useful but oversold. “king − man + woman ≈ queen” is a classic word-vector intuition, but modern LLM behavior depends far more on contextual hidden states across many layers than on the initial embedding lookup alone. Word-vector research supports semantic vector relationships, but also notes limitations such as insensitivity to word order in simpler embedding systems. (arXiv)

  • RoPE and long context are not magic. RoPE helps with positional representation and relative-position behavior, but long-context models can still fail depending on where relevant information appears. The “Lost in the Middle” paper shows performance can degrade when relevant information is placed in the middle of long contexts. (arXiv)

  • Induction heads are real, but the article overstates certainty. Induction heads are an important mechanistic-interpretability finding, but the strongest causal evidence is in small attention-only models; for larger models with MLPs, evidence is more correlational and incomplete. (arXiv)

  • “Facts live in FFNs” needs nuance. FFNs contain a lot of model capacity and can encode factual associations, but LLM knowledge is distributed across layers, attention, FFNs, activations, and residual streams. ROME-style edits show important evidence, but they do not mean there is a clean database row for every fact. (arXiv)

  • Post-training deserves more weight. Modern assistants are not just pretrained next-token predictors. Instruction tuning and human/preference feedback strongly shape helpfulness, truthfulness, refusal behavior, tone, and task-following. InstructGPT showed that fine-tuning with human feedback improved alignment with user intent and helped reduce toxic/untruthful outputs, while still leaving mistakes. (arXiv)

  • Hallucinations are under-explained. The article mentions generation mechanics, but a serious “how LLMs work” explainer should explain why models can produce plausible but false claims. OpenAI describes hallucinations as plausible but false outputs that persist partly because evaluation setups often reward guessing over calibrated uncertainty. (OpenAI)

Claim-by-claim audit

| Article claim / theme | Verdict | Why | Severity |
| --- | --- | --- | --- |
| Text is converted into token IDs before the model processes it. | Correct | This matches OpenAI’s explanation that models process text as tokens, and tiktoken explicitly converts text into token IDs. (OpenAI Help Center) | Low |
| BPE/SentencePiece-style subword tokenization explains why words are split strangely. | Correct | BPE and SentencePiece are standard subword-tokenization approaches. The article’s examples are directionally right. (GitHub) | Low |
| Letter-counting mistakes are mainly because the model sees tokens, not letters. | Mostly correct but simplified | Tokenization is part of the explanation, but exact counting also requires robust algorithmic/symbolic behavior. The article makes this sound cleaner than it is. | Medium |
| Embedding vectors give token IDs learned meaning. | Correct | The embedding lookup idea is standard. The article’s intuition is fine. | Low |
| Similar concepts live near each other in vector space; “king − man + woman ≈ queen.” | Mostly correct but simplified | This is a useful word-vector analogy, but it is not a precise guarantee for modern LLM token embeddings. Contextual representations after many layers matter more. (arXiv) | Medium |
| Self-attention needs positional information because attention alone has no sequence order. | Correct | Transformer attention needs position information to represent order. RoPE is a modern way to encode positional structure. (nlp.seas.harvard.edu) | Low |
| RoPE is widely used and helps relative-position behavior. | Correct with nuance | RoPE does incorporate relative-position dependency into attention, but long-context reliability is still not guaranteed. (arXiv) | Low |
| Attention works through Q/K/V, scaled dot products, softmax weights, and value mixing. | Correct | This matches the standard scaled dot-product attention formulation. (nlp.seas.harvard.edu) | Low |
| Causal masking prevents a decoder model from looking at future tokens. | Correct | This is standard decoder self-attention behavior. (nlp.seas.harvard.edu) | Low |
| Multi-head attention uses learned projections, not fixed slices. | Correct | The article gets this important detail right. (nlp.seas.harvard.edu) | Low |
| Attention heads can become specialized circuits, including induction heads. | Mostly correct but simplified | Induction heads are real, but their general role in large-model in-context learning is not fully settled. The source paper itself says the strongest causal evidence is in small attention-only transformers and larger-model evidence is more indirect. (arXiv) | Medium |
| GQA and KV cache reduce inference memory/compute pressure. | Correct | Mistral 7B explicitly uses grouped-query attention for faster inference and reduced memory during decoding. (arXiv) | Low |
| Attention is expensive and scales badly with context length. | Correct with nuance | Full attention has quadratic cost in sequence length; during cached autoregressive decoding, each new token attends over prior context, so the practical cost profile depends on prefill vs decode. | Low |
| FFNs expand, apply nonlinear transformation, and compress back. | Correct | This is the standard transformer MLP/FFN structure. | Low |
| FFNs store semantic/factual structure. | Mostly correct but simplified | There is strong evidence FF layers act partly like key-value memories and mid-layer FF modules matter for factual associations, but knowledge is distributed, not a simple database. (ACL Anthology) | Medium |
| MoE models route tokens to selected experts. | Correct | Mixtral uses sparse MoE layers where a router selects experts per token. (arXiv) | Low |
| Residual connections and normalization stabilize deep transformer training. | Correct with simplification | LayerNorm/RMSNorm are important stabilizers. RMSNorm keeps rescaling while dropping recentering, improving efficiency while remaining comparable. (arXiv) | Low |
| LLM generation is next-token prediction repeated in a loop. | Correct | Autoregressive generation does work this way at inference. The nuance is that pretraining computes loss over many token positions efficiently in parallel. | Low |
| Decoding settings like temperature/top-p change output behavior. | Correct | This is standard sampling behavior. The article’s explanation is acceptable. | Low |
| Speculative decoding can speed generation. | Correct | Speculative decoding is a real technique that can accelerate autoregressive sampling without changing the output distribution, under the method’s assumptions. (arXiv) | Low |
| Post-training exists after base pretraining. | Correct but incomplete | The article mentions it, but does not explain how much instruction tuning and human/preference feedback shape assistant behavior. (arXiv) | Medium |
| “Modern LLMs mostly share the same transformer-family skeleton.” | Mostly correct | Broadly true for many major text LLMs, especially decoder-only transformer systems, but proprietary frontier details are not fully public and some newer systems use additional architectures, tools, retrieval, or multimodal components. | Medium |
| Hallucinations are not central to the article. | Incomplete | For a full “how LLMs work” explainer, hallucination mechanisms should be covered. OpenAI describes hallucinations as plausible but false model outputs and connects their persistence to training/evaluation incentives. (OpenAI) | Medium |

Best corrected explanation

A modern text LLM usually starts by tokenizing input text into token IDs. Those IDs are mapped into learned vectors, combined with positional information, and passed through many transformer layers. Inside each layer, masked self-attention lets each token mix information from earlier tokens using learned query, key, and value projections; feed-forward/MLP layers then transform each token representation and provide much of the model’s capacity. Residual connections and normalization keep very deep stacks trainable. (OpenAI Help Center)
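
As a rough illustration of the masked self-attention step in this explanation, here is a minimal sketch in plain Python. It assumes a single head, toy dimensions, and queries, keys, and values passed in directly rather than produced by learned projections; `softmax` and `causal_attention` are hypothetical helper names, not code from the article or from any library.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def causal_attention(queries, keys, values):
    """Single-head scaled dot-product attention with a causal mask.

    Each argument is a list of per-token vectors (lists of floats).
    """
    d_k = len(keys[0])
    outputs = []
    for i, q in enumerate(queries):
        # Causal mask: position i may only attend to positions 0..i.
        scores = [
            sum(a * b for a, b in zip(q, keys[j])) / math.sqrt(d_k)
            for j in range(i + 1)
        ]
        weights = softmax(scores)
        # Output is the attention-weighted sum of the visible value vectors.
        mixed = [
            sum(w * values[j][c] for j, w in enumerate(weights))
            for c in range(len(values[0]))
        ]
        outputs.append(mixed)
    return outputs


tokens = [[1.0, 0.0], [0.0, 1.0]]
out = causal_attention(tokens, tokens, tokens)
print(out)
```

The first token can only see itself, so its output equals its own value vector; the second token mixes both, weighted toward the key it matches best.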

During pretraining, the model’s weights are adjusted so it becomes better at predicting missing or next tokens across huge text/code/multimodal corpora. During inference, a decoder-style LLM repeatedly predicts a distribution over the next token, samples or selects one, appends it, and continues. KV caching, grouped-query attention, MoE routing, and speculative decoding are engineering/architecture techniques that make this process cheaper or faster. (arXiv)

But deployed assistants are more than base next-token predictors. They are shaped by instruction tuning, preference/RLHF-style training, system prompts, safety policies, retrieval, tools, and product orchestration. They do not store a clean database of facts; knowledge and skills are distributed across weights and activations. Because they generate plausible continuations rather than directly checking truth by default, they can reason impressively but also hallucinate. (arXiv)
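
The inference loop described in this explanation (predict a distribution over the next token, sample or select one, append it, continue) can be sketched as follows. Everything here is an assumption made for illustration: `toy_model` is a hypothetical stand-in that returns hand-made logits, not a real LLM, and `sample_next` and `generate` are invented names.

```python
import math
import random


def sample_next(logits, temperature=1.0):
    """Pick a token ID from logits; temperature 0 means greedy selection."""
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    weights = [math.exp(s - m) for s in scaled]  # unnormalized softmax
    return random.choices(range(len(logits)), weights=weights, k=1)[0]


def generate(model, prompt_ids, max_new_tokens, eos_id, temperature=0.0):
    """Autoregressive loop: predict, pick a token, append, repeat."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        next_id = sample_next(model(ids), temperature)
        ids.append(next_id)
        if next_id == eos_id:
            break
    return ids


def toy_model(ids):
    """Hypothetical model over a 4-token vocabulary: favors (last ID + 1) mod 4."""
    target = (ids[-1] + 1) % 4
    return [5.0 if i == target else 0.0 for i in range(4)]


result = generate(toy_model, [0], max_new_tokens=10, eos_id=3)
print(result)
```

With temperature 0 the loop is deterministic; a positive temperature samples from the softmax of the scaled logits, which is where decoding settings change output behavior.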

Reusable short answer

The article is mostly right — I’d call it about 85% accurate. It explains transformer LLM architecture well: tokens, embeddings, attention, FFNs, residual/norm layers, next-token generation, KV cache, GQA, and MoE are mostly described correctly. The main weakness is that it over-simplifies a few mechanisms and under-explains post-training, hallucinations, tools/RAG, and how deployed AI assistants are actually shaped. So it is a good article, but more precisely it explains how transformer LLM architecture works, not the full story of modern AI assistants.

Sources

Main sources used: the audited article itself; OpenAI tokenization docs and tiktoken; transformer attention explanations; RoPE; Lost in the Middle; Mistral/Mixtral architecture papers; FFN memory and ROME papers; RMSNorm; InstructGPT/RLHF; speculative decoding; and OpenAI’s hallucination discussion. (0xkato)

D1 · Cold GPT account: same simple prompt
The same short prompt given where GPT did not know BD or the AI8 project style.

In another GPT environment, where GPT does not have my way of working, I asked GPT the same thing again:

Read this and tell me whether it holds up, or roughly what percentage of it holds? https://www.0xkato.xyz/how-llms-actually-work/ If you can't see the article, I'll attach it.

D2 · Cold GPT account: GPT answer
The weaker generic answer produced from the cold simple prompt.

GPT answered:

In short: the article probably captures roughly 70–85% of the truth, depending on how categorical its claims are. If its main message is:

"An LLM is fundamentally a next-token prediction machine that builds complex behavior through the attention mechanism."

then that is fairly accurate. (Knovo)

But if the article claims or implies:

"An LLM is just a next-token predictor and nothing more."

then it becomes much more questionable, and I would call it 50–60% correct. Among researchers, too, there are more and more arguments that this is technically true at the level of the training objective but is not a good explanation of what the model has actually learned. (arXiv)

My score for individual points:

| Claim | My score |
| --- | --- |
| An LLM generates text token by token | 100% |
| Training is based on next-token prediction | 100% |
| Attention is the key mechanism for context | 95% |
| The model has no fact database like a SQL database | 95% |
| "Understanding" arises as an emergent property | 80–90% |
| An LLM is just a very advanced autocomplete | 40–60% |
| There is no internal world model inside it | 20–50% |
| Reasoning is merely an illusion | 30–70% (depending on the definition of reasoning) |

The biggest problem with articles like this is usually a conflation:

training objective ≠ internal structure

It is similar to:

"A human is just an organism that maximizes reproduction."

In the evolutionary sense that is true. But you cannot explain mathematics, music, philosophy, or science from it.

Similarly:

"An LLM predicts the next token."

That is true.

But it does not automatically follow that no abstract representation of concepts, spatial relations, logic, causality, and so on has formed in the network. Quite the opposite: a lot of research shows that very rich internal representations form during training. (arXiv)

Given what you showed me yesterday for RHPr, RHP, and AIm3, this is probably the part that would stand out to you most:

The article probably describes the mechanics of generation, not the nature of representation.

It is like describing the brain with:

"A neuron triggers the next action potential."

Technically correct.

But it does not explain how mathematics, consciousness, or creativity arise from it.

If you paste the original article (or a PDF or HTML), I can read it in full and give you a much more precise assessment along these lines:

  • what holds,
  • what is simplified,
  • what is outdated,
  • what is contested among researchers,
  • and above all how it does or does not match your RHPr/AIm3 view.

Exact raw Markdown source

This is the complete original text from How_LLMs_work.md, kept as a raw source block for auditability.

Moj prompt v mojem GPT okolju, kjer ima GPT moja navodila za delo, sem napisal tole:

to preberi in poveč če to drži oz v % koliko % drži?
https://www.0xkato.xyz/how-llms-actually-work/
če članka ne vidiš ti ga pripnem



==================================================================================

GPT odgovor:

Ja — **v glavnem drži**.

Moja ocena: **~88% pravilno** kot uvoden, poljuden članek o tem, kako delujejo moderni **transformer-based LLM-i**.
Če ga bereš kot naslov “to je celotna resnica o vseh LLM-jih in frontier AI sistemih”, potem pade na **~70–75%**, ker manjka precej zunanjih/plastnih stvari: post-training, RLHF/DPO, tool use, retrieval, memory sistemi, multimodalnost, reasoning trening, sistemski wrapperji, inference optimizacije, safety pipeline itd.

### Ocena po delih

| Part of the article | How well it holds | Comment |
| --------------------------------- | ----------: | ------- |
| Tokenization | **90–95%** | Correct: models do not read words “directly” but token IDs. It slightly oversimplifies the “strawberry/R” problem: tokenization is a major reason, but not the only one. |
| Embeddings | **85–90%** | Well explained. “King − man + woman ≈ queen” is a classic but somewhat dated, romanticized illustration; it works sometimes, not as a magic law. |
| Positional encoding / RoPE | **85–90%** | The RoPE description is good. RoPE is not universal across all models, though, and “long context generalization” is not automatic magic; it often requires scaling, training, or fine-tuning. |
| Attention / QKV / softmax | **92–95%** | Very solid. The original Transformer paper does define attention as a query–key–value mapping and a weighted sum of values, using scaled dot-product + softmax. ([arXiv][1]) |
| Causal masking | **95%** | Correct for GPT-style decoder-only models: tokens must not look at future tokens. The original Transformer decoder uses masking so that a position cannot attend to later positions. ([arXiv][1]) |
| Multi-head attention | **90–95%** | It nicely corrects a common mistake: heads are not necessarily “fixed slices” of the original vector, but learned projections. |
| Induction heads | **80–85%** | The idea is right, but the phrase “one of the clearest mechanisms for in-context learning” is a bit strong. Anthropic's paper supports induction heads as an important mechanism, but not as a complete explanation of all forms of in-context learning. ([arXiv][2]) |
| FFN / MLP as “stored knowledge” | **75–85%** | The direction is right: ROME and causal tracing show that some factual associations are strongly tied to MLP/FFN modules. But knowledge is not only there; it is also spread across the residual stream, attention, embeddings, and interactions between layers. ([Rome][3]) |
| Residual stream + layer norm | **90–95%** | Good. The Transformer does use residual connections and layer normalization around sub-layers; the modern pre-norm/RMSNorm story is correct as a trend. ([arXiv][1]) |
| Next-token prediction | **85–90%** | The core holds: a base LLM is trained to predict the next token. A slight simplification: during training, logits are often computed for all positions in parallel, not only for the last token; “last token only” applies mainly to autoregressive generation. |
| GQA / MoE / Mixtral | **85–90%** | The direction is right. Mistral 7B uses GQA for faster inference and lower memory requirements; Mixtral is indeed a sparse MoE with 8 experts, of which 2 are selected for each token, with 46.7B total parameters and 12.9B active per token. ([arXiv][4]) ([Mistral AI][5]) |
| “Modern LLMs share same skeleton” | **80–85%** | Holds at a high level: tokens, embeddings, positional info, transformer layers, attention, FFN, residuals, normalization, next-token loop. The article itself says this well. ([0xkato][6]) But frontier systems today are more than a bare transformer. |

### The main things the article gets right

The article correctly presents the “core mechanics”: tokenization → embeddings → positional signal → attention → FFN → residual stream → norm → logits → sampling loop. This really is the basic backbone of most modern LLMs. The article itself also states clearly that it covers the “core mechanisms” of transformer-based LLMs and that models differ in weights, configuration, data, and post-training. ([0xkato][6])

It is also right about “lost in the middle”: the study by Liu et al. showed that models often use information at the beginning and end of the context better than information in the middle. ([ACL Anthology][7])

### Where it is misleading or too narrow

The biggest weakness is not that it says anything foolish. The weakness is the **title**. “How LLMs Actually Work” sounds like a complete explanation, when in reality it is a **well-written transformer explainer**.

It is missing at least the following:

1. **Training data + scaling**: why size, data quality, and compute change capabilities so strongly.
2. **Post-training**: SFT, RLHF, DPO, constitutional/safety tuning, reasoning tuning.
3. **Inference systems**: batching, KV-cache optimizations, quantization, speculative decoding, routing, server-side orchestration.
4. **Reasoning behavior**: how chain-of-thought, scratchpads, self-consistency, verifiers, and agent loops come into play.
5. **Tool/RAG/memory wrappers**: ChatGPT/Claude/Gemini as products are not just model weights; they often include search, tools, system instructions, memory, and safety classifiers.
6. **Multimodality**: today's frontier models are no longer just text-only decoder-only transformers.

### My final score

**As an introductory technical article: 88/100.**
**As a complete explanation of “how LLMs actually work”: 72/100.**
**As a source for understanding the basic transformer core: very useful.**

BD-style brutal: **it is not bullshit. It is a good article.** Just don't read it as “now I understand the entire modern AI system.” You understand its **core**, not the whole organism.

[1]: https://arxiv.org/html/1706.03762v7 "Attention Is All You Need"
[2]: https://arxiv.org/abs/2209.11895 "In-context Learning and Induction Heads"
[3]: https://rome.baulab.info/ "Locating and Editing Factual Associations in GPT"
[4]: https://arxiv.org/html/2310.06825 "Mistral 7B"
[5]: https://mistral.ai/news/mixtral-of-experts/ "Mixtral of experts | Mistral AI"
[6]: https://www.0xkato.xyz/how-llms-actually-work/ "How LLMs Actually Work | 0xkato"
[7]: https://aclanthology.org/2024.tacl-1.9/ "Lost in the Middle: How Language Models Use Long Contexts - ACL Anthology"






==================================================================================

I put my previous prompt into "https://www.mdlxdcc.org/bd/bd_aim3_rhpm" and got this prompt:

==================================================================================

You are an expert technical reviewer of LLM/AI explainer articles.

Your task is to read and evaluate this article:

https://www.0xkato.xyz/how-llms-actually-work/

The user wants to know whether the article is actually correct, and roughly **what percentage of it is true / accurate**.

Important:
- Do NOT answer from memory.
- First open/read the article.
- If you cannot access the article, clearly say so and ask the user to attach or paste it.
- Do not pretend you saw the article if you did not.
- Use web browsing and cite sources.
- Prefer primary or high-quality sources: OpenAI/Anthropic/Google DeepMind docs or papers, transformer/attention papers, ML textbooks, Distill-style technical explainers, or well-known research papers.
- The user is not asking for a generic explanation of LLMs. They are asking for a truth/accuracy audit of this specific article.

Goal:
Give a practical, honest assessment:
1. What the article gets right.
2. What it simplifies but is still acceptable.
3. What is misleading, incomplete, outdated, or wrong.
4. A rough percentage score for “how much it holds.”
5. A short answer the user can reuse when explaining the verdict to someone else.

Use this scoring style:
- **90–100%**: very accurate, only minor simplifications.
- **75–89%**: mostly accurate, but with meaningful simplifications or missing nuance.
- **60–74%**: useful but mixed; several claims need correction.
- **40–59%**: partly useful, but misleading in important ways.
- **Below 40%**: mostly wrong or unreliable.

When giving a percentage:
- Do not make it fake-precise.
- Use a range if needed, e.g. “about 80–85% accurate.”
- Explain what the percentage means.
- Separate factual correctness from pedagogical usefulness.

Suggested workflow:
1. Read the article and identify its main claims.
2. Group claims into themes, for example:
   - tokens and embeddings
   - transformer architecture
   - attention mechanism
   - training / pretraining
   - next-token prediction
   - reasoning and “understanding”
   - hallucinations
   - RLHF / instruction tuning
   - memory / context window
   - agents or tool use, if mentioned
3. For each major claim, classify it:
   - Correct
   - Mostly correct but simplified
   - Misleading
   - Wrong
   - Unclear / unsupported
4. Compare claims against reliable sources.
5. Produce an overall score and practical verdict.

Output format:

# Verdict on “How LLMs Actually Work”

## Overall score
Give a percentage or percentage range, with confidence level.

Example:
“Approx. 82–88% accurate, with medium confidence.”

## One-paragraph answer
Give the direct human answer first.

## What the article gets right
Use concise bullets.

## What needs correction or nuance
Use concise bullets. Be specific.

## Claim-by-claim audit
Use a table:

| Article claim / theme | Verdict | Why | Severity |
|---|---|---|---|

Severity options:
- Low: harmless simplification
- Medium: could mislead readers
- High: technically wrong or strongly misleading

## Best corrected explanation
Give a short, clean explanation of how LLMs actually work, correcting the article’s weak points.

## Reusable short answer
Write 3–5 sentences the user can copy-paste to someone else.

## Sources
Cite sources inline throughout the answer, not only at the end.

Tone:
- Direct and honest.
- Not overly academic.
- Assume the user is intelligent but may not want dense ML jargon.
- Do not be smug.
- Do not dismiss the article just because it is simplified.
- Reward good simplification when it helps understanding.





==================================================================================

Based on the prompt above, a fresh GPT session wrote this:

==================================================================================

# Verdict on “How LLMs Actually Work”

## Overall score

**Approx. 84–88% accurate, medium-high confidence.**

For the narrow subject it mostly explains — **decoder-only transformer architecture** — I’d score it closer to **88–92%**. For the broader promise implied by the title — **how modern deployed LLM assistants actually work end-to-end** — I’d score it closer to **75–82%**, because it under-covers training data, post-training, hallucinations, retrieval/tools, safety layers, evaluation, and serving/inference systems. Overall: **about 85% accurate**.

The article is pedagogically useful and mostly technically sound, but it is more accurately titled **“How Transformer LLM Architecture Works”** than “How LLMs Actually Work.” The article itself focuses on tokens, embeddings, positional encodings, attention, FFNs, residual streams, next-token prediction, KV cache, GQA, MoE, and related architecture details. ([0xkato][1])

## One-paragraph answer

The article mostly holds. It gives a good, practical explanation of the internal transformer stack: text becomes tokens, tokens become vectors, masked self-attention mixes information from prior context, feed-forward layers transform token representations, and the model generates by predicting the next token. Those points match standard explanations of tokenization, transformer attention, and autoregressive decoding. ([OpenAI Help Center][2]) The weak points are that it slightly overstates some intuitions — especially tokenization as the reason for counting failures, embedding-vector analogies, induction heads, and “facts in FFNs” — and it does not give enough space to post-training, hallucinations, tool use, retrieval, safety policies, and product-level systems, which strongly shape ChatGPT/Claude/Gemini-style assistants. ([arXiv][3])

## What the article gets right

* **Tokens first, text second.** LLMs process token IDs, not raw human text directly. OpenAI’s own docs describe tokens as chunks such as characters, words, punctuation, or partial words, and `tiktoken` is OpenAI’s BPE tokenizer library. ([OpenAI Help Center][2])

* **Subword tokenization is real and important.** The article’s BPE/SentencePiece discussion is broadly correct. SentencePiece is a standard subword tokenizer/detokenizer system used in neural text processing. ([ACL Anthology][4])

* **The transformer explanation is mostly correct.** The article’s Q/K/V, scaled dot-product attention, softmax-weighted value mixing, and causal masking are aligned with standard transformer explanations. ([nlp.seas.harvard.edu][5])

* **Multi-head attention is explained well.** It correctly says heads use learned projections rather than simply fixed slices of the vector. That is an important detail many beginner explainers get wrong. ([nlp.seas.harvard.edu][5])

* **RoPE is described in the right direction.** Rotary positional embeddings encode position through rotations and make relative-position information available inside self-attention. ([arXiv][6])

* **The FFN/MLP section is one of the stronger parts.** There is real research support for feed-forward layers acting partly like key-value memories and for mid-layer FF modules playing an important role in factual associations. ([ACL Anthology][7])

* **The article correctly separates architecture from weights.** The transformer skeleton is not the intelligence by itself; the learned weights, training data, objective, scale, post-training, and inference setup matter enormously. The article states this in simplified but valid form. ([0xkato][1])

* **GQA, KV cache, MoE, and speculative decoding are real modern inference/architecture ideas.** The article’s inclusion of these topics is a strength, not fluff. Mistral 7B, for example, explicitly uses grouped-query attention to speed inference and reduce memory during decoding. ([arXiv][8]) Mixtral uses sparse mixture-of-experts routing, where each token is routed to selected experts rather than all FFN experts. ([arXiv][9]) Speculative decoding is also a real method for accelerating autoregressive generation. ([arXiv][10])
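The sparse MoE routing mentioned in the last bullet can be sketched roughly as follows. This is a minimal illustration, not Mixtral's actual implementation: the function name `top2_moe`, the toy router weights, and the toy experts are all assumptions made for the example, and taking the softmax over only the two selected router logits is likewise an assumption of this sketch.

```python
import math

def top2_moe(x, router_weights, experts):
    """Route one token vector x through the 2 highest-scoring experts.

    router_weights: one weight vector per expert (router logit = dot(w, x)).
    experts: list of callables, each mapping a vector to a vector.
    All names here are hypothetical, chosen for illustration only.
    """
    # One router logit per expert.
    logits = [sum(w_i * x_i for w_i, x_i in zip(w, x)) for w in router_weights]
    # Keep only the two experts with the highest logits.
    top2 = sorted(range(len(logits)), key=lambda e: logits[e], reverse=True)[:2]
    # Softmax over the two selected logits only (assumption of this sketch).
    m = max(logits[e] for e in top2)
    exps = [math.exp(logits[e] - m) for e in top2]
    total = sum(exps)
    gates = [v / total for v in exps]
    # Output is the gate-weighted sum of the selected experts' outputs.
    out = [0.0] * len(x)
    for gate, e in zip(gates, top2):
        y = experts[e](x)
        out = [o + gate * y_i for o, y_i in zip(out, y)]
    return out, top2, gates

# Toy setup: 8 experts, each just scales its input by a different factor.
experts = [lambda v, s=e + 1: [s * v_i for v_i in v] for e in range(8)]
router_weights = [[float(e), 0.0] for e in range(8)]

out, chosen, gates = top2_moe([1.0, 0.0], router_weights, experts)
print(chosen)  # [7, 6]
```

Only two of the eight toy experts run for this token, which is the point of sparse routing: total parameter count can be large while the compute per token stays closer to that of a much smaller dense model.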

## What needs correction or nuance

* **It is not a full “how LLMs actually work” article.** It is mainly an architecture explainer. It does not deeply explain data curation, pretraining loss, scaling, optimizer dynamics, instruction tuning, RLHF/RLAIF/DPO, safety policies, system prompts, retrieval, tool use, evaluation, or deployment.

* **Tokenization is overused as an explanation for mistakes.** Tokenization helps explain why letter-level tasks like counting “r” in “strawberry” can be unintuitive, but it is not the whole story. Exact character counting also requires reliable symbolic/algorithmic behavior, not just better tokens. OpenAI’s docs support the token-level framing, but not the stronger claim that tokenization alone explains these failures. ([OpenAI Help Center][2])

* **The embedding analogy is useful but oversold.** “king − man + woman ≈ queen” is a classic word-vector intuition, but modern LLM behavior depends far more on contextual hidden states across many layers than on the initial embedding lookup alone. Word-vector research supports semantic vector relationships, but also notes limitations such as insensitivity to word order in simpler embedding systems. ([arXiv][11])

* **RoPE and long context are not magic.** RoPE helps with positional representation and relative-position behavior, but long-context models can still fail depending on where relevant information appears. The “Lost in the Middle” paper shows performance can degrade when relevant information is placed in the middle of long contexts. ([arXiv][6])

* **Induction heads are real, but the article overstates certainty.** Induction heads are an important mechanistic-interpretability finding, but the strongest causal evidence is in small attention-only models; for larger models with MLPs, evidence is more correlational and incomplete. ([arXiv][12])

* **“Facts live in FFNs” needs nuance.** FFNs contain a lot of model capacity and can encode factual associations, but LLM knowledge is distributed across layers, attention, FFNs, activations, and residual streams. ROME-style edits show important evidence, but they do not mean there is a clean database row for every fact. ([arXiv][13])

* **Post-training deserves more weight.** Modern assistants are not just pretrained next-token predictors. Instruction tuning and human/preference feedback strongly shape helpfulness, truthfulness, refusal behavior, tone, and task-following. InstructGPT showed that fine-tuning with human feedback improved alignment with user intent and helped reduce toxic/untruthful outputs, while still leaving mistakes. ([arXiv][3])

* **Hallucinations are under-explained.** The article mentions generation mechanics, but a serious “how LLMs work” explainer should explain why models can produce plausible but false claims. OpenAI describes hallucinations as plausible but false outputs that persist partly because evaluation setups often reward guessing over calibrated uncertainty. ([OpenAI][14])
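The tokenization point in the list above can be made concrete with a toy example. The subword split and the token IDs below are hypothetical and chosen only for illustration; a real tokenizer such as `tiktoken` may split the word differently.

```python
# Hypothetical subword pieces and made-up IDs, for illustration only.
# A real tokenizer may split this word differently.
TOY_PIECES = ["str", "aw", "berry"]
TOY_VOCAB = {"str": 101, "aw": 102, "berry": 103}

def encode(pieces):
    """Map subword pieces to the integer IDs the model actually receives."""
    return [TOY_VOCAB[p] for p in pieces]

def count_letter_in_text(text, letter):
    """Character-level view: trivial when the letters are visible."""
    return text.count(letter)

word = "".join(TOY_PIECES)
token_ids = encode(TOY_PIECES)

print(word)                             # strawberry
print(token_ids)                        # [101, 102, 103]
print(count_letter_in_text(word, "r"))  # 3
```

Counting is easy on the string, but the model's input is the ID list, where no individual letter is a visible unit. That makes letter-level tasks less direct, although, as the bullet above says, it is only part of the explanation for counting failures.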

## Claim-by-claim audit

| Article claim / theme                                                                      | Verdict                       | Why                                                                                                                                                                                                                                                                    | Severity |
| ------------------------------------------------------------------------------------------ | ----------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -------- |
| Text is converted into token IDs before the model processes it.                            | Correct                       | This matches OpenAI’s explanation that models process text as tokens, and `tiktoken` explicitly converts text into token IDs. ([OpenAI Help Center][2])                                                                                                                | Low      |
| BPE/SentencePiece-style subword tokenization explains why words are split strangely.       | Correct                       | BPE and SentencePiece are standard subword-tokenization approaches. The article’s examples are directionally right. ([GitHub][15])                                                                                                                                     | Low      |
| Letter-counting mistakes are mainly because the model sees tokens, not letters.            | Mostly correct but simplified | Tokenization is part of the explanation, but exact counting also requires robust algorithmic/symbolic behavior. The article makes this sound cleaner than it is.                                                                                                       | Medium   |
| Embedding vectors give token IDs learned meaning.                                          | Correct                       | The embedding lookup idea is standard. The article’s intuition is fine.                                                                                                                                                                                                | Low      |
| Similar concepts live near each other in vector space; “king − man + woman ≈ queen.”       | Mostly correct but simplified | This is a useful word-vector analogy, but it is not a precise guarantee for modern LLM token embeddings. Contextual representations after many layers matter more. ([arXiv][11])                                                                                       | Medium   |
| Self-attention needs positional information because attention alone has no sequence order. | Correct                       | Transformer attention needs position information to represent order. RoPE is a modern way to encode positional structure. ([nlp.seas.harvard.edu][5])                                                                                                                  | Low      |
| RoPE is widely used and helps relative-position behavior.                                  | Correct with nuance           | RoPE does incorporate relative-position dependency into attention, but long-context reliability is still not guaranteed. ([arXiv][6])                                                                                                                                  | Low      |
| Attention works through Q/K/V, scaled dot products, softmax weights, and value mixing.     | Correct                       | This matches the standard scaled dot-product attention formulation. ([nlp.seas.harvard.edu][5])                                                                                                                                                                        | Low      |
| Causal masking prevents a decoder model from looking at future tokens.                     | Correct                       | This is standard decoder self-attention behavior. ([nlp.seas.harvard.edu][5])                                                                                                                                                                                          | Low      |
| Multi-head attention uses learned projections, not fixed slices.                           | Correct                       | The article gets this important detail right. ([nlp.seas.harvard.edu][5])                                                                                                                                                                                              | Low      |
| Attention heads can become specialized circuits, including induction heads.                | Mostly correct but simplified | Induction heads are real, but their general role in large-model in-context learning is not fully settled. The source paper itself says the strongest causal evidence is in small attention-only transformers and larger-model evidence is more indirect. ([arXiv][12]) | Medium   |
| GQA and KV cache reduce inference memory/compute pressure.                                 | Correct                       | Mistral 7B explicitly uses grouped-query attention for faster inference and reduced memory during decoding. ([arXiv][8])                                                                                                                                               | Low      |
| Attention is expensive and scales badly with context length.                               | Correct with nuance           | Full attention has quadratic cost in sequence length; during cached autoregressive decoding, each new token attends over prior context, so the practical cost profile depends on prefill vs decode.                                                                    | Low      |
| FFNs expand, apply nonlinear transformation, and compress back.                            | Correct                       | This is the standard transformer MLP/FFN structure.                                                                                                                                                                                                                    | Low      |
| FFNs store semantic/factual structure.                                                     | Mostly correct but simplified | There is strong evidence FF layers act partly like key-value memories and mid-layer FF modules matter for factual associations, but knowledge is distributed, not a simple database. ([ACL Anthology][7])                                                              | Medium   |
| MoE models route tokens to selected experts.                                               | Correct                       | Mixtral uses sparse MoE layers where a router selects experts per token. ([arXiv][9])                                                                                                                                                                                  | Low      |
| Residual connections and normalization stabilize deep transformer training.                | Correct with simplification   | LayerNorm/RMSNorm are important stabilizers. RMSNorm keeps rescaling while dropping recentering, improving efficiency while remaining comparable. ([arXiv][16])                                                                                                        | Low      |
| LLM generation is next-token prediction repeated in a loop.                                | Correct                       | Autoregressive generation does work this way at inference. The nuance is that pretraining computes loss over many token positions efficiently in parallel.                                                                                                             | Low      |
| Decoding settings like temperature/top-p change output behavior.                           | Correct                       | This is standard sampling behavior. The article’s explanation is acceptable.                                                                                                                                                                                           | Low      |
| Speculative decoding can speed generation.                                                 | Correct                       | Speculative decoding is a real technique that can accelerate autoregressive sampling without changing the output distribution, under the method’s assumptions. ([arXiv][10])                                                                                           | Low      |
| Post-training exists after base pretraining.                                               | Correct but incomplete        | The article mentions it, but does not explain how much instruction tuning and human/preference feedback shape assistant behavior. ([arXiv][3])                                                                                                                         | Medium   |
| “Modern LLMs mostly share the same transformer-family skeleton.”                           | Mostly correct                | Broadly true for many major text LLMs, especially decoder-only transformer systems, but proprietary frontier details are not fully public and some newer systems use additional architectures, tools, retrieval, or multimodal components.                             | Medium   |
| Hallucinations are not central to the article.                                             | Incomplete                    | For a full “how LLMs work” explainer, hallucination mechanisms should be covered. OpenAI describes hallucinations as plausible but false model outputs and connects their persistence to training/evaluation incentives. ([OpenAI][14])                                | Medium   |

## Best corrected explanation

A modern text LLM usually starts by **tokenizing** input text into token IDs. Those IDs are mapped into learned vectors, combined with positional information, and passed through many transformer layers. Inside each layer, **masked self-attention** lets each token mix information from earlier tokens using learned query, key, and value projections; **feed-forward/MLP layers** then transform each token representation and provide much of the model’s capacity. Residual connections and normalization keep very deep stacks trainable. ([OpenAI Help Center][2])
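The masked self-attention step described above can be sketched in plain Python. This is a single-head toy version under simplifying assumptions: the query, key, and value vectors are passed in directly rather than produced by learned projections, and the function name `causal_attention` is made up for the example.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention(queries, keys, values):
    """Scaled dot-product attention with a causal mask (single head).

    queries, keys, values: one vector (list of floats) per token.
    Returns one output vector per token plus the attention weights.
    """
    d_k = len(keys[0])
    n = len(queries)
    outputs, all_weights = [], []
    for i, q in enumerate(queries):
        # Causal mask: position i only scores positions 0..i, never later ones.
        scores = [
            sum(a * b for a, b in zip(q, keys[j])) / math.sqrt(d_k)
            for j in range(i + 1)
        ]
        weights = softmax(scores)
        # Output is the softmax-weighted sum of the visible value vectors.
        out = [
            sum(w * values[j][c] for j, w in enumerate(weights))
            for c in range(len(values[0]))
        ]
        outputs.append(out)
        # Pad with zeros so each row shows the masked (future) positions.
        all_weights.append(weights + [0.0] * (n - i - 1))
    return outputs, all_weights

q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
outputs, weights = causal_attention(q, k, v)
print(weights[0])  # [1.0, 0.0, 0.0]
```

The first token can only attend to itself, so its weight row is entirely on position 0 and its output equals its own value vector. Later rows spread weight over earlier positions but always leave future positions at zero.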

During pretraining, the model’s weights are adjusted so it becomes better at predicting missing or next tokens across huge text/code/multimodal corpora. During inference, a decoder-style LLM repeatedly predicts a distribution over the next token, samples or selects one, appends it, and continues. KV caching, grouped-query attention, MoE routing, and speculative decoding are engineering/architecture techniques that make this process cheaper or faster. ([arXiv][8])
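The inference loop in the paragraph above can be sketched as follows. The `toy_logits` function is a hypothetical stand-in for a real transformer forward pass, and the vocabulary is made up; only the loop structure (score, sample, append, repeat) and the temperature/top-p handling are the point.

```python
import math
import random

VOCAB = ["the", "cat", "sat", "on", "mat", "<eos>"]

def toy_logits(context):
    """Hypothetical stand-in for a model: strongly favors the next VOCAB entry."""
    nxt = (context[-1] + 1) % len(VOCAB)
    return [4.0 if i == nxt else 0.0 for i in range(len(VOCAB))]

def sample_next(logits, temperature=1.0, top_p=1.0, rng=random):
    """Pick the next token ID using temperature scaling and top-p filtering."""
    if temperature == 0:
        # Greedy decoding: always take the highest-scoring token.
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Top-p: keep the smallest set of most likely tokens whose mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]

def generate(prompt_ids, max_new_tokens=10, temperature=1.0, top_p=1.0, seed=0):
    """Autoregressive loop: predict, sample, append, repeat until <eos>."""
    rng = random.Random(seed)
    ids = list(prompt_ids)
    eos = VOCAB.index("<eos>")
    for _ in range(max_new_tokens):
        next_id = sample_next(toy_logits(ids), temperature, top_p, rng)
        ids.append(next_id)
        if next_id == eos:
            break
    return ids

ids = generate([0], temperature=0)
print([VOCAB[i] for i in ids])  # ['the', 'cat', 'sat', 'on', 'mat', '<eos>']
```

With `temperature=0` the loop is deterministic. Raising the temperature flattens the distribution and makes less likely tokens more probable, while a smaller `top_p` cuts off the low-probability tail before sampling.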

But deployed assistants are more than base next-token predictors. They are shaped by instruction tuning, preference/RLHF-style training, system prompts, safety policies, retrieval, tools, and product orchestration. They do not store a clean database of facts; knowledge and skills are distributed across weights and activations. Because they generate plausible continuations rather than directly checking truth by default, they can reason impressively but also hallucinate. ([arXiv][3])

## Reusable short answer

The article is mostly right — I’d call it **about 85% accurate**. It explains transformer LLM architecture well: tokens, embeddings, attention, FFNs, residual/norm layers, next-token generation, KV cache, GQA, and MoE are mostly described correctly. The main weakness is that it over-simplifies a few mechanisms and under-explains post-training, hallucinations, tools/RAG, and how deployed AI assistants are actually shaped. So it is a good article, but more precisely it explains **how transformer LLM architecture works**, not the full story of modern AI assistants.

## Sources

Main sources used: the audited article itself; OpenAI tokenization docs and `tiktoken`; transformer attention explanations; RoPE; Lost in the Middle; Mistral/Mixtral architecture papers; FFN memory and ROME papers; RMSNorm; InstructGPT/RLHF; speculative decoding; and OpenAI’s hallucination discussion. ([0xkato][1])

[1]: https://www.0xkato.xyz/how-llms-actually-work/ "How LLMs Actually Work | 0xkato"
[2]: https://help.openai.com/en/articles/4936856-what-are-tokens-and-how-to-count-them "What are tokens and how to count them? | OpenAI Help Center"
[3]: https://arxiv.org/abs/2203.02155 "[2203.02155] Training language models to follow instructions with human feedback"
[4]: https://aclanthology.org/D18-2012/ "SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing - ACL Anthology"
[5]: https://nlp.seas.harvard.edu/2018/04/03/attention.html "The Annotated Transformer"
[6]: https://arxiv.org/abs/2104.09864 "[2104.09864] RoFormer: Enhanced Transformer with Rotary Position Embedding"
[7]: https://aclanthology.org/2021.emnlp-main.446/ "Transformer Feed-Forward Layers Are Key-Value Memories - ACL Anthology"
[8]: https://arxiv.org/html/2310.06825 "Mistral 7B"
[9]: https://arxiv.org/abs/2401.04088 "[2401.04088] Mixtral of Experts"
[10]: https://arxiv.org/abs/2211.17192 "[2211.17192] Fast Inference from Transformers via Speculative Decoding"
[11]: https://arxiv.org/abs/1310.4546 "[1310.4546] Distributed Representations of Words and Phrases and their Compositionality"
[12]: https://arxiv.org/abs/2209.11895 "[2209.11895] In-context Learning and Induction Heads"
[13]: https://arxiv.org/abs/2202.05262 "[2202.05262] Locating and Editing Factual Associations in GPT"
[14]: https://openai.com/index/why-language-models-hallucinate/ "Why language models hallucinate | OpenAI"
[15]: https://github.com/openai/tiktoken "GitHub - openai/tiktoken: tiktoken is a fast BPE tokeniser for use with OpenAI's models. · GitHub"
[16]: https://arxiv.org/abs/1910.07467 "[1910.07467] Root Mean Square Layer Normalization"











==================================================================================

In a different GPT environment, where GPT does not have the working method, I asked GPT the same thing again:

read this and tell me if it holds, or in % how much of it holds?
https://www.0xkato.xyz/how-llms-actually-work/
if you can't see the article I'll attach it

==================================================================================

GPT replied:

In short: **the article probably captures roughly 70–85% of the truth**, depending on how categorical its claims are. If its main message is:

> "LLM je v osnovi stroj za napoved naslednjega tokena, ki preko attention mehanizma gradi kompleksno vedenje."
> (An LLM is basically a next-token prediction machine that builds complex behavior through the attention mechanism.)

then that is fairly accurate. ([Knovo][1])

But if the article claims or implies:

> "LLM je *samo* next-token predictor in nič več."
> (An LLM is *only* a next-token predictor and nothing more.)

then things become considerably more questionable, and I would say **50–60% correct**. Even among researchers there are more and more arguments that this is technically true at the level of the training objective, but is not a good explanation of what the model has actually learned. ([arXiv][2])

My assessment by individual point:

| Claim                                              | My rating                                         |
| -------------------------------------------------- | ------------------------------------------------- |
| An LLM generates text token by token               | 100%                                              |
| Training is based on next-token prediction         | 100%                                              |
| Attention is the key mechanism for context         | 95%                                               |
| The model has no fact database like a SQL database | 95%                                               |
| "Understanding" arises as an emergent property     | 80–90%                                            |
| An LLM is just a very advanced autocomplete        | 40–60%                                            |
| There is no internal world model inside it         | 20–50%                                            |
| Reasoning is merely an illusion                    | 30–70% (depending on the definition of reasoning) |

The biggest problem with articles like this is usually a conflation:

**training objective ≠ internal structure**

This is similar to:

> "A human is just an organism that maximizes reproduction."

In the evolutionary sense that is true. But you cannot explain mathematics, music, philosophy, or science from it.

Similarly:

> "An LLM predicts the next token."

That is true.

But it does not automatically follow that no abstract representation of concepts, spatial relations, logic, causality, and so on has formed in the network. Quite the opposite: a lot of research shows that very rich internal representations form during training. ([arXiv][3])

Given what you showed me yesterday for RHPr, RHP, and AIm3, this is probably the part that would stand out to you most:

The article probably describes the **mechanics of generation**, not the **nature of representation**.

It is like describing the brain with:

> "A neuron fires the next action potential."

Technically correct.

But it does not explain how mathematics, consciousness, or creativity arise from that.

If you paste the original article (or a PDF or HTML), I can read it in full and give you a much more precise assessment along these lines:

* what holds,
* what is simplified,
* what is outdated,
* what is disputed among researchers,
* and above all how it does or does not fit your RHPr/AIm3 view.

[1]: https://www.knovo.dev/guides/how-llms-work "How LLMs actually work: transformers, tokens, and attention explained (2026) | Knovo"
[2]: https://arxiv.org/abs/2408.04666 "LLMs are Not Just Next Token Predictors"
[3]: https://arxiv.org/abs/2408.13442 "A Law of Next-Token Prediction in Large Language Models"