RHPm · Multi‑LLM score arena · June 2026

How LLMs Actually Work — Scoreboard

A dark HTML comparison report for the RHPm multi‑LLM article experiment: original answers, Round‑2 critique, Round‑3 validation, final hybrid-of-hybrids, and the DeepSeek chat vs API split.

13 reviewers · 3 rounds · DeepSeek counted twice: chat + API · Final hybrid score: 97/100

Short verdict: yes, the original comparison report was too small. This page is the missing “ugly scoreboard” made readable: who scored what, where the hybrid improved, and where the models disagreed.

Round‑1 top original
97

Claude Fable 5 Max had the strongest original expert answer.

Round‑3 mean v2 score
92.7

Hybrid v2 became the consensus best base before the v3/public trim.

Final hybrid-of-hybrids
97

The final page combines Claude’s reader flow with GPT’s method discipline.

Round 1 — original same-prompt answers

These are our synthesis scores for each model’s first answer to the same RHPm-generated prompt. This is not “which model is best forever”; it is which response was most useful for this article build.

| Rank | Model | Score | Verdict | Contribution kept |
|---|---|---|---|---|
| 1 | Claude Fable 5 Max | 97 | Best deep technical/system contribution | Post-training, prompt injection, effective context, RLVR/reasoning compute, inference economics. |
| 2 | GPT 5.5 Extended Pro | 96 | Best balanced article base | Clean public draft with sources, transformer core, assistant stack, myth-busting. |
| 3 | Kimi K2.6 Thinking | 94 | Best system-around-engine framing | Request-time pipeline: context assembly, tools, RAG, safety, output loop. |
| 4 | Grok 4.3 Expert | 93 | Strongest blunt myth-busting | Corrected “stochastic parrot”, facts-as-database, RLHF truth myths. |
| 5 | DeepSeek V4 Expert | 91 | Very strong clear explainer | Strong model-vs-product framing and practical analogy; some source checks needed. |
| 6 | Qwen 3.7 Plus | 90 | Strong technical/editorial contributor | Attention/RoPE/SwiGLU notes, self-critique, source-needed checklist. |
| 7 | MiniMax M3 Thinking | 89 | Good structured systems table | Useful component table and inference pipeline. |
| 8 | Gemini 3.1 Pro | 88 | Readable and well structured | Good phase framing and engine-vs-assistant separation. |
| 9 | GLM 5 DeepThink | 86 | Clear lay explanation | Accessible tokens/embeddings/attention explanation. |
| 10 | DeepSeek via API key at RHPm | 85 | Good API-run confirmation and usable draft | Car/wheel-turner analogy; less source discipline; product claims need softening. |
| 11 | MetaAI Thinking | 82 | Solid but generic | Clean basic model-vs-system split. |
| 12 | Mistral LaChat Think | 80 | Readable but weaker synthesis material | Accessible language and myth list. |
| 13 | Copilot Deeper Insights | 74 | Useful concise summary, weakest full answer | Short, compressed, weaker depth/source quality. |

Round 2 — models review Hybrid v1

Round 2 was the stress test. Several models liked the hybrid, but Claude, DeepSeek V4, and MiniMax exposed the key flaw: v1 was cleaner as a public draft but had lost too much technical/audit depth.

| Reviewer | Hybrid v1 score | Own original score | Self delta | Usefulness | Most useful signal |
|---|---|---|---|---|---|
| GPT 5.5 Extended Pro | 90 | 86 | +8% | 92 | Praised structure; asked for citations, softer product claims, clearer BD workflow framing. |
| Claude Fable 5 Max | 78 | 84 | −5–10% | 98 | Most valuable critique: add prompt injection, hallucination incentives, base-model example, no-introspective-authority, CoT faithfulness, stronger math. |
| Gemini 3.1 Pro | 92 | 84 | +10% | 87 | Supported structure; wanted product-stack clarity and source discipline. |
| Grok 4.3 Expert | 88 | 82 | +12–15% | 89 | Wanted residual stream, normalization, GQA, equations, hallucination-objective nuance. |
| DeepSeek V4 Expert | 82 | 92 | −10–12% | 93 | Warned v1 lost technical rigor; pushed equations, hedges, source checks. |
| Qwen 3.7 Plus | 92 | 85 | +15% | 91 | Requested attention notation, PagedAttention, batching, KV-cache details. |
| Kimi K2.6 Thinking | 85 | 78 | +10–15% | 94 | Wanted math forward pass, source specificity, self-critique, audit transparency. |
| GLM 5 DeepThink | 92 | 86 | +15% | 88 | Suggested in-context learning, scaling-law note, inference detail. |
| MiniMax M3 Thinking | 83 | 88 | −5% | 95 | Sharp editor critique: add prompt injection, CoT faithfulness, base-model example, hallucination incentives, no-introspective-authority. |
| MetaAI Thinking | 87 | 79 | +10% | 86 | Wanted self-critique, effective-context caution, source-check specificity. |
| Mistral LaChat Think | 92 | 88 | +5% | 90 | Requested more technical depth, multimodality mechanics, self-critique. |
| Copilot Deeper Insights | 90 | 78 | +15–20% | 88 | Citation-light; asked for training-vs-inference distinction and product-feature softening. |
| DeepSeek via API key at RHPm | 92 | 85 | +8% | 89 | Praised balance and BD separation; suggested interpretability primer and context-window source checks. |
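
As a quick cross-check of the Round 2 reading above, the sketch below recomputes which reviewers scored Hybrid v1 below their own original. The scores are copied from the Round 2 table; the names `round2` and `dissenters` are illustrative, not part of the RHPm tooling. It uses raw point differences and does not try to reproduce the "Self delta" percentages, which are kept as listed in the table.

```python
# Round 2 scores copied from the table: (score given to Hybrid v1, own-original score).
round2 = {
    "GPT 5.5 Extended Pro": (90, 86),
    "Claude Fable 5 Max": (78, 84),
    "Gemini 3.1 Pro": (92, 84),
    "Grok 4.3 Expert": (88, 82),
    "DeepSeek V4 Expert": (82, 92),
    "Qwen 3.7 Plus": (92, 85),
    "Kimi K2.6 Thinking": (85, 78),
    "GLM 5 DeepThink": (92, 86),
    "MiniMax M3 Thinking": (83, 88),
    "MetaAI Thinking": (87, 79),
    "Mistral LaChat Think": (92, 88),
    "Copilot Deeper Insights": (90, 78),
    "DeepSeek via API key at RHPm": (92, 85),
}


def dissenters(scores):
    """Return reviewers who rated Hybrid v1 lower than their own original answer."""
    return sorted(name for name, (v1, own) in scores.items() if v1 < own)


if __name__ == "__main__":
    for name in dissenters(round2):
        v1, own = round2[name]
        print(f"{name}: v1 {v1} vs own original {own} ({v1 - own:+d} points)")
```

On these numbers the only reviewers below their own original are Claude Fable 5 Max, DeepSeek V4 Expert, and MiniMax M3 Thinking, which matches the three critics named in the paragraph.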

Round 3 — models review Hybrid v2

After adding technical depth, citations/source anchors, prompt injection, hallucination incentives, CoT-faithfulness caveats, inference engineering and self-critique, Hybrid v2 became the clear consensus base.

| Reviewer | Hybrid v2 | Hybrid v1 | Own Round‑1 | Main signal |
|---|---|---|---|---|
| GPT 5.5 Extended Pro | 94 | 90 | 86 | Best base; fix equation/soften hallucination incentives. |
| Claude Fable 5 Max | 88 | 78 | 84 | Crossover happened; add KV numbers and acronym definitions. |
| Gemini 3.1 Pro | 96 | 92 | 84 | Strong; trim meta-sections, add tokenizer/MoE/KV specifics. |
| Grok 4.3 Expert | 94 | 88 | 82 | Publish after small cuts; keep structure and BD separation. |
| DeepSeek V4 Expert | 93 | 82 | 92 | v2 beats original as public article; wants more technical precision. |
| Qwen 3.7 Plus | 95 | 92 | 85 | Definitive version; fix citations and ledger weight. |
| Kimi K2.6 Thinking | 91 | 85 | 78 | Best version; add interpretability, emergence caveat, date stamp. |
| GLM 5 DeepThink | 95 | 92 | 86 | Canonical reference; add residual/KV/quantization details. |
| MiniMax M3 Thinking | 84 | 83 | 88 | Only major dissent; meta ledger and human table hurt publication credibility. |
| MetaAI Thinking | 92 | 87 | 79 | Best so far; cut ledger, add product hedges and source checks. |
| Mistral LaChat Think | 97 | 92 | 88 | Near-flawless after small fixes. |
| Copilot Deeper Insights | 94 | 90 | 78 | Publishable with minor edits; add decoding/MoE/tool failure details. |
| DeepSeek via API | 92 | n/a | 85 | Hybrid better; wants interpretability and tokenization details. |
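
The 92.7 headline figure can be reproduced from the Round 3 table. The sketch below is a minimal check, assuming the mean is a plain unweighted average over all 13 reviewers; the names `round3`, `mean_v2`, and `gains` are illustrative. The DeepSeek via API row has no Hybrid v1 score in this table, so it is left out of the per-reviewer gain.

```python
from statistics import mean

# Round 3 scores copied from the table: (Hybrid v2, Hybrid v1); None marks the "n/a" cell.
round3 = {
    "GPT 5.5 Extended Pro": (94, 90),
    "Claude Fable 5 Max": (88, 78),
    "Gemini 3.1 Pro": (96, 92),
    "Grok 4.3 Expert": (94, 88),
    "DeepSeek V4 Expert": (93, 82),
    "Qwen 3.7 Plus": (95, 92),
    "Kimi K2.6 Thinking": (91, 85),
    "GLM 5 DeepThink": (95, 92),
    "MiniMax M3 Thinking": (84, 83),
    "MetaAI Thinking": (92, 87),
    "Mistral LaChat Think": (97, 92),
    "Copilot Deeper Insights": (94, 90),
    "DeepSeek via API": (92, None),
}

# Unweighted mean of the Hybrid v2 column across all reviewers.
mean_v2 = mean(v2 for v2, _ in round3.values())

# Point gain from v1 to v2, skipping rows without a v1 score.
gains = {name: v2 - v1 for name, (v2, v1) in round3.items() if v1 is not None}

if __name__ == "__main__":
    print(f"Mean Hybrid v2 score: {mean_v2:.1f}")
    for name, gain in sorted(gains.items(), key=lambda item: -item[1]):
        print(f"{name}: {gain:+d}")
```

Every reviewer with both scores rated v2 above v1; the largest gains come from DeepSeek V4 Expert and Claude Fable 5 Max, the smallest from MiniMax M3 Thinking.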

Final hybrid-of-hybrids comparison

The final public page was not simply v3. We compared the GPT final page with the Claude final page, then built a hybrid of those two hybrids.

| Version | Score | Why |
|---|---|---|
| Claude HTML | 95 | Best reader-facing article: stronger flow, examples, references, and three-layer framing. |
| GPT HTML | 92 | Better method/source-spine structure and BD workflow separation, but less polished as a public page. |
| Hybrid of hybrids | 97 | Uses Claude as public article base, imports GPT method discipline, adds EN/SL toggle and BD/RHPm separation. |

DeepSeek chat vs DeepSeek via API

DeepSeek appears twice in every table and should stay listed twice. The chat version and the RHPm API version behaved differently enough to be useful as separate signals.

| Checkpoint | DeepSeek V4 Expert chat | DeepSeek via RHPm API | Interpretation |
|---|---|---|---|
| Round 1 original-answer score | 91 | 85 | Chat V4 Expert was stronger: more complete and stricter technically. API run was usable but thinner. |
| Round 2: score given to Hybrid v1 | 82 | 92 | Chat DeepSeek was much harsher and exposed real lost rigor. API DeepSeek was more positive/agreement-prone. |
| Round 2: own-original score | 92 | 85 | Chat DeepSeek rated its own answer far higher; API run was more modest. |
| Round 3: score given to Hybrid v2 | 93 | 92 | Both validated v2. Chat barely preferred v2 to its own original; API clearly preferred v2 to its original. |

Practical verdict: DeepSeek V4 Expert chat was the stronger reviewer and answer-writer. DeepSeek via API was valuable as a convenience/local-workflow test and produced a usable draft, but it was less rigorous and less source-disciplined in this run. For article quality and red-team critique, use chat. For fast local RHPm workflow testing, use API.
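
The chat/API split can be expressed as simple point differences. The sketch below copies the four checkpoints from the DeepSeek comparison table; `checkpoints`, `chat_minus_api`, and `v2_margin` are illustrative names, and treating "preferred v2 to its own original" as the Round 3 score minus the Round 2 own-original score is an assumption about how that comparison was read.

```python
# Checkpoints copied from the comparison table: (DeepSeek V4 Expert chat, DeepSeek via RHPm API).
checkpoints = {
    "Round 1 original-answer score": (91, 85),
    "Round 2: score given to Hybrid v1": (82, 92),
    "Round 2: own-original score": (92, 85),
    "Round 3: score given to Hybrid v2": (93, 92),
}

# Positive values mean the chat run scored higher than the API run at that checkpoint.
chat_minus_api = {name: chat - api for name, (chat, api) in checkpoints.items()}


def v2_margin(column):
    """Points by which a run rated Hybrid v2 above its own original (0 = chat, 1 = API)."""
    v2_score = checkpoints["Round 3: score given to Hybrid v2"][column]
    own_score = checkpoints["Round 2: own-original score"][column]
    return v2_score - own_score


if __name__ == "__main__":
    for name, diff in chat_minus_api.items():
        print(f"{name}: chat minus API = {diff:+d}")
    print(f"v2 over own original: chat {v2_margin(0):+d}, API {v2_margin(1):+d}")
```

Read this way, the chat run put v2 one point above its own original while the API run put it seven points above, and the Hybrid v1 checkpoint is the only one where the API run scored higher than chat.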

Method notes

  • Round‑1 scores are our external synthesis scores, not each model’s self-score.
  • Round‑2 and Round‑3 scores are reviewer scores extracted from model feedback.
  • The score tables are method evidence. They should not dominate the public article body, which is why the final article links here instead of embedding all tables.
  • The human baseline comparison belongs in the article as a scope comparison, not as a bragging table.
