Short verdict: yes, the original comparison report was too small. This page is the missing “ugly scoreboard” made readable: who scored what, where the hybrid improved, and where the models disagreed.
Claude Fable 5 Max had the strongest original expert answer.
Hybrid v2 became the consensus best base before the v3/public trim.
The final page combines Claude’s reader flow with GPT’s method discipline.
Round 1 — original same-prompt answers
These are our synthesis scores for each model’s first answer to the same RHPm-generated prompt. This is not “which model is best forever”; it is which response was most useful for this article build.
| Rank | Model | Score | Verdict | Contribution kept |
|---|---|---|---|---|
| 1 | Claude Fable 5 Max | | Best deep technical/system contribution | Post-training, prompt injection, effective context, RLVR/reasoning compute, inference economics. |
| 2 | GPT 5.5 Extended Pro | | Best balanced article base | Clean public draft with sources, transformer core, assistant stack, myth-busting. |
| 3 | Kimi K2.6 Thinking | | Best system-around-engine framing | Request-time pipeline: context assembly, tools, RAG, safety, output loop. |
| 4 | Grok 4.3 Expert | | Strongest blunt myth-busting | Corrected “stochastic parrot”, facts-as-database, RLHF truth myths. |
| 5 | DeepSeek V4 Expert | | Very strong clear explainer | Strong model-vs-product framing and practical analogy; some source checks needed. |
| 6 | Qwen 3.7 Plus | | Strong technical/editorial contributor | Attention/RoPE/SwiGLU notes, self-critique, source-needed checklist. |
| 7 | MiniMax M3 Thinking | | Good structured systems table | Useful component table and inference pipeline. |
| 8 | Gemini 3.1 Pro | | Readable and well structured | Good phase framing and engine-vs-assistant separation. |
| 9 | GLM 5 DeepThink | | Clear lay explanation | Accessible tokens/embeddings/attention explanation. |
| 10 | DeepSeek via API key at RHPm | | Good API-run confirmation and usable draft | Car/wheel-turner analogy; less source discipline; product claims need softening. |
| 11 | MetaAI Thinking | | Solid but generic | Clean basic model-vs-system split. |
| 12 | Mistral LaChat Think | | Readable but weaker synthesis material | Accessible language and myth list. |
| 13 | Copilot Deeper Insights | | Useful concise summary, weakest full answer | Short, compressed, weaker depth/source quality. |
Round 2 — models review Hybrid v1
Round 2 was the stress test. Several models liked the hybrid, but Claude, DeepSeek V4, and MiniMax exposed the key flaw: v1 was cleaner as a public draft but had lost too much technical/audit depth.
| Reviewer | Hybrid v1 score | Own original score | Self delta | Usefulness | Most useful signal |
|---|---|---|---|---|---|
| GPT 5.5 Extended Pro | | | +8% | 92 | Praised structure; asked for citations, softer product claims, clearer BD workflow framing. |
| Claude Fable 5 Max | | | −5–10% | 98 | Most valuable critique: add prompt injection, hallucination incentives, base-model example, no-introspective-authority, CoT faithfulness, stronger math. |
| Gemini 3.1 Pro | | | +10% | 87 | Supported structure; wanted product-stack clarity and source discipline. |
| Grok 4.3 Expert | | | +12–15% | 89 | Wanted residual stream, normalization, GQA, equations, hallucination-objective nuance. |
| DeepSeek V4 Expert | | | −10–12% | 93 | Warned v1 lost technical rigor; pushed equations, hedges, source checks. |
| Qwen 3.7 Plus | | | +15% | 91 | Requested attention notation, PagedAttention, batching, KV-cache details. |
| Kimi K2.6 Thinking | | | +10–15% | 94 | Wanted math forward pass, source specificity, self-critique, audit transparency. |
| GLM 5 DeepThink | | | +15% | 88 | Suggested in-context learning, scaling-law note, inference detail. |
| MiniMax M3 Thinking | | | −5% | 95 | Sharp editor critique: add prompt injection, CoT faithfulness, base-model example, hallucination incentives, no-introspective-authority. |
| MetaAI Thinking | | | +10% | 86 | Wanted self-critique, effective-context caution, source-check specificity. |
| Mistral LaChat Think | | | +5% | 90 | Requested more technical depth, multimodality mechanics, self-critique. |
| Copilot Deeper Insights | | | +15–20% | 88 | Citation-light; asked for training-vs-inference distinction and product-feature softening. |
| DeepSeek via API key at RHPm | | | +8% | 89 | Praised balance and BD separation; suggested interpretability primer and context-window source checks. |
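The "Self delta" column compares what a reviewer gave Hybrid v1 with what it gave its own original answer. The page does not state the exact formula, so the sketch below assumes a relative percentage change. The function names `self_delta_percent` and `format_delta` and the sample scores are hypothetical, chosen only for illustration; they are not values from this ledger.

```python
def self_delta_percent(hybrid_score: float, own_score: float) -> float:
    """Relative change from a reviewer's own-original score to the hybrid score.

    Assumption: the delta is (hybrid - own) / own, expressed in percent.
    The ledger itself does not confirm this definition.
    """
    if own_score <= 0:
        raise ValueError("own_score must be positive")
    return (hybrid_score - own_score) / own_score * 100.0


def format_delta(delta: float) -> str:
    """Render a delta with an explicit sign and no decimals, e.g. '+10%'."""
    return f"{delta:+.0f}%"


# Illustrative, made-up scores (not taken from the tables):
# a reviewer rates its own answer 80 and the hybrid 88.
example = format_delta(self_delta_percent(88, 80))
```

A positive delta under this assumption means the reviewer rated the hybrid above its own answer; a negative one means it preferred its own original.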
Round 3 — models review Hybrid v2
After adding technical depth, citations/source anchors, prompt injection, hallucination incentives, CoT-faithfulness caveats, inference engineering and self-critique, Hybrid v2 became the clear consensus base.
| Reviewer | Hybrid v2 | Hybrid v1 | Own Round‑1 | Main signal |
|---|---|---|---|---|
| GPT 5.5 Extended Pro | | | | Best base; fix equation/soften hallucination incentives. |
| Claude Fable 5 Max | | | | Crossover happened; add KV numbers and acronym definitions. |
| Gemini 3.1 Pro | | | | Strong; trim meta-sections, add tokenizer/MoE/KV specifics. |
| Grok 4.3 Expert | | | | Publish after small cuts; keep structure and BD separation. |
| DeepSeek V4 Expert | | | | v2 beats original as public article; wants more technical precision. |
| Qwen 3.7 Plus | | | | Definitive version; fix citations and ledger weight. |
| Kimi K2.6 Thinking | | | | Best version; add interpretability, emergence caveat, date stamp. |
| GLM 5 DeepThink | | | | Canonical reference; add residual/KV/quantization details. |
| MiniMax M3 Thinking | | | | Only major dissent; meta ledger and human table hurt publication credibility. |
| MetaAI Thinking | | | | Best so far; cut ledger, add product hedges and source checks. |
| Mistral LaChat Think | | | | Near-flawless after small fixes. |
| Copilot Deeper Insights | | | | Publishable with minor edits; add decoding/MoE/tool failure details. |
| DeepSeek via API | | n/a | | Hybrid better; wants interpretability and tokenization details. |
Final hybrid-of-hybrids comparison
The final public page was not simply v3. The GPT final page and the Claude final page were compared first, and the published page is a hybrid of those two hybrids.
| Version | Score | Why |
|---|---|---|
| Claude HTML | | Best reader-facing article: stronger flow, examples, references, and three-layer framing. |
| GPT HTML | | Better method/source-spine structure and BD workflow separation, but less polished as a public page. |
| Hybrid of hybrids | | Uses Claude as public article base, imports GPT method discipline, adds EN/SL toggle and BD/RHPm separation. |
DeepSeek chat vs DeepSeek via API
DeepSeek appears twice and should stay listed twice: the chat version and the RHPm API version behaved differently enough to be useful as separate signals.
| Checkpoint | DeepSeek V4 Expert chat | DeepSeek via RHPm API | Interpretation |
|---|---|---|---|
| Round 1 original-answer score | | | Chat V4 Expert was stronger: more complete and stricter technically. API run was usable but thinner. |
| Round 2: score given to Hybrid v1 | | | Chat DeepSeek was much harsher and exposed real lost rigor. API DeepSeek was more positive/agreement-prone. |
| Round 2: own-original score | | | Chat DeepSeek rated its own answer far higher; API run was more modest. |
| Round 3: score given to Hybrid v2 | | | Both validated v2. Chat barely preferred v2 to its own original; API clearly preferred v2 to its original. |
Practical verdict: DeepSeek V4 Expert chat was the stronger reviewer and answer-writer. DeepSeek via API was valuable as a convenience/local-workflow test and produced a usable draft, but it was less rigorous and less source-disciplined in this run. For article quality and red-team critique, use chat. For fast local RHPm workflow testing, use API.
Method notes
- Round‑1 scores are our external synthesis scores, not each model’s self-score.
- Round‑2 and Round‑3 scores are reviewer scores extracted from model feedback.
- The score tables are method evidence. They should not dominate the public article body, which is why the final article links here instead of embedding all tables.
- The human baseline comparison belongs in the article as a scope comparison, not as a bragging table.