AIm³ MentalArena
A human-led multi-LLM council OS plus a protocol fitness harness for improving the external reasoning workflow around today’s LLMs.
What it is
MentalArena is not a new LLM and not an AGI claim. It is an external thinking/workflow layer. It coordinates multiple LLMs through structured rounds, preserves memory and lineage, extracts conflicts, asks targeted follow-ups, gathers judge opinions, and produces a final synthesis or builder prompt.
1. Council Engine
The heart of the system. It runs the actual thinking loop:
BD seed → prompt expansion → LLM answers → conflict extraction → targeted critique → judge panel → final synthesis → next build/test
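A minimal sketch of how the loop above could be chained while recording lineage, assuming each stage is a function that takes and returns a plain dictionary. The names `STAGES`, `run_council`, and `passthrough` are hypothetical and are not part of the MentalArena code base; the placeholder handlers stand in for real provider calls.

```python
import hashlib
import json

# Stage names follow the loop described above; handlers are placeholders (assumption).
STAGES = [
    "prompt_expansion",
    "llm_answers",
    "conflict_extraction",
    "targeted_critique",
    "judge_panel",
    "final_synthesis",
]


def digest(payload):
    """Return a stable SHA-256 hex digest of a JSON-serialisable payload."""
    blob = json.dumps(payload, sort_keys=True, ensure_ascii=False).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()


def run_council(seed, handlers):
    """Pass the state through every stage and record a hash-linked lineage entry per stage."""
    state = {"seed": seed}
    lineage = [{"stage": "seed", "hash": digest(state), "parent": None}]
    for stage in STAGES:
        state = handlers[stage](state)
        lineage.append(
            {"stage": stage, "hash": digest(state), "parent": lineage[-1]["hash"]}
        )
    return state, lineage


def passthrough(stage):
    """Build a dummy handler that only marks the stage as done (illustration only)."""
    def handler(state):
        return {**state, stage: f"<{stage} output>"}
    return handler


final_state, lineage = run_council(
    "Example seed", {name: passthrough(name) for name in STAGES}
)
for entry in lineage:
    print(entry["stage"], entry["hash"][:12])
```

Linking each entry to its parent hash means a later edit to any intermediate artifact shows up as a broken chain, which is one simple way to keep lineage checkable.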
2. Protocol Fitness Harness
The measuring / CI / anti-bullshit layer. It tests protocol genes, baselines, ablations, holdout tasks, scoring regressions, and report quality. It does not replace LLM judgement.
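As an illustration of a scoring-regression check, the sketch below compares candidate protocol scores against a stored baseline. The metric names, the scores, and the `tolerance` value are invented for the example and do not come from the actual harness.

```python
def find_regressions(baseline, candidate, tolerance=0.02):
    """Return tasks whose candidate score is missing or drops below baseline - tolerance."""
    regressions = {}
    for task, base_score in baseline.items():
        new_score = candidate.get(task)
        if new_score is None or new_score < base_score - tolerance:
            regressions[task] = (base_score, new_score)
    return regressions


# Illustrative numbers only (assumption), not measured results.
baseline = {"conflict_recall": 0.80, "report_quality": 0.70}
candidate = {"conflict_recall": 0.81, "report_quality": 0.60}

regressions = find_regressions(baseline, candidate)
print(regressions)
```

A check like this only flags that a score moved; deciding whether the change matters still belongs to the judges and to BD.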
3. Local Lab
A private working UI with API keys, local logs, provider calls, council runs, and exports. It is separate from this public page.
4. Public Page
This page explains the architecture only. It should never contain real API keys, private logs, or local working transcripts.
Workflow
1. BD writes a seed/problem.
2. MentalArena expands it using Plain / RHP / RHPr / RHPm / Council profiles.
3. Selected LLMs answer.
4. Python stores answers and builds a conflict map.
5. Round 2 asks targeted critique questions.
6. GPT/Claude/Gemini or other judges score and critique.
7. Final synthesis produces a document, builder prompt, code spec, or test plan.
8. Protocol Harness checks regressions and weak spots.
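Steps 4 and 5 can be sketched as follows, assuming each model's answer has already been reduced to a set of claim ids with a stance. That normalised shape, and the names `build_conflict_map` and `critique_questions`, are assumptions made for this example; the real conflict extraction works on free-text answers and may look quite different.

```python
from collections import defaultdict


def build_conflict_map(answers):
    """Map each claim id to per-model stances, keeping only claims with disagreement."""
    stances = defaultdict(dict)
    for model, claims in answers.items():
        for claim_id, stance in claims.items():
            stances[claim_id][model] = stance
    return {
        claim_id: by_model
        for claim_id, by_model in stances.items()
        if len(set(by_model.values())) > 1
    }


def critique_questions(conflicts):
    """Turn each contested claim into one targeted Round 2 question."""
    questions = []
    for claim_id, by_model in sorted(conflicts.items()):
        positions = "; ".join(f"{m}: {s}" for m, s in sorted(by_model.items()))
        questions.append(
            f"Claim '{claim_id}' is contested ({positions}). "
            "What evidence would settle it?"
        )
    return questions


# Hypothetical, already-normalised answers from three models.
answers = {
    "model_a": {"cache_needed": "yes", "use_sqlite": "yes"},
    "model_b": {"cache_needed": "no", "use_sqlite": "yes"},
    "model_c": {"cache_needed": "yes"},
}

conflicts = build_conflict_map(answers)
for question in critique_questions(conflicts):
    print(question)
```

Claims on which all models agree are dropped from the map, so Round 2 spends its budget only on real disagreement.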
What Python does and does not judge
Python can measure structure: next actions, falsifiers, missing fields, repetition, provenance, logs, hashes, and basic anti-handwave checks. Python should not pretend to know deep truth alone. Semantic judgement comes from LLM judges plus BD’s final override.
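A rough sketch of the kind of structural check described above, assuming each answer arrives as a dictionary with a `text` field plus a few required fields. The field names in `REQUIRED_FIELDS` and the function `structural_report` are hypothetical; nothing here attempts to judge whether the answer is true.

```python
import hashlib
import re

# Hypothetical required fields; the real schema may differ.
REQUIRED_FIELDS = ("next_action", "falsifier", "sources")


def structural_report(answer):
    """Report missing fields, repeated sentences, and a content hash for one answer."""
    text = answer.get("text", "")
    missing = [field for field in REQUIRED_FIELDS if not answer.get(field)]
    # Split on sentence-ending punctuation and normalise case for a crude repetition count.
    sentences = [s.strip().lower() for s in re.split(r"[.!?]+", text) if s.strip()]
    repeated = len(sentences) - len(set(sentences))
    return {
        "missing_fields": missing,
        "repeated_sentences": repeated,
        "sha256": hashlib.sha256(text.encode("utf-8")).hexdigest(),
    }


sample = {
    "text": "Ship the cache. Ship the cache. Measure latency after.",
    "next_action": "Benchmark with and without cache",
    "falsifier": "",
    "sources": ["run-log-1"],
}

report = structural_report(sample)
print(report["missing_fields"], report["repeated_sentences"])
```

The report says that a falsifier is absent and that one sentence is repeated; whether the proposed next action is any good remains a question for the LLM judges and BD.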
Status
v0.6 splits the project into a clean public architecture page, a private Local Lab, and the existing Protocol Harness. The next proof step is real council runs using DeepSeek first, then manual or API-imported GPT/Claude/Gemini judge rounds.
- Current practical connector: DeepSeek-first local API path.
- Top-10 provider structure prepared: ChatGPT, Claude, DeepSeek, Gemini, Grok, GLM, MiniMax, Mistral, Kimi, Qwen.
- Real API keys stay local only.
Local Lab
The working lab is not meant to be public-hosted with keys. In the v0.6 package, open:
local_lab/RUN_LOCAL_MENTALARENA_LAB.bat
http://127.0.0.1:8787/AIM3_MentalArena_Lab.html
For future Netlify deployment, provider calls should go through Netlify Functions and secrets, not browser-exposed keys.
v0.7 direction — Project Context Packs
New design rule: API models should not reason from a bare prompt when BD project materials matter. The private Lab can attach compact JSON context packs from local_lab/context/ so each LLM receives the same distilled project brain.
Context packs are small machine-readable capsules, not huge file dumps. They preserve source manifests, CTX ids, core principles, known failures, and current baselines. The current proof target remains narrow: conflict extraction must save BD time and preserve useful disagreement before larger automation is justified.
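One possible shape for a context pack, shown as a sketch. The key names, the `CTX-001` id format, the file names in the manifest, and the 8 KB size cap are all assumptions chosen for illustration; only the listed categories (source manifest, CTX ids, core principles, known failures, current baselines) come from the description above.

```python
import json

# Hypothetical key names mirroring the categories described above.
REQUIRED_KEYS = {
    "ctx_id",
    "source_manifest",
    "core_principles",
    "known_failures",
    "current_baselines",
}


def validate_pack(pack, max_bytes=8192):
    """Return a list of problems; an empty list means the pack looks usable."""
    problems = [f"missing key: {key}" for key in sorted(REQUIRED_KEYS - set(pack))]
    size = len(json.dumps(pack, ensure_ascii=False).encode("utf-8"))
    if size > max_bytes:
        problems.append(f"pack too large: {size} bytes > {max_bytes}")
    return problems


def attach_pack(prompt, pack):
    """Prepend the serialised pack so every model receives identical context."""
    body = json.dumps(pack, sort_keys=True, ensure_ascii=False)
    return f"[CONTEXT {pack['ctx_id']}]\n{body}\n[/CONTEXT]\n\n{prompt}"


# Illustrative pack contents (assumption), not real project data.
pack = {
    "ctx_id": "CTX-001",
    "source_manifest": [{"file": "example_notes.md", "sha256": "<hash of source file>"}],
    "core_principles": ["Conflict extraction must save BD time."],
    "known_failures": ["Judges converge too early on long prompts."],
    "current_baselines": {"plain_prompt": "baseline run id goes here"},
}

print(validate_pack(pack))
print(attach_pack("Which risks remain open?", pack)[:60])
```

Serialising with sorted keys keeps the attached text byte-identical across models, so differences in their answers cannot be blamed on differences in context.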